I am trying to modify a survey DB schema created by OOP (Object-Oriented Programming) snobs who, in my opinion, had their heads rather far up their asses.
This is the current schema they created.
Actually, several of the foreign keys shown in the picture don't really exist; that's just what they (the OOP snobs) are doing to the "Response" table. They insert question_collection_id and its name, as well as question_id and its name, into the Response table. It wasn't ill-intended, though: "We just wanted to load the report quickly" (the OOP snobs).
This is the schema I recommend instead, because of its simplicity.
Admittedly, I haven't proven that my schema will produce the report as quickly as the OOP snobs' version. As mentioned earlier, since they don't have the corresponding foreign key constraints, I find quite a number of discrepancies between the "question" columns of the "Question" and "Response" tables, and between the "question_collection" columns of the "question_collection" and "Response" tables. I could set up a scheduled task to eliminate those problems, but I still prefer my schema to theirs. So far, the only case I can make for mine is its simplicity and a DBA's gut instinct. I'd like to hear an expert's opinion on which schema is better and why.
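Incidentally, a scheduled task like that could be driven by a query that reports the drift. This is only a sketch; I'm assuming the denormalized name columns in "Response" are literally called question_name and question_collection_name, going by the description above:

Code:
```sql
-- Hypothetical cleanup query: find Response rows whose denormalized
-- question name no longer matches the master Question table.
-- (Column names are assumed, not taken from the actual schema.)
SELECT r.response_id,
       r.question_id,
       r.question_name AS name_in_response,
       q.question_name AS name_in_question
FROM response r
INNER JOIN question q
        ON q.question_id = r.question_id
WHERE r.question_name <> q.question_name
```

The same shape of query, swapping in question_collection, would catch the question_collection_name drift.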
Last edited by DBAStriker; 12-12-2004 at 07:10 PM.
I have to agree with your assessment of where their heads were!!! Their database structure is very bad. Yours is much better.
There is more to consider than just returning the report. Tables get locked during inserts and updates, so a bad design can hurt performance on that side as well.
If you post sample queries run for the reports, we could probably provide you some fuel for your argument.
Thanks for your input, RawHide
What I showed you earlier was still a simplified version of what the OOP snobs did. The actual thing is far more horrific.
The following is their original schema.
Object-oriented programming has gone wild in this. They wanted to reuse a "question" across "question_collection"s, and to reuse a "variable" across "question"s. I checked the records to see whether any items had ever actually been reused... Guess what? It never happened!! So with this design, if I want to retrieve survey data (question_collection_type contains values like "banner click count", "movie view progress stats", "survey"), the query looks like this:
Code:
SELECT question_collection.question_collection_name,
       question.question_name,
       customer.first_name, customer.last_name,
       response.response_value,
       response.responded_date
FROM question_collection
INNER JOIN question_collection_type
        ON question_collection.question_collection_type_id = question_collection_type.question_collection_type_id
INNER JOIN question_collection_question
        ON question_collection.question_collection_id = question_collection_question.question_collection_id
INNER JOIN question
        ON question_collection_question.question_id = question.question_id
INNER JOIN variable
        ON question.variable_id = variable.variable_id
INNER JOIN response
        ON response.variable_id = variable.variable_id
INNER JOIN customer
        ON response.customer_id = customer.customer_id
WHERE question_collection_type.question_collection_type = 'survey'
  AND question_collection.question_collection_id = 'xxxx-xxxx-xxxx-xxxxxx-xxxx'
No shit, the report wasn't loading quickly! We've got seven tables joined here.
So, those OOP snobs changed the "response" table to make the report load faster, which was shown in the previous post. But again, this is closer to the truth:
Code:
SELECT response.question_collection_name,
       response.question_name,
       customer.first_name, customer.last_name,
       response.response_value,
       response.responded_date
FROM response
INNER JOIN customer
        ON response.customer_id = customer.customer_id
INNER JOIN question_collection
        ON response.question_collection_id = question_collection.question_collection_id
INNER JOIN question_collection_type
        ON question_collection.question_collection_type_id = question_collection_type.question_collection_type_id
WHERE question_collection_type.question_collection_type = 'survey'
  AND question_collection.question_collection_id = 'xxxx-xxxx-xxxx-xxxxxx-xxxx'
But as we agreed, I hate having the "question_collection_id", "question_id", etc. double-whammy crap in "response". We are not reusing variables, questions, or question_collections anyway. I also discovered something interesting. Currently the "response" table holds 1,267,999 rows across 22,571 data pages. I copied that table exactly, then modified the copy to use my schema. It dropped by 5,992 data pages! Isn't that something? So far my main argument for convincing the OOP snobs to adopt my schema is:
"Response" is a fast-growing table; with the current schema it will quickly fill up data storage. If keeping data growth to a minimum matters, my schema is the way to go.
But I am still not so sure how to defend the report-load side. My schema is simpler, but it would still require some table joins. Do you think I should create additional tables just to store a denormalized version of the "response" table for reporting?
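For what it's worth, the data-page comparison above can be reproduced with sp_spaceused. This is just a sketch of the kind of measurement I ran; the DROP COLUMN list is illustrative, with the column names assumed from the earlier query:

Code:
```sql
-- Measure the original table. sp_spaceused reports sizes in KB;
-- divide the "data" figure by 8 to get data pages.
EXEC sp_spaceused 'response';

-- Copy the table, then reshape the copy to the proposed schema.
SELECT * INTO response_copy FROM response;
ALTER TABLE response_copy DROP COLUMN question_collection_name;
ALTER TABLE response_copy DROP COLUMN question_name;

-- DROP COLUMN alone doesn't reclaim the space; rebuild first.
DBCC DBREINDEX ('response_copy');

EXEC sp_spaceused 'response_copy';
```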
I prefer using integers as primary keys. I am not totally against uniqueidentifier, but the OOP snobs Looooooove it.
In their schema, all tables except "customer" and "question_collection_type" use uniqueidentifier as the primary key.
Also, why should searching a varchar field be avoided, even if it's indexed?
Also, I noticed a significant decrease in data pages after changing from the datetime datatype to smalldatetime. I think it is important to keep data growth as low as possible, especially for a table like "Response".
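The change itself is a one-liner; a sketch, assuming the column name from the report query:

Code:
```sql
-- datetime is 8 bytes; smalldatetime is 4, at the cost of losing
-- seconds precision and narrowing the range to 1900-01-01..2079-06-06.
ALTER TABLE response ALTER COLUMN responded_date smalldatetime;
```

For a survey response timestamp, minute precision and that date range are almost certainly enough.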
Is uniqueidentifier quite expensive for data storage? It certainly looks more expensive than integer...
Searching on any type of string field is very slow, even if it's indexed.
A uniqueidentifier uses 16 bytes of storage, whereas an int only uses 4. More important than the space it uses is that it is considerably slower than an int field (see below). So you have to ask them which matters more to them: that their IDs are unique in the whole world (something they probably think is "cool"), or performance.
According to BOL:
The main advantage of the uniqueidentifier data type is that the values generated by the Transact-SQL NEWID function or the application GUID functions are guaranteed to be unique throughout the world.
The uniqueidentifier data type has several disadvantages:
The values are long and obscure. This makes them difficult for users to type correctly, and more difficult for users to remember.
The values are random and cannot accept any patterns that may make them more meaningful to users.
There is no way to determine the sequence in which uniqueidentifier values were generated. They are not suited for existing applications that depend on incrementing key values serially.
At 16 bytes, the uniqueidentifier data type is relatively large compared to other data types such as 4-byte integers. This means indexes built using uniqueidentifier keys may be relatively slower than implementing the indexes using an int key.
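To make that concrete, here are the two keying styles side by side. This is a hypothetical table, not their actual DDL:

Code:
```sql
-- 4-byte surrogate key; values increment serially, so new rows
-- append to the end of a clustered index.
CREATE TABLE question_int (
    question_id   int IDENTITY(1,1) PRIMARY KEY,
    question_name varchar(255) NOT NULL
);

-- 16-byte globally unique key; NEWID() values are random, so a
-- clustered primary key here also fragments as rows are inserted.
CREATE TABLE question_guid (
    question_id   uniqueidentifier DEFAULT NEWID() PRIMARY KEY,
    question_name varchar(255) NOT NULL
);
```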
In the "response" table, I am thinking of using the sql_variant data type instead of nvarchar(4000) for the "response_value" column. The reason is that questions range from asking for a detailed description to single-select or multiple-choice questions with an integer value associated with them.
After my analysis, it turned out that sql_variant is slightly more expensive to store than nvarchar(4000), but query performance looked significantly more robust with the sql_variant data type.
Do you have any pro & con experience with the sql_variant data type?
Last edited by DBAStriker; 12-15-2004 at 04:48 PM.
Can't really say. I've never found a good reason to use sql_variant. I don't really see any real benefit from using it, but it has a lot of special rules that make using it a pain in the rear.
I created two identical tables. The only difference between them is that one stores the value as nvarchar(4000) and the other as sql_variant.
I then inserted 1,000,000 rows of integer data and 1,000,000 rows of text data into each table.
Then I tested the following queries 4 times each.
Code:
SELECT *
FROM test_1
WHERE data_value = '12345'

SELECT *
FROM test_2
WHERE data_value = 12345

SELECT *
FROM test_1
WHERE data_value = 'John'

SELECT *
FROM test_2
WHERE data_value = 'John'
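For context, the setup was along these lines; a sketch, with the table and column names assumed from the queries above:

Code:
```sql
-- test_1 stores everything as Unicode text; test_2 as sql_variant.
CREATE TABLE test_1 (data_value nvarchar(4000));
CREATE TABLE test_2 (data_value sql_variant);

-- Integer-valued rows: test_1 has to store them as text,
-- while test_2 keeps the underlying int type inside the variant.
INSERT INTO test_1 (data_value) VALUES (N'12345');
INSERT INTO test_2 (data_value) VALUES (12345);

-- Text-valued rows go in as strings in both tables.
INSERT INTO test_1 (data_value) VALUES (N'John');
INSERT INTO test_2 (data_value) VALUES (N'John');
```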
Tell me this: what results do you get when you run the following queries against the tables? Not performance results, but the actual response from the query?
SELECT count(*)
FROM test_1
WHERE data_value > 'John'

SELECT count(*)
FROM test_2
WHERE data_value > 'John'
You see a slight discrepancy there, don't you???
The rules for how the numeric data is handled have suddenly been reversed: in the nvarchar table, '12345' is compared as a string and sorts before 'John', but in the sql_variant table exact numeric types rank above Unicode strings in the sql_variant data type hierarchy, so every int value compares greater than 'John'.