dcsimg
Results 1 to 10 of 10

Thread: database design question - a lot of fields or a lot of tables?

  1. #1
    Join Date
    Jun 2015
    Posts
    5

    database design question - a lot of fields or a lot of tables?

    Hello. I'm looking for help with a database design problem...

    I have 1300 data items, 3300 company tickers, and dates (making this time series data).

    The data items grow by a few a month. The company tickers grow by one or two per week. And dates that grow by one per day.

    The data items can be broken up into logical groups given how they are used in our quant model. There are 10 of these groups.

    We query this data in many ways. These are the most popular I think:

    Single Stock - 50 or so data items for a given ticker
    Subset of Stocks - 5 to 100 data items for a subset of about 3 to 2000 tickers
    Single Stock Time Series - a few years of historical daily data on 3 to 10 data items at a time for a single stock

    In your opinion, how should I design this so it optimizes my queries?

    I'm thinking of having one table per ticker per data item group: 3300 tickers x 10 data item groups = 33,000 tables Which would leave roughly 50 to 200 fields per table (depending on the group) like this

    tblTicker1-DataItemGrp1

    table records
    date1
    date2
    date3
    ...
    dateN

    table fields
    dataItem1
    dataItem2
    dataItem3
    ...
    dataItem1300

    Up to: tblTicker3300-DataItemGrp10

    (I am using numbers and generic names instead of more informative names for simplicity.)

    However 33,000 tables sounds like a lot!! However the alternative is to have a ton of fields, which I am told is super inefficient. Or should I just go with super wide tables with 1300 fields or more, but less tables? What do you think?

    Thanks a ton! Happy to get any feedback even if it is a quick answer/suggestion.

  2. #2
    Join Date
    Nov 2002
    Location
    New Jersey, USA
    Posts
    3,907
    If I understand it correctly you have 1300 data elements and you have data for 3300 companies. The basic of data modeling is to identify the entities and attributes corresponding to that entity. The data element you talked about are attributes of entities. Each entity translates to one table with attributes as columns. You then need to to see if there are any dependencies between attributes and break them into separate tables to meet Normal Form requirements. The goal is to make your design meet at least Third Normal Form.

    Yes, 33000 tables sound too many, even commercial ERP systems do not have that many tables. If you can get your design to Third Normal form, you should be in a good shape.

  3. #3
    Join Date
    Jun 2015
    Posts
    5
    Quote Originally Posted by skhanal View Post
    If I understand it correctly you have 1300 data elements and you have data for 3300 companies. The basic of data modeling is to identify the entities and attributes corresponding to that entity. The data element you talked about are attributes of entities. Each entity translates to one table with attributes as columns. You then need to to see if there are any dependencies between attributes and break them into separate tables to meet Normal Form requirements. The goal is to make your design meet at least Third Normal Form.

    Yes, 33000 tables sound too many, even commercial ERP systems do not have that many tables. If you can get your design to Third Normal form, you should be in a good shape.
    Yes, 3NF is where it is at; however, I think I am there.

    Here is small sample to company/entity example:

    tblIBM-Value <-company/entity specific "IBM" and group specific "Value" group table
    One row per date (daily)
    One column per entity attributes:
    Price/Book <- a value for IBM
    Price/Earnings <- a value for IBM
    GICS Sector < - a foreign key to a GICS Sector name table (which is not counted in my 33,000 table count)
    ...etc. to the N'th data item/column for the Value group

    tblGE-Value <-company specific "GE" and group specific "Value" group table
    One row per date (daily)
    One column per entity attributes:
    Price/Book <- a value for GE
    Price/Earnings <- a value for GE
    GICS Sector < - a foreign key to a GICS Sector name table (which is not counted in my 33,000 table count)
    ...etc. to the N'th data item/column for the Value group

    tblIBM-Growth <-company/entity specific "IBM" and group specific "Growth" group table
    One row per date (daily)
    One column per entity attributes:
    Price/Book <- a value for IBM
    Price/Earnings <- a value for IBM
    GICS Sector < - a foreign key to a GICS Sector name table (which is not counted in my 33,000 table count)
    ...etc. to the N'th data item/column for the Growth group

    tblGE-Growth <-company/entity specific "GE" and group specific "Growth" group table
    One row per date (daily)
    One column per entity attributes:
    Earnings Growth YoY <- a value for GE
    Revenue Growth YoY <- a value for GE
    GICS Sector < - a foreign key to a GICS Sector name table (which is not counted in my 33,000 table count)
    ...etc. to the N'th data item/column for the Growth group

    ...and so on for the 8 remaining groups discussed originally. The Value and Growth groups are above, leaving 8.

    That seems normalized to me. Do you agree?

  4. #4
    Join Date
    Nov 2002
    Location
    New Jersey, USA
    Posts
    3,907
    I would combine all Value tables into one by adding a Company as a column. Same with all Growth data into one table with Company as a column.

  5. #5
    Join Date
    May 2006
    Posts
    407
    Quote Originally Posted by skhanal View Post
    I would combine all Value tables into one by adding a Company as a column. Same with all Growth data into one table with Company as a column.
    As soon as I saw table names like IBM, GE, etc. the first thing I thought of was "3rd Normal Form"? I don't think so. Yes, Company NEEDS to be a field name in one table. I'm also wondering how much duplication of data there will be between the growth table and the Value table. I'm going to assume a lot of duplicate data.

    I would highly recommend you find someone that truly knows how to design a relational database and pay them a little bit of money to help you design this system. It will be worth every penny you pay them if they get it right. Therefore, make sure they know database design before you hire them.

  6. #6
    Join Date
    Jun 2015
    Posts
    5
    GolferGuy - what would be duplicated in the Value and Growth tables? I can't think of a single thing that would be duplicated. Wait, now that I am looking at my last post I see where you may have been lead astray. This is what I should put as my sample:

    tblIBM-Value <-company/entity specific "IBM" and group specific "Value" group table
    One row per date (daily)
    One column per entity attributes:
    Price/Book <- a value for IBM
    Price/Earnings <- a value for IBM
    GICS Sector < - ID to a GICS Sector name table (which is not counted in my 33,000 table count)
    ...etc. to the N'th data item/column for the Value group

    tblGE-Value <-company specific "GE" and group specific "Value" group table
    One row per date (daily)
    One column per entity attributes:
    Price/Book <- a value for GE
    Price/Earnings <- a value for GE
    GICS Sector < - ID to a GICS Sector name table (which is not counted in my 33,000 table count)
    ...etc. to the N'th data item/column for the Value group

    tblIBM-Growth <-company/entity specific "IBM" and group specific "Growth" group table
    One row per date (daily)
    One column per entity attributes:
    Earnings Growth YoY <- a value for IBM
    Revenue Growth YoY <- a value for IBM
    GICS Sector < - ID to a GICS Sector name table (which is not counted in my 33,000 table count)
    ...etc. to the N'th data item/column for the Growth group

    tblGE-Growth <-company/entity specific "GE" and group specific "Growth" group table
    One row per date (daily)
    One column per entity attributes:
    Earnings Growth YoY <- a value for GE
    Revenue Growth YoY <- a value for GE
    GICS Sector < - ID to a GICS Sector name table (which is not counted in my 33,000 table count)
    ...etc. to the N'th data item/column for the Growth group


    GolferGuy & skhanal - Are you suggesting putting all of the data from all tickers in one table per group (Value, Growth, etc.)? If so I would have 3300 new rows/records every day and that table would also be 1300 fields wide. Remember this is a daily time series so for every ticker I have one data point to store every day for every data item. That would make for some massive tables.

    skhanal - I have a few colleagues advising. So far the feedback I am getting is that I just have to find ways to size things down or pony up for a huge operation/database. Size down is the only option right now. Still considering what to do exactly....
    Last edited by mountainclimber; 06-04-2015 at 09:03 PM.

  7. #7
    Join Date
    May 2006
    Posts
    407
    MountainClimber-
    You continue to give the same examples of what fields you would have in this massive table for IBM and matching table for GE. One table for "Value" and one table for "Growth" but I am still at a complete loss what in the world are all those other, yet un-named fields in the value and growth tables. If I was an investment advisor I guess I might have a clue, but I'm not in the investment field, nor have I ever dealt with the investment business. Therefore my question, what are these other fields/columns that are needed in these two separate tables? You have this statement 4 times in each of your two examples "...etc. to the N'th data item/column for the Value group" By the way, nothing jumps out at me that is different in your second example (your last posting) from your previous example. And the way you stated it in your last posting I was expecting something new. OK, the bolding was new. I really don't get what the bolding was supposed to do. I saw all those same words the first time. What I'm looking for is what are the data elements that will take up all those columns you are talking about.

  8. #8
    Join Date
    Jun 2015
    Posts
    5
    GolferGuy - Look closely at the first post and the second where I have bold text and you will see a difference the IBM Growth table. I only duplicated this for clarity/correction, not for emphasis.

    What is the data? There are 1300 of them, so just about every type of number you can think of. For example in growth:
    growth of revenue yoy, qoq, yoy 12 month trail, qoq sequential, etc.
    growth of earnings yoy, qoq, yoy 12 month trail, qoq sequential, etc.
    growth of oper. cash flow yoy, qoq, yoy 12 month trail, qoq sequential, etc.
    growth of free cash flow yoy, qoq, yoy 12 month trail, qoq sequential, etc.
    ...and on and on and on...just for growth.

    As you can see it is not really important to the question that I have. They are just numbers of calculations. And for the few fields that have duplication, I know what to do with them.

    The real issue is the dates (inf.) x tickers (3300) x data items (1300) size and how to best handle that for the fastest queries.

    I don't need two separate tables for Growth and Value. I just thought that might be a good way to split up the high number of data fields. But after researching it a bit more, it is not.

  9. #9
    Join Date
    May 2006
    Posts
    407
    BTW, when someone says they have no understanding of the field, yoy, qoq, yoy 12 month trail, qoq sequential are what the data is, but in my experience, understand is very important and the yoy, qoq, etc. do not add to understanding for someone that does not know the field. At least Google knows and will explain it.
    Now, to your problem, and this will be the last I offer. Looks like you want to store the results of a ton of calculations. I'm assuming that if your stored the raw numbers the calculations could then be done rather than storing the answers. But then again, I will also assume that doing the calculations to come up with the 1300 results, per day, would take a long time. Overall, the decision will have to be made somewhere as to how much all of these answers are truly worth. How much analysis can be done with 1300 daily data points, and after all that analysis, can someone really make that much better decision than only using 200, 300 data points? From what I have seen in all sorts of industry, it is the people that can make the analysis from a rational number of data points that actually "get ahead" in the world. Those are the people that can think and understand and "feel" where things are going. The people that need to analyze "everything" seem to never get anywhere except more analysis. What I was trying to find out, and I think I did find out, was why so much data.
    Good luck getting what you need. Remember, moderation in all things.

  10. #10
    Join Date
    Jun 2015
    Posts
    5
    Quote Originally Posted by GolferGuy View Post
    BTW, when someone says they have no understanding of the field, yoy, qoq, yoy 12 month trail, qoq sequential are what the data is, but in my experience, understand is very important and the yoy, qoq, etc. do not add to understanding for someone that does not know the field. At least Google knows and will explain it.
    Now, to your problem, and this will be the last I offer. Looks like you want to store the results of a ton of calculations. I'm assuming that if your stored the raw numbers the calculations could then be done rather than storing the answers. But then again, I will also assume that doing the calculations to come up with the 1300 results, per day, would take a long time. Overall, the decision will have to be made somewhere as to how much all of these answers are truly worth. How much analysis can be done with 1300 daily data points, and after all that analysis, can someone really make that much better decision than only using 200, 300 data points? From what I have seen in all sorts of industry, it is the people that can make the analysis from a rational number of data points that actually "get ahead" in the world. Those are the people that can think and understand and "feel" where things are going. The people that need to analyze "everything" seem to never get anywhere except more analysis. What I was trying to find out, and I think I did find out, was why so much data.
    Good luck getting what you need. Remember, moderation in all things.
    Our strategy has been a top 10 or 15 global macro strategy (depending on how you evaluate it) for 14 years. We believe "the answer" over the long-term changes frequently therefore it is often more complex than is comfortable for the human brain. You have to "know" more and adapt quickly to win. We leverage quant programming (me) and common sense to stay on top and in front of that. Most strategies can't see the forest for the trees. And sometimes you even have to change forests all together. Our systems need the breadth of data to make this possible. I will tell you that "the answer" in the short-term is often simple, but you have to change frequently to "stay right". Simple is comfortable and "right" at any point in time, but wrong soon. Everything in moderation -- including moderation. - Harvey Steiman


    Thank you for trying to help!



    -mountainclimber

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •