
Thread: Joins and Data Warehouse


    Joins and Data Warehouse

    {This message also posted in the Oracle forum}

    To follow the discussion, it's probably easier if you understand sports.

In baseball, every event is tracked. In addition to the events, you have a configuration (the identities of the players, the park, etc.). If this were simply one game, I would create the tables in 3NF, do the joins, and be on my way. However, I'm going to end up with a few million records, with the main key being the gameid (of which there'd be a few thousand).
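    To make it concrete, the 3NF layout I have in mind is roughly the following (the table and column names are just placeholders for illustration, not my actual schema):

```sql
-- One row per game: the "configuration" side (park, date, etc.)
CREATE TABLE game (
    gameid     NUMBER PRIMARY KEY,
    game_date  DATE,
    parkid     NUMBER
);

-- One row per event within a game
CREATE TABLE game_event (
    gameid      NUMBER REFERENCES game (gameid),
    eventid     NUMBER,               -- sequence of the event within the game
    event_type  VARCHAR2(20),
    PRIMARY KEY (gameid, eventid)
);

-- One row per player on the field for a given event
CREATE TABLE event_player (
    gameid    NUMBER,
    eventid   NUMBER,
    playerid  NUMBER,
    teamid    NUMBER,
    PRIMARY KEY (gameid, eventid, playerid),
    FOREIGN KEY (gameid, eventid) REFERENCES game_event (gameid, eventid)
);
```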

    Suppose I'm interested in knowing who each player plays with at each event. (In a game, there are about 10-15 players who play for each team, so you are talking about getting a result back of games x events x 25 squared.)
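    The "who plays with whom" result would come from a self-join along these lines (again, just a sketch using the placeholder names above):

```sql
-- Pair every player on the field with every player on the field for the
-- same event: roughly 25 x 25 rows per event if self-pairs are kept.
SELECT e1.gameid,
       e1.eventid,
       e1.playerid AS playerid,
       e2.playerid AS with_playerid
FROM   event_player e1
       JOIN event_player e2
         ON  e2.gameid  = e1.gameid
         AND e2.eventid = e1.eventid;
```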

    My issue: since I process each game individually, would it make more sense to do the join at that level? That is, my choice is between:
    1 - read in 10,000 games, with 300 events each (= 3 million rows)
    2 - load them into 3NF tables
    3 - do the massive join (resulting in 10,000 x 300 x 625 = roughly 2 billion rows)
    OR
    1 - read in 1 game, with its 300 events, into 3NF tables
    2 - do the massive join (resulting in roughly 200,000 rows per game)
    3 - load the result into the redundant tables, clear out the 3NF tables, and go back to step 1 (doing this 10,000 times, for a total of roughly 2 billion rows; see the sketch below)
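    For what it's worth, option 2 would look roughly like this in PL/SQL (the staging tables and the game_list driver table are made up for the sketch); the point is that the self-join only ever sees one game's worth of rows at a time:

```sql
-- Sketch of option 2: stage one game, join it, move the result into the
-- big redundant table, clear staging, repeat.
BEGIN
  FOR g IN (SELECT gameid FROM game_list ORDER BY gameid) LOOP

    -- 1. load game g.gameid's ~300 events into the 3NF staging tables
    --    (SQL*Loader / external table / whatever the source turns out to be)

    -- 2. self-join just this game (~200,000 rows) into the redundant table;
    --    with only one game staged, the gameid predicate is trivially true
    INSERT /*+ APPEND */ INTO event_player_pairs
    SELECT e1.gameid, e1.eventid, e1.playerid, e2.playerid
    FROM   event_player_stage e1
           JOIN event_player_stage e2
             ON  e2.gameid  = e1.gameid
             AND e2.eventid = e1.eventid;

    -- 3. clear staging for the next game
    DELETE FROM event_player_stage;
    COMMIT;
  END LOOP;
END;
/
```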


    The important thing to remember is that the minimum join key is the gameid. If I've got millions of records, but I know off the bat that almost all won't satisfy the self-join condition
    (e1.gameid = e2.gameid), wouldn't it make more sense to do the self-join upfront, before loading the data?

    Thanks, Tom
