
Thread: Joins and Data Warehouse


    Joins and Data Warehouse

    {This message also posted in the Oracle forum}

    To follow the discussion, it's probably easier if you understand sports.

In baseball, every event is tracked. In addition to the events, you have a configuration (the identities of the players, the park, etc.). If this were simply one game, I would create the tables in 3NF, do the joins, and be on my way. However, I'm going to end up with a few million records, with the main key being the gameid (of which there'd be a few thousand).
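    To make it concrete, the 3NF layout I have in mind is roughly the following (the table and column names are just placeholders for illustration, not my actual schema):

```sql
-- One row per game: the "configuration" side (park, date, etc.)
CREATE TABLE game (
    gameid     NUMBER PRIMARY KEY,
    game_date  DATE,
    parkid     NUMBER
);

-- One row per event within a game
CREATE TABLE game_event (
    gameid      NUMBER REFERENCES game (gameid),
    eventid     NUMBER,               -- sequence of the event within the game
    event_type  VARCHAR2(20),
    PRIMARY KEY (gameid, eventid)
);

-- One row per player on the field for a given event
CREATE TABLE event_player (
    gameid    NUMBER,
    eventid   NUMBER,
    playerid  NUMBER,
    teamid    NUMBER,
    PRIMARY KEY (gameid, eventid, playerid),
    FOREIGN KEY (gameid, eventid) REFERENCES game_event (gameid, eventid)
);
```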

    Suppose I'm interested in knowing who each player plays with at each event. (In a game, there are about 10-15 players who play for each team, so you are talking about getting a result back of games x events x 25 squared.)
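    The "who plays with whom" result would come from a self-join along these lines (again, just a sketch using the placeholder names above):

```sql
-- Pair every player on the field with every player on the field for the
-- same event: roughly 25 x 25 rows per event if self-pairs are kept.
SELECT e1.gameid,
       e1.eventid,
       e1.playerid AS playerid,
       e2.playerid AS with_playerid
FROM   event_player e1
       JOIN event_player e2
         ON  e2.gameid  = e1.gameid
         AND e2.eventid = e1.eventid;
```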

    My issue: since I process each game individually, would it make more sense to do the join at that level? That is, my choice is between:
    1 - read in 10,000 games, with 300 events each (= 3 million rows)
    2 - load them into 3NF tables
    3 - do the massive join (resulting in 10,000 x 300 x 625 = roughly 2 billion rows)
    OR
    1 - read in 1 game, with its 300 events, into 3NF tables
    2 - do the massive join (resulting in roughly 200,000 rows per game)
    3 - load the result into the redundant tables, clear out the 3NF tables, and go back to step 1 (doing this 10,000 times, for a total of roughly 2 billion rows; see the sketch below)
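    For what it's worth, option 2 would look roughly like this in PL/SQL (the staging tables and the game_list driver table are made up for the sketch); the point is that the self-join only ever sees one game's worth of rows at a time:

```sql
-- Sketch of option 2: stage one game, join it, move the result into the
-- big redundant table, clear staging, repeat.
BEGIN
  FOR g IN (SELECT gameid FROM game_list ORDER BY gameid) LOOP

    -- 1. load game g.gameid's ~300 events into the 3NF staging tables
    --    (SQL*Loader / external table / whatever the source turns out to be)

    -- 2. self-join just this game (~200,000 rows) into the redundant table;
    --    with only one game staged, the gameid predicate is trivially true
    INSERT /*+ APPEND */ INTO event_player_pairs
    SELECT e1.gameid, e1.eventid, e1.playerid, e2.playerid
    FROM   event_player_stage e1
           JOIN event_player_stage e2
             ON  e2.gameid  = e1.gameid
             AND e2.eventid = e1.eventid;

    -- 3. clear staging for the next game
    DELETE FROM event_player_stage;
    COMMIT;
  END LOOP;
END;
/
```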


    The important thing to remember is that the minimum join key is the gameid. If I've got millions of records, but I know off the bat that almost all won't satisfy the self-join condition
    (e1.gameid = e2.gameid), wouldn't it make more sense to do the self-join upfront, before loading the data?

    Thanks, Tom
