Keys Only DB?

Printable View

05-02-2011, 05:15 PM
MilesAhead

Keys Only DB?

I have only programmed access to db engines on occasion.
But someone asked a question that sounded like a "worst case
scenario" for a db engine. Namely the person has a word list,
like a spelling dictionary. Due to lack of maintanence it
has grown large with many duplicates. Since usually one thinks
of looking up a chunk of data by the key, it would seem that a
"keys-only database" would be slow and unwieldy. Have you run
across a similar application? The dictionary is a flat text
file with a word per line. The size has grown many times that
of physical ram on the machine. Therefore attempting to sort
as a means to remove duplicates would take days if not weeks of
merge sort.

Do you have any suggestions or approaches? I assume the person is running Windows XP. I suggested a Windows port of the Linux sort commands. Started off fine but after 6 or 7 days and still churning the user gave up and killed it. :)
05-02-2011, 07:34 PM
skhanal

Can you try to run sort on a chunk of data instead of trying to do it over the complete set?
05-03-2011, 01:09 AM
MilesAhead

I appreciate the idea. Mostly I'm posting because I'm convinced at some point somebody must have done a study in this worst case that gives an idea of the overhead. Say if one added all the keys to a db if it would be 2x or 3x or 4x the size of the original raw data etc..

I mean Codd did this stuff decades ago. Somebody must have at least some paramaters how inefficient it would be. Everyone thinks add it to a db and it will automatically be fast and efficient. But there must be some criteria where a db is actually worse than just writing a hack.
05-04-2011, 12:25 AM
Gil-Galad

Petenltially use hashing to create subsets, and look at these seperately? As duplicates would have the same hash you would be guaranteed that they would be in the same partition.