-
Keys Only DB?
I have only programmed access to db engines on occasion.
But someone asked a question that sounded like a "worst case
scenario" for a db engine. Namely the person has a word list,
like a spelling dictionary. Due to lack of maintanence it
has grown large with many duplicates. Since usually one thinks
of looking up a chunk of data by the key, it would seem that a
"keys-only database" would be slow and unwieldy. Have you run
across a similar application? The dictionary is a flat text
file with a word per line. The size has grown many times that
of physical ram on the machine. Therefore attempting to sort
as a means to remove duplicates would take days if not weeks of
merge sort.
Do you have any suggestions or approaches? I assume the person is running Windows XP. I suggested a Windows port of the Linux sort commands. Started off fine but after 6 or 7 days and still churning the user gave up and killed it. :)
-
Can you try to run sort on a chunk of data instead of trying to do it over the complete set?
-
I appreciate the idea. Mostly I'm posting because I'm convinced at some point somebody must have done a study in this worst case that gives an idea of the overhead. Say if one added all the keys to a db if it would be 2x or 3x or 4x the size of the original raw data etc..
I mean Codd did this stuff decades ago. Somebody must have at least some paramaters how inefficient it would be. Everyone thinks add it to a db and it will automatically be fast and efficient. But there must be some criteria where a db is actually worse than just writing a hack.
-
Petenltially use hashing to create subsets, and look at these seperately? As duplicates would have the same hash you would be guaranteed that they would be in the same partition.