Results 1 to 4 of 4

Thread: Keys Only DB?

  1. #1
    Join Date
    May 2011
    Posts
    2

    Keys Only DB?

    I have only programmed access to db engines on occasion.
    But someone asked a question that sounded like a "worst case
    scenario" for a db engine. Namely the person has a word list,
    like a spelling dictionary. Due to lack of maintanence it
    has grown large with many duplicates. Since usually one thinks
    of looking up a chunk of data by the key, it would seem that a
    "keys-only database" would be slow and unwieldy. Have you run
    across a similar application? The dictionary is a flat text
    file with a word per line. The size has grown many times that
    of physical ram on the machine. Therefore attempting to sort
    as a means to remove duplicates would take days if not weeks of
    merge sort.

    Do you have any suggestions or approaches? I assume the person is running Windows XP. I suggested a Windows port of the Linux sort commands. Started off fine but after 6 or 7 days and still churning the user gave up and killed it.

  2. #2
    Join Date
    Nov 2002
    Location
    New Jersey, USA
    Posts
    3,932
    Can you try to run sort on a chunk of data instead of trying to do it over the complete set?

  3. #3
    Join Date
    May 2011
    Posts
    2
    I appreciate the idea. Mostly I'm posting because I'm convinced at some point somebody must have done a study in this worst case that gives an idea of the overhead. Say if one added all the keys to a db if it would be 2x or 3x or 4x the size of the original raw data etc..

    I mean Codd did this stuff decades ago. Somebody must have at least some paramaters how inefficient it would be. Everyone thinks add it to a db and it will automatically be fast and efficient. But there must be some criteria where a db is actually worse than just writing a hack.

  4. #4
    Join Date
    Feb 2011
    Location
    Melbourne, Australia
    Posts
    13
    Petenltially use hashing to create subsets, and look at these seperately? As duplicates would have the same hash you would be guaranteed that they would be in the same partition.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •