Thread: Improvement to article: Deleting Duplicate Rows in a MySQL Database

  1. #1
    Join Date
    Dec 2002
    Location
    Cape Town, South Africa
    Posts
    75

    Improvement to article: Deleting Duplicate Rows in a MySQL Database

    A reader has just pointed out an improvement to my most recent article

    On page 2, you can of course use:

    CREATE TEMPORARY TABLE bad_temp AS SELECT DISTINCT * FROM bad_table

    instead of:

    CREATE TEMPORARY TABLE
    bad_temp(id INT,name VARCHAR(20))
    TYPE=HEAP;

    INSERT INTO bad_temp(id,name) SELECT DISTINCT id,name FROM bad_table;

    saving yourself the second step.

    Thanks to P-A Fredriksson. Hopefully the changes will reflect on the site soon.
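
    The one-step version can be tried outside MySQL as well; here is a minimal sketch using Python's sqlite3 module as a stand-in server (the sample rows are invented for illustration), since SQLite accepts the same CREATE TEMPORARY TABLE ... AS SELECT DISTINCT form:

    ```python
    import sqlite3

    # In-memory database standing in for the MySQL server in the article.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE bad_table (id INT, name VARCHAR(20))")
    cur.executemany(
        "INSERT INTO bad_table (id, name) VALUES (?, ?)",
        [(1, "Things Fall Apart"),
         (1, "Things Fall Apart"),   # exact duplicate row
         (2, "The Famished Road")],
    )

    # One-step create-and-fill, as suggested: no separate INSERT needed.
    cur.execute("CREATE TEMPORARY TABLE bad_temp AS SELECT DISTINCT * FROM bad_table")
    cur.execute("SELECT COUNT(*) FROM bad_temp")
    print(cur.fetchone()[0])  # 2 -- the duplicate row is gone
    ```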

  2. #2
    Join Date
    Feb 2003
    Location
    Johannesburg, South Africa
    Posts
    145

    Perl Option - no Temp Tables

    You all probably read the article, especially:

    Be careful when using temporary tables in this way. With our tiny data set, there is not much of a problem, but huge tables could cause problems with memory. etc...
    BUT creating a disk-based (non-HEAP) temporary table on LARGE data sets will just move your memory problem from RAM to DISK SPACE. There is also additional disk I/O, which you might not want on an operational system.

    There is another option, however: do it programmatically. I didn't have time to write the whole Perl script, but here is the short version:

    STEP 1 : Determine which fields are duplicated. This can be done with a simple:

    Code:
    mysql> select distinct name,count( name ) from bad_table2 group by 1 order by 2 desc;
    +-------------------+---------------+
    | name              | count( name ) |
    +-------------------+---------------+
    | Things Fall Apart |             3 |
    | The Famished Road |             2 |
    | Thirteen cents    |             2 |
    +-------------------+---------------+
    3 rows in set (0.00 sec)
    We only need to process entries with a count of MORE than 1.

    STEP 2 : Loop through the results and delete the duplicates. The loop will produce something similar to:

    Code:
    mysql> DELETE FROM bad_table2 WHERE name = 'Things Fall Apart' LIMIT 2;
    Query OK, 2 rows affected (0.00 sec)
    
    mysql> select distinct name,count( name ) from bad_table2 group by 1 order by 2 desc;
    +-------------------+---------------+
    | name              | count( name ) |
    +-------------------+---------------+
    | The Famished Road |             2 |
    | Thirteen cents    |             2 |
    | Things Fall Apart |             1 |
    +-------------------+---------------+
    3 rows in set (0.00 sec)
    
    mysql> select * from bad_table2 where name = 'Things Fall Apart';
    +----+-------------------+
    | id | name              |
    +----+-------------------+
    |  4 | Things Fall Apart |
    +----+-------------------+
    1 row in set (0.01 sec)
    NOTES :

    1) NO additional memory was required.

    2) By adding LOW_PRIORITY to the script, the impact on a LIVE DB will be less.

    3) Only the "newest" record is retained. Some tinkering might be necessary if you want to retain only the "oldest" - for whatever reason.

    Now, this method is tried and tested. One of my operational tables for IP accounting contains nearly 4 GB of data. Creating TEMP tables to delete duplicates doesn't cut it.

    Of course, smaller and even non-operational environments are different stories altogether.
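
    The two steps above can be sketched in Python rather than Perl (a rough sketch only: sqlite3 stands in for MySQL, the sample rows are invented, and because SQLite's DELETE lacks MySQL's LIMIT clause, duplicates are removed by id instead, keeping the newest row as in note 3):

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE bad_table2 (id INTEGER PRIMARY KEY, name TEXT)")
    cur.executemany("INSERT INTO bad_table2 (name) VALUES (?)", [
        ("Things Fall Apart",), ("Things Fall Apart",), ("Things Fall Apart",),
        ("The Famished Road",), ("The Famished Road",),
        ("Thirteen cents",), ("Thirteen cents",),
    ])

    # STEP 1: find names that occur more than once.
    cur.execute("SELECT name, COUNT(name) FROM bad_table2 "
                "GROUP BY name HAVING COUNT(name) > 1")
    duplicates = cur.fetchall()

    # STEP 2: for each duplicated name, delete all but one row.
    # DELETE ... LIMIT n is a MySQL extension, so this sketch keeps
    # the row with the highest id ("newest") instead.
    for name, count in duplicates:
        cur.execute(
            "DELETE FROM bad_table2 WHERE name = ? "
            "AND id < (SELECT MAX(id) FROM bad_table2 WHERE name = ?)",
            (name, name),
        )

    cur.execute("SELECT COUNT(*) FROM bad_table2")
    print(cur.fetchone()[0])  # 3 -- one row left per title
    ```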

    Cheers
    Last edited by nicc777; 05-08-2003 at 09:22 AM.

  3. #3
    Join Date
    May 2003
    Posts
    4
    Hi there,

    Found your example while doing a search on Google, but I am not having much luck and need a small hint on what I am not doing right...

    I have an image gallery that has several duplicate records (with a unique ID / primary key) so was using your "Duplicates with Unique Keys" example...

    I want to be able to only copy across records where the "bigimage" value is unique.

    The query I was trying (and I used HEAPS of different combinations) is:

    INSERT INTO photos_unique(user,userid,cat,date,title,description,keywords,bigimage,width,height,filesize,views,medwidth,medheight,medsize,approved,lastpost,rating,watermarked,allowprint,extra1,extra2,extra3,extra4,extra5,extra6,disporder)
    SELECT DISTINCT bigimage user,userid,cat,date,title,description,keywords,bigimage,width,height,filesize,views,medwidth,medheight,medsize,approved,lastpost,rating,watermarked,allowprint,extra1,extra2,extra3,extra4,extra5,extra6,disporder FROM photos

    I've added commas, brackets, moved DISTINCT all over the place, but keep getting errors. The above query works BUT returns all records... Not what I wanted...

    Is there something dumb I am doing or an easier way ?

    Angus

  4. #4
    Join Date
    Feb 2003
    Location
    Johannesburg, South Africa
    Posts
    145
    I assume you noticed that there is a syntax error in the SQL you submitted:

    SELECT DISTINCT bigimage user,userid should read: SELECT DISTINCT bigimage,user,userid ( notice the comma between bigimage and user )

    With that out of the way, I see you mention that the SELECT query returns all records. I would like you to do an experiment:

    Code:
    SELECT 
       DISTINCT bigimage, COUNT( bigimage ) AS qty 
    FROM 
       photos 
    GROUP BY 1 HAVING qty > 1
    The above code is untested, but should work. The idea is to display only those rows where 'qty' is more than 1 ( duplicates ). If nothing is returned, then there are no duplicates.

    Let us know what the results are.

    Cheers

  5. #5
    Join Date
    May 2003
    Posts
    4
    >> I assume you noticed that there is a syntax error in the SQL you submitted..

    Again - I tried it MANY MANY ways. With and without the comma...

    Code:
    SELECT 
       DISTINCT bigimage, COUNT( bigimage ) AS qty 
    FROM 
       photos 
    GROUP BY 1 HAVING qty > 1
    About 20 results returned...

    If I run

    SELECT * FROM `photos` WHERE 1 I get 16987 results...

    If I run

    SELECT DISTINCT bigimage FROM photos I get 15819 results...

  6. #6
    Join Date
    Feb 2003
    Location
    Johannesburg, South Africa
    Posts
    145
    Ok, and what happens if you do:

    Code:
    CREATE TEMPORARY TABLE bad_temp AS SELECT DISTINCT * FROM photos
    How many entries in the 'bad_temp' table?

    Cheers

  7. #7
    Join Date
    May 2003
    Posts
    4
    Wasn't sure how to run what you suggested - i.e. couldn't find the table bad_temp...

    Ran

    CREATE TABLE bad_temp AS SELECT DISTINCT * FROM photos

    and got the 16,978 number... this would be correct though, wouldn't it, as the table has an id value which is unique for each row (an auto-incrementing value).

    Appreciate the attempt at helping so far though

  8. #8
    Join Date
    Feb 2003
    Location
    Johannesburg, South Africa
    Posts
    145
    That was the output I expected. Now, the longer way...

    Code:
    CREATE TABLE bad_temp AS SELECT 
      DISTINCT     
       bigimage,user,userid,cat,date,
       title,description,keywords,width,
       height,filesize,views,medwidth,
       medheight,medsize,approved,
       lastpost,rating,watermarked,
       allowprint,extra1,extra2,extra3,
       extra4,extra5,extra6,disporder 
    FROM 
       photos
    You might need to drop the previously created temp table first.

    Cheers

  9. #9
    Join Date
    May 2003
    Posts
    4
    16978 records still....

  10. #10
    Join Date
    Dec 2002
    Location
    Cape Town, South Africa
    Posts
    75
    Your problem seems to be that:

    SELECT
    DISTINCT
    bigimage,user,userid,cat,date,
    title,description,keywords,width,
    height,filesize,views,medwidth,
    medheight,medsize,approved,
    lastpost,rating,watermarked,
    allowprint,extra1,extra2,extra3,
    extra4,extra5,extra6,disporder
    FROM
    photos

    does not return the correct 15819 results. This can be due to:
    1) A 'bug' in MySQL
    2) That combination actually does contain 16978 unique combinations.

    You obviously have 15819 distinct bigimages, but that does not necessarily mean the entire row is unique. For example:
    image1,3
    image1,4
    Only one distinct 'image1', but 2 distinct rows. So double-check your data before concluding it's a 'bug'.
    If you're sure it's MySQL misbehaving, try creating an index (not unique) on some fields. Perhaps MySQL is not returning the correct data because its table metadata is not correct. You don't give the full table structure, or say whether there are other indexes, but try ANALYZE/OPTIMIZE on the table to make sure it's in shape.
    If you're still having trouble, post your full table structure, and an example of a duplicate returned by your DISTINCT statement.
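
    The image1 example above can be checked directly; here is a minimal sqlite3 sketch (table and column names simplified from the thread, rows invented) showing how DISTINCT over the whole row keeps more rows than DISTINCT over bigimage alone:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE photos (id INTEGER, bigimage TEXT)")
    cur.executemany("INSERT INTO photos VALUES (?, ?)", [
        (3, "image1"), (4, "image1"), (5, "image2"),
    ])

    # One bigimage value ('image1') appears twice...
    cur.execute("SELECT COUNT(DISTINCT bigimage) FROM photos")
    print(cur.fetchone()[0])  # 2

    # ...but every (id, bigimage) row is unique, so DISTINCT * removes nothing.
    cur.execute("SELECT COUNT(*) FROM (SELECT DISTINCT * FROM photos)")
    print(cur.fetchone()[0])  # 3
    ```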

  11. #11
    Join Date
    Feb 2003
    Location
    Johannesburg, South Africa
    Posts
    145
    greenman's taking the words right out of my mouth - or is that keyboard

    This is a very strange behaviour, and something that I have never seen before.

    All I can do is echo what greenman's suggested.

    Cheers
