-
Improvement to article: Deleting Duplicate Rows in a MySQL Database
A reader has just pointed out an improvement to my most recent article
On page 2, you can of course use:
CREATE TEMPORARY TABLE bad_temp AS SELECT DISTINCT * FROM bad_table
instead of:
CREATE TEMPORARY TABLE
bad_temp(id INT,name VARCHAR(20))
TYPE=HEAP;
INSERT INTO bad_temp(id,name) SELECT DISTINCT id,name FROM bad_table;
saving yourself doing it in two steps.
Thanks to P-A Fredriksson. Hopefully the changes will reflect on the site soon.
-
Perl Option - no Temp Tables
You all probably read the article, especially:
Be careful when using temporary tables in this way. With our tiny data set, there is not much of a problem, but huge tables could cause problems with memory. etc...
BUT, creating a PERMANENT temporary table on LARGE data sets will move your memory problem from RAM to DISK SPACE. There is also additional DISK IO, which you might not want on an operational system.
There is another option however - Programatically. I didn't have time to write the whole Perl script, but here is the short version:
STEP 1 : Determin which fields are duplicated. This can be a simple:
Code:
mysql> select distinct name,count( name ) from bad_table2 group by 1 order by 2 desc;
+-------------------+---------------+
| name | count( name ) |
+-------------------+---------------+
| Things Fall Apart | 3 |
| The Famished Road | 2 |
| Thirteen cents | 2 |
+-------------------+---------------+
3 rows in set (0.00 sec)
We only need to process entries with a count MORE then 1.
STEP 2 : Loop through the results and delete the duplicates. The loop will produce something similar to:
Code:
mysql> DELETE FROM bad_table2 WHERE name = 'Things Fall Apart' LIMIT 2;
Query OK, 2 rows affected (0.00 sec)
mysql> select distinct name,count( name ) from bad_table2 group by 1 order by 2 desc;
+-------------------+---------------+
| name | count( name ) |
+-------------------+---------------+
| The Famished Road | 2 |
| Thirteen cents | 2 |
| Things Fall Apart | 1 |
+-------------------+---------------+
3 rows in set (0.00 sec)
mysql> select * from bad_table2 where name = 'Things Fall Apart';
+----+-------------------+
| id | name |
+----+-------------------+
| 4 | Things Fall Apart |
+----+-------------------+
1 row in set (0.01 sec)
NOTES :
1) NO additional memory was required.
2) By adding LOW_PRIORITY to the script, the impact on a LIVE DB will be less.
3) Only the "newest" record is retained. Some tinckering might be neccesary if you want to retain only the "oldest" - for whatever reason.
Now, this method is tried and tested. One of my operational tables for IP accounting contains nearly 4 GB of data. Creating TEMP tables to delete duplicates doesn't cut it.
Of course, smaller and even non-operational environments are different stories altogether.
Cheers
Last edited by nicc777; 05-08-2003 at 09:22 AM.
-
Hi there,
Found your example while doing a search on Google, but I am not having much luck and need a small hint on what I am not doing right...
I have an image gallery that has several duplicate records (with a unique ID / primary key) so was using your "Duplicates with Unique Keys" example...
I want to be able to only copy across records where the "bigimage" value is unique.
The query I was trying (and I used HEAPS of different combinations) is:
INSERT INTO photos_unique(user,userid,cat,date,title,descripti on,keywords,bigimage,width,height,filesize,views,m edwidth,medheight,medsize,approved,lastpost,rating ,watermarked,allowprint,extra1,extra2,extra3,extra 4,extra5,extra6,disporder)
SELECT DISTINCT bigimage user,userid,cat,date,title,description,keywords,bi gimage,width,height,filesize,views,medwidth,medhei ght,medsize,approved,lastpost,rating,watermarked,a llowprint,extra1,extra2,extra3,extra4,extra5,extra 6,disporder FROM photos
Ive added commas, brackets, moved DISTINCT all over the place but keep getting errors. The above query works BUT returns all records... Not what I wanted...
Is there something dumb I am doing or an easier way ?
Angus
-
I assume you noticed that there is a syntax error in the SQL you submitted:
SELECT DISTINCT bigimage user,userid should read: SELECT DISTINCT bigimage,user,userid ( notice the comma between bigimage and user )
With that out of the way, I see you mention that the SELECT query returns all records. I would like you to do an experiment:
Code:
SELECT
DISTINCT bigimage, COUNT( bigimage ) AS qty
FROM
photos
GROUP BY 1 HAVING qty > 1
The above code is untested, but should work. The idea is to only display those rows where the 'qty' is more then 1 ( duplicates ). If nothing is returned, then there is no duplicates.
Let us know what the results are.
Cheers
-
>> I assume you noticed that there is a syntax error in the SQL you submitted..
Again - I tried it MANY MANY ways. With and without the comma...
Code:
SELECT
DISTINCT bigimage, COUNT( bigimage ) AS qty
FROM
photos
GROUP BY 1 HAVING qty > 1
About 20 results returned...
If I run
SELECT * FROM `photos` WHERE 1 I get 16987 results...
If I run
SELECT DISTINCT bigimage FROM photos I get 15819 results...
-
Ok, and what happens if you do:
Code:
CREATE TEMPORARY TABLE bad_temp AS SELECT DISTINCT * FROM photos
How many entries in the 'bad_temp' table?
Cheers
-
Wasnt sure how to run what you suggested - ie couldnt find the table bad_temp...
Ran
CREATE TABLE bad_temp AS SELECT DISTINCT * FROM photos
and got the 16,978 number... this would be correct though wouldnt it as the table has a value of id which is unique for each row (an auto incrementing value).
Appreciate the attempt at helping so far though
-
That was the output I expected. Now, the longer way...
Code:
CREATE TABLE bad_temp AS SELECT
DISTINCT
bigimage,user,userid,cat,date,
title,description,keywords,width,
height,filesize,views,medwidth,
medheight,medsize,approved,
lastpost,rating,watermarked,
allowprint,extra1,extra2,extra3,
extra4,extra5,extra6,disporder
FROM
photos
You might need to dropt the previously created temp table first.
Cheers
-
-
Your problem seems to be that:
SELECT
DISTINCT
bigimage,user,userid,cat,date,
title,description,keywords,width,
height,filesize,views,medwidth,
medheight,medsize,approved,
lastpost,rating,watermarked,
allowprint,extra1,extra2,extra3,
extra4,extra5,extra6,disporder
FROM
photos
does not return the correct 15819 results. This can be due to:
1) A 'bug' in MySQL
2) That combination actually does contain 16978 unique combinations.
You obviously have 15819 distinct bigimages, but that does not necessarily mean the entire row is unique. For example:
image1,3
image1,4
Only one distinct 'image1', but 2 distinct rows. So double-check your data to be sure it's a 'bug'.
If you're sure it's MySQL misbehaving, try create an index (not unique) on some fields. Perhaps MySQL is not returning the correct data because its table metadata is not correct. You don't give the full table structure, whether there are other indexes, but try analyse/optimize these to make sure they're in shape.
If you're still having trouble, post your full table structure, and an example of a duplicate returned by your DISTINCT statement.
-
greenman's taking the words right out of my mouth - or is that keyboard
This is a very strange behaviour, and something that I have never seen before.
All I can do is echo what greenman's suggested.
Cheers
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
-
Forum Rules
|
|