Results 1 to 9 of 9

Thread: Retrival and deletion of duplicate rows.

  1. #1
    Babu Ambati Guest

    Retrival and deletion of duplicate rows.


    I have a table...say tb1 of 20 columns which has 2.7 million rows. There is no PK and the only way of identifying a unique row can be done with combination of column1+column2+column3.
    Can anyone help me how to idetify the duplicate rows and also delete the duplicate rows. And to commit after every 5000 rows.
    ITS VERY URGENT....Thanks in advance.

  2. #2
    Sergei Didur Guest

    Retrival and deletion of duplicate rows. (reply)

    First, build clustered index on most selective column.
    Then select list distinct values from indexed column.
    Using this list you can logically split your huge table into small sets.
    For each set use SELECT DISTINCT on all columns and insert results into Work Table. After you done rename tables.

    ------------
    Babu Ambati at 11/17/00 1:52:14 PM


    I have a table...say tb1 of 20 columns which has 2.7 million rows. There is no PK and the only way of identifying a unique row can be done with combination of column1+column2+column3.
    Can anyone help me how to idetify the duplicate rows and also delete the duplicate rows. And to commit after every 5000 rows.
    ITS VERY URGENT....Thanks in advance.

  3. #3
    Guest

    Retrival and deletion of duplicate rows. (reply)


    1)There is an easy way to generate SQL script to identify duplicate records. If you can link your SQL table to an Access database. Open your database on Access. On the Query Tab, Click NEW, on the POP-up Menu, Click "Find Duplicate Query Wizard. Follow the Wizard to create the query. And Open your query in Design and View it in SQL View. You are almost there. Copy the SQL statements to a SQL Query Analyzer, Play with the code until you satisfied.

    ------------
    Babu Ambati at 11/17/00 1:52:14 PM


    I have a table...say tb1 of 20 columns which has 2.7 million rows. There is no PK and the only way of identifying a unique row can be done with combination of column1+column2+column3.
    Can anyone help me how to idetify the duplicate rows and also delete the duplicate rows. And to commit after every 5000 rows.
    ITS VERY URGENT....Thanks in advance.

  4. #4
    Babu Ambati Guest

    Retrival and deletion of duplicate rows. (reply)

    Hi Serei,
    To have a clustered index i will require lot of space. The table has 2.7 million rows. Can you post the code for selecting distinct values. Like from start to scrach how to do the retrival and eliminate of duplicate records.


    ------------
    Sergei Didur at 11/17/00 2:37:44 PM

    First, build clustered index on most selective column.
    Then select list distinct values from indexed column.
    Using this list you can logically split your huge table into small sets.
    For each set use SELECT DISTINCT on all columns and insert results into Work Table. After you done rename tables.

    ------------
    Babu Ambati at 11/17/00 1:52:14 PM


    I have a table...say tb1 of 20 columns which has 2.7 million rows. There is no PK and the only way of identifying a unique row can be done with combination of column1+column2+column3.
    Can anyone help me how to idetify the duplicate rows and also delete the duplicate rows. And to commit after every 5000 rows.
    ITS VERY URGENT....Thanks in advance.

  5. #5
    Richard Chu Guest

    Retrival and deletion of duplicate rows. (reply)

    Babu, try this:

    -- first generate a identity-keyed work table containing all the rows
    -- (SELECT INTO/ BULK COPY option needs to be set ON)

    select IDENTITY(int,0,1) AS RecNo, *
    into tb1_DEDUP
    from tb1
    go


    -- then generate a table containing the keys of all the duplicated records

    select RecNo into tb1_DEDUP_keys
    from tb1_DEDUP
    where RecNo not IN
    (select MIN (RecNo) from tb1_DEDUP
    group by column1, column2, column3)

    -- add a clustered index to the work tables to optimize queries ..

    create clustered index tb1_DEDUP_CLX on tb1_DEDUP (RecNo)
    create clustered index tb1_DEDUP_keys_CLX on tb1_DEDUP_keys (RecNo)

    -- query to show/ delete dup records 5000 rows at a time
    -- repeat this query until 0 rows are affected

    set rowcount 5000

    delete tb1_DEDUP
    -- to show records to be deleted, replace 'delete' with 'SELECT t1.*'
    from tb1_DEDUP t1, tb1_DEDUP_keys t2
    where t1.RecNo = t2.RecNo

    set rowcount 0

    --- when tb1_DEDUP has been checked out good to go
    --- then truncate the target table and insert the deduped
    --- rows back in, 5000 or whatever number of rows at a time ..

    truncate table tb1 -- non-logged deletion of all rows

    insert into tb1
    select column1, column2, column3 ..
    from tb1_DEDUP
    where RecNo >= 0 and RecNo <= 5000

    .................................................. ...
    Richard
    R International,Inc.
    Wash. DC 20001



    ------------
    Babu Ambati at 11/17/00 1:52:14 PM


    I have a table...say tb1 of 20 columns which has 2.7 million rows. There is no PK and the only way of identifying a unique row can be done with combination of column1+column2+column3.
    Can anyone help me how to idetify the duplicate rows and also delete the duplicate rows. And to commit after every 5000 rows.
    ITS VERY URGENT....Thanks in advance.

  6. #6
    Urvish Panchal Guest

    Retrival and deletion of duplicate rows. (reply)

    Hi there,

    Try this procedure.

    CREATE PROCEDURE sp_findDupValues
    (
    @tablename varchar( 40 ),
    @colname varchar( 40 )
    )
    AS
    DECLARE @sqlString VARCHAR( 255 )

    SELECT @sqlString =
    &#39;SELECT &#39; + @colname +
    &#39; ,COUNT( * ) AS &#34;COUNT&#34; &#39; +
    &#39; FROM &#39; + @tablename +
    &#39; GROUP BY &#39; + @colname +
    &#39; HAVING count(*) > 1 &#39;

    EXEC( @sqlString )




    ------------
    Kathy Overcash at 11/20/00 8:57:16 AM


    I recommend that you create another table to move your data into. Create a unique index using &#34;with IGNORE_DUP_KEY&#34; on your new table. When you insert your records into the new table, the duplicate rows will not be inserted. You won&#39;t get an error either. The duplicate key rows will be ignored. You can delete your old table and rename your new table to the original table name. Test it out on a small set of test data to see if it gives you the results you are looking for.

    CREATE UNIQUE INDEX UniqueNames on tblNames (FirstName, LastName, MI)
    WITH IGNORE_DUP_KEY



    ------------
    Babu Ambati at 11/17/00 1:52:14 PM


    I have a table...say tb1 of 20 columns which has 2.7 million rows. There is no PK and the only way of identifying a unique row can be done with combination of column1+column2+column3.
    Can anyone help me how to idetify the duplicate rows and also delete the duplicate rows. And to commit after every 5000 rows.
    ITS VERY URGENT....Thanks in advance.

  7. #7
    Suresh Peri Guest

    Retrival and deletion of duplicate rows. (reply)


    Hi Babu,

    The solution of creating index with IGNORE_DUPKEY works well.But if you don&#39;t want to create another table,you can use Cursors for positioned deletes.

    You need to create a cursor to get the count of duplicate rows

    eg:declare cur cursor for select distinct * from <table name>
    then get the count of rows in the table for each row fetched by the cursor

    while this count is greater than 1 you have to create another cursor for update and delete the rows(with positioned deletes).

    If you want i can give you the code to achieve this.


    ------------
    Babu Ambati at 11/17/00 1:52:14 PM


    I have a table...say tb1 of 20 columns which has 2.7 million rows. There is no PK and the only way of identifying a unique row can be done with combination of column1+column2+column3.
    Can anyone help me how to idetify the duplicate rows and also delete the duplicate rows. And to commit after every 5000 rows.
    ITS VERY URGENT....Thanks in advance.

  8. #8
    Richard Sibbit Guest

    Retrival and deletion of duplicate rows. (reply)


    The way I do it is:

    First get all duplicate rows and place in a tempTable:-

    Select col1, col2, col3, count(*) as num into tempDupes
    group by col1, col2, col3
    having count(*) > 1

    Then get the distinct data for these dupe rows:-

    Select distinct a.* into tempDupesData
    From yourTable a, tempDupes b
    Where a.col11 = b.col1 and a.col2 = b.col2 and a.col3 = b.col3

    Now delete data from your table:-
    Delete a
    From yourTable a, tempDupes b
    Where a.col11 = b.col1 and a.col2 = b.col2 and a.col3 = b.col3

    Reinsert the distinct data:-
    Insert into yourTable (col1, ... etc)
    Select col1, ...etc from tempDupesData b

    Hope this helps - does it all in one go and may be useful for the future.

    Richard Sibbit
    DBT Limited

    ------------
    Babu Ambati at 11/17/00 1:52:14 PM


    I have a table...say tb1 of 20 columns which has 2.7 million rows. There is no PK and the only way of identifying a unique row can be done with combination of column1+column2+column3.
    Can anyone help me how to idetify the duplicate rows and also delete the duplicate rows. And to commit after every 5000 rows.
    ITS VERY URGENT....Thanks in advance.

  9. #9
    Suresh Guest

    Retrival and deletion of duplicate rows. (reply)

    Hi Babu try this..
    ------------------
    CREATE PROCEDURE [p] AS

    DECLARE @a int,@b char(10) ,@c int

    DECLARE cur CURSOR FOR

    SELECT a,b,COUNT(*) FROM t1 GROUP BY a,b HAVING COUNT(*)>1

    OPEN cur

    FETCH NEXT FROM cur INTO @a,@b,@c

    WHILE @@FETCH_STATUS=0

    BEGIN

    SET @c=@c-1

    SET ROWCOUNT @c

    DELETE FROM t1 WHERE a=@a AND b=@b

    SET ROWCOUNT 0

    FETCH NEXT FROM cur INTO @a,@b,@c

    END

    CLOSE cur

    DEALLOCATE cur




    ------------
    Suresh Peri at 11/21/00 1:59:54 AM


    Hi Babu,

    The solution of creating index with IGNORE_DUPKEY works well.But if you don&#39;t want to create another table,you can use Cursors for positioned deletes.

    You need to create a cursor to get the count of duplicate rows

    eg:declare cur cursor for select distinct * from <table name>
    then get the count of rows in the table for each row fetched by the cursor

    while this count is greater than 1 you have to create another cursor for update and delete the rows(with positioned deletes).

    If you want i can give you the code to achieve this.


    ------------
    Babu Ambati at 11/17/00 1:52:14 PM


    I have a table...say tb1 of 20 columns which has 2.7 million rows. There is no PK and the only way of identifying a unique row can be done with combination of column1+column2+column3.
    Can anyone help me how to idetify the duplicate rows and also delete the duplicate rows. And to commit after every 5000 rows.
    ITS VERY URGENT....Thanks in advance.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •