Sunday, December 27, 2015

SQL Server Brain Basher of the Week #037 - TSQL - Deleting duplicates

Deleting records is a very common activity performed against databases and everyone is familiar with DELETE statement. Here is a question based on DELETE statement.

What is the easiest way of deleting duplicated records from a table if you can identify the duplicated records by a column?

There can be multiple ways of doing it. Before seeing the solution, let's make a table with duplicates.

IF OBJECT_ID('dbo.ProductCode') IS NOT NULL
 DROP TABLE dbo.ProductCode;
GO

CREATE TABLE dbo.ProductCode
(
 ProductCode char(5)
 , ProductType char(5)
);
GO

INSERT INTO dbo.ProductCode
VALUES
 ('PR001', 'T0001')
 , ('PR001', 'T0001')
 , ('PR002', 'T0002')
 , ('PR003', 'T0003')
 , ('PR001', 'T0001')
 , ('PR004', 'T0004')
 , ('PR003', 'T0003')



Here are two ways of deleting records, both work fine, however first method might give better performance over a large dataset.

-- Here is the way of deleting duplicates
WITH cte AS
(
 SELECT *, 
  ROW_NUMBER() OVER (PARTITION BY ProductCode ORDER BY ProductCode) AS RowNumber
 FROM dbo.ProductCode)
DELETE FROM cte
WHERE RowNumber > 1 ;

-- Checking records
SELECT * FROM dbo.ProductCode
ORDER BY ProductCode;


-- Here is another way of deleting them
SET ROWCOUNT 1
WHILE 1=1
BEGIN

 DELETE 
 FROM dbo.ProductCode
 WHERE ProductCode IN (SELECT ProductCode
       FROM dbo.ProductCode
       GROUP BY ProductCode
       HAVING COUNT(*) > 1)
 IF @@ROWCOUNT = 0
  BREAK;
END
SET ROWCOUNT 0

-- Checking records
SELECT * FROM dbo.ProductCode
ORDER BY ProductCode;


No comments: