Friday, December 30, 2016

Inserting records, ignoring duplicates, recording them

A column like email address in en entity like Customer or Employee is always set with an UNIQUE Constraint to make sure that no duplicates values are inserted. This is achieved through an Unique Index setting up with the column. This does not annoy us when we accept only one record at a time for inserting or updating but it surely annoys us when we perform a Bulk Operation.

There are instances that we load large number of records as a single batch. For an example, if we get 1 million records to be inserted to a table like Customer that has Email Column, if one duplicate is detected within the batch, the whole batch is getting rolled back. Sometimes, we need to continue with all other records ignoring duplicates but the standard Unique Index does not allow us to do it.

Here is an example.

USE tempdb;
GO

-- Creating a table for Customers
IF OBJECT_ID('dbo.Customers') IS NOT NULL
 DROP TABLE dbo.Customers;
GO
CREATE TABLE Customers
(
 Title char(5) NOT NULL
 , FirstName varchar(50) NOT NULL
 , MiddleName varchar(50)  NULL
 , LastName varchar(50) NOT NULL
 , EmailAddress varchar(100) NOT NULL
 
)

-- Addin the unique index on the Email Column
CREATE UNIQUE INDEX IX_Customers ON dbo.Customers (EmailAddress)

-- Inserting one customer for testing
INSERT INTO dbo.Customers
 (Title, FirstName, MiddleName, LastName, EmailAddress)
VALUES
 ('Mr.', 'Dinesh', NULL, 'Priyankara', 'dinesh@dinesql.com')

If we try to insert records as a batch that has duplicates, none of them will be inserted.

-- This batch contains 4 records, note the 2nd and 3rd record
-- They have the same emial address
INSERT INTO dbo.Customers
 (Title, FirstName, MiddleName, LastName, EmailAddress)
SELECT 'Mr.', 'Yeshan', NULL, 'Santhush', 'yeshan@dinesql.com'
UNION ALL
SELECT 'Ms.', 'Jane', NULL, 'Western', 'jane@dinesql.com'
UNION ALL
SELECT 'Ms.', 'Jane', NULL, 'Black', 'jane@dinesql.com'
UNION ALL
SELECT 'Mr.', 'Jack', NULL, 'Knight', 'jack@dinesql.com'

The above operation results this;

Msg 2601, Level 14, State 1, Line 31
Cannot insert duplicate key row in object 'dbo.Customers' with unique index 'IX_Customers'. 
The duplicate key value is (jane@dinesql.com).
The statement has been terminated.

If we need to make sure that the operation ignores duplicates and continues, all we have to do is, add IGNORE_DUP_KEY option to the index.

-- dropping and recreating the index with IGNORE_DUP_KEY option
DROP INDEX IX_Customers ON dbo.Customers 
GO

CREATE UNIQUE INDEX IX_Customers ON dbo.Customers (EmailAddress) WITH (IGNORE_DUP_KEY = ON)

-- checking the insert again. It inserts all except the duplicate
INSERT INTO dbo.Customers
 (Title, FirstName, MiddleName, LastName, EmailAddress)
SELECT 'Mr.', 'Yeshan', NULL, 'Santhush', 'yeshan@dinesql.com'
UNION ALL
SELECT 'Ms.', 'Jane', NULL, 'Western', 'jane@dinesql.com'
UNION ALL
SELECT 'Ms.', 'Jane', NULL, 'Black', 'jane@dinesql.com'
UNION ALL
SELECT 'Mr.', 'Jack', NULL, 'Knight', 'jack@dinesql.com'

Though this works fine, an obvious question comes up is, how do we know what records have been ignored? There is no direct method to capture ignored records with this solution however following can be used with some additional coding;
  1. Load all email addresses to a Temporary Table or Table Variable before performing the main operation. After main Insert Operation, do a comparison using LEFT OUTER JOIN to find out records that were not inserted.
  2. Use Integration Services for the operation. It has allows you to capture problematic records using error handlers or can duplicate the flow, record all emails addresses in another container, and do the comparison at the end of the task.

No comments: