Showing posts with label MERGE. Show all posts
Showing posts with label MERGE. Show all posts

Sunday, January 22, 2012

TABLE Type and MERGE Statement

As I get many offline questions regarding the presentation “Reducing RoundTrips” I did, thought to summarize the content we discussed as a post. Session spoke about how to reduce multiple database calls from client to database application when an updated data set (Ex. data table bound to a grid) is required to send from client application to database. In addition to that we discussed the way of handling all three operations: INSERT, UPDATE, and DELETE using a single statement. This is what the session focused on;

MERGE
MERGE statement is used to insert data that does not exist but to update if it does exist. It is an extension of UPSERT which is a known term in other database management systems for performing both UPDATE and INSERT statement using a single technique. The main benefit of this is the facility to manipulate many number of records in a single table using a source as an atomic operation. This eliminates individual statements for INSERT, UPDATE and DELETE operations, one statement which is MERGE handles all three operations. MERGE statement has following elements;

  • Target: This the table or view (destination) which needs to be updated.
  • Source: This contains updated data (modified, newly added) and will be used for updating the target. This can be a table, view, derived table, CTE, or table function.
  • WHEN MATCHED [AND] ...... THEN: This is where the action which needs to be performed when the row in the source is found in the target, should be written. Two WHEN MATCHED clauses are allowed, limiting one for UPDATE as an action and other for DELETE as an action.
  • WHEN NOT MATCHED [BY TARGET] ...... THEN: The action which performs when the row in the source is not found in the target, has to be written with this. It is usually the INSERT action.
  • WHEN NOT MATCHED [BY SOURCE] ...... THEN: This is to perform an action when the row in the target is not supplied with the source. Usually it is DELETE.
  • OUTPUT clause and $Action: The standard inserted and deleted virtual tables are available with MERGE. In addition to that $Action provides the operation performed; INSERT, UPDATE, DELETE. This is normally used for logging/auditing.

In order to use MERGE, a source must be defined and filled. Usually the source is defined and filled by the client but for various limitations and less facilities, we used to iterate the filled-source at the client site itself and send one row at a time to database, which makes many calls to database for updating all records in the source.

TABLE Type
This can be overcome using TABLE data type. TABLE data type can be created as a user-defined data type definition and can be used as either variable or parameter. Following are the key elements of creating a type;

  • CREATE TYPE: This is used for creating the user defined type. It has to be created as a TABLE and structure has to be defined.
  • READONLY keyword: If the type is used as a parameter, READONLY keyword has to be used. It is as an OUTPUT parameter is not supported.
  • SqlDbType.Structured: When a data table from a .NET application is sent, parameter data type must be declared as System.Data.SqlDbType.Structured.

Here is a sample code for all;

This code creates a table called Customer and inserts 4 rows. In addition to that it creates a log table too, for recording operations.

USE tempdb
GO
 
-- create a table for customer
CREATE TABLE dbo.Customer
(
    CustomerId int identity(1,1) not null
    , CustomerCode char(4) not null
    , FirstName varchar(25) null
    , LastName varchar(25) not null
    , CreditLimit money not null
    , IsLocal bit not null
    , LastModified smalldatetime not null
)
GO
 
-- insert four records
INSERT INTO Customer
    (CustomerCode, FirstName, LastName, CreditLimit, IsLocal, LastModified)
VALUES
    ('C001', 'Dinesh', 'Priyankara', 10000, 1, getdate())
    , ('C002', 'Jane', 'K', 10000, 1, getdate())
    , ('C003', 'Martinie', 'Ku', 10000, 0, getdate())
    , ('C004', 'Joe', 'Anderson', 10000, 0, getdate())
GO
 
CREATE TABLE dbo.CustomerLog
(
    CustomerCode char(4) not null
    , DateModified datetime not null
    , Action varchar(15) not null
)    
GO

Here is the stored procedure for accepting an updated data set using TABLE type and updating Customer using MERGE.

-- create a stored procedure for accepting created TABLE type as a parameter
-- note that READONLY has to be used
CREATE PROC dbo.InsertCustomer @CustomerType dbo.CustomerType READONLY
AS
BEGIN
 
    INSERT INTO dbo.CustomerLog
    (CustomerCode, DateModified, [Action])
    SELECT o.CustomerCode, getdate(), o.[Action]
    FROM
        (MERGE INTO dbo.Customer AS c
        USING @CustomerType AS t
            ON c.CustomerCode = t.CustomerCode
        WHEN MATCHED AND t.IsLocal = 0 THEN
            UPDATE SET c.FirstName = t. FirstName
                    , c.LastName = t.LastName
                    , c.CreditLimit = t.CreditLimit
                    , c.LastModified = getdate()
        WHEN NOT MATCHED THEN
            INSERT (CustomerCode, FirstName, LastName, CreditLimit, IsLocal, LastModified)
                VALUES
                    (t.CustomerCode, t.FirstName, t.LastName, t.CreditLimit, t.IsLocal, getdate())
        WHEN NOT MATCHED BY SOURCE THEN
            DELETE
        OUTPUT isnull(inserted.CustomerCode, deleted.CustomerCode) as CustomerCode, $ACTION AS Action) AS o
            
END
GO

Everything is ready from database level. Here is a sample code for .NET application.

SqlConnection connection = new SqlConnection(@"Server=.\SQL2008R2;Integrated Security=SSPI;Database=tempdb");
DataTable customer = new DataTable("Customer");
customer.Columns.Add("CustomerCode", typeof(string));
customer.Columns.Add("FirstName", typeof(string));
customer.Columns.Add("LastName", typeof(string));
customer.Columns.Add("CreditLimit", typeof(decimal));
customer.Columns.Add("IsLocal", typeof(bool));
 
customer.Rows.Add("C001", "Dinesh", "Priyankara", 10000, true); // no change and no update
customer.Rows.Add("C002", "Jane", "Kani", 25000, true); // should not update
customer.Rows.Add("C004", "Joe", "Andrew", 35000, true); // update all
customer.Rows.Add("C005", "Kate", "Neo", 35000, true); // new record
 
SqlCommand command = new SqlCommand("InsertCustomer", connection);  
command.CommandType = CommandType.StoredProcedure;
 
SqlParameter parameter = new SqlParameter("@CustomerType", System.Data.SqlDbType.Structured);
parameter.Value = customer;
command.Parameters.Add(parameter);
 
connection.Open();
command.ExecuteNonQuery();
connection.Close();

Enjoy!

Monday, November 1, 2010

SSIS: Replacing Slowly Changing Dimension (SCD) wizard with the MERGE statement

Slowly Changing Dimensions are business entities in DW/BI systems that have attributes that do not change very often. In order to handle loading of SCDs, we use SQL Server Integration Services Data Flow component: Slowly Changing Dimension Component. It is a wizard that helps us to handle changing attributes.

If you are a SSIS developer and have used SCD wizard, you have already seen the bad performance on it. It works fine with a dataset that contains records less than something around 5000 (this is what I have seen, please do not consider it is as a benchmark) but gives very slow performance on beyond. Recently, I noticed that one of the projects done had used SCD wizard for almost all data loading and the time it takes for completing the load has gone from 5 minutes for 5-6 hours. It significantly increases the time when the data load is getting higher and higher.

How to improve the performance of loading of SCDs?
I had to give a solution for this…. two things came into my mind, one is Kimball Method SSIS Slowly Changing Dimension Component (will make a separate post on this) and the other is MERGE T-SQL statement. Although the MERGE implementation is bit complex, I thought to recommend the MERGE because many have accepted that the performance it gives is superb. The below table shows a comparison made on three different approaches on SCDs loading with some criteria (which is quoted from Matt Masson’s presentation named Performance Design Pattern).

table

You can easily notice that MERGE gives the best performance though there are drawbacks.

Type of Slowly Changing Dimensions
Slowly Changing Dimensions are categorized into three types named: Type 1, Type 2, and Type3. The Type 1 SCD does not maintain the history of changing attributes, it overwrites values of the attributes. Type 2 maintains historical values for changing attributes. Type 3 that we do not use much (I have not used it at all) maintains separate columns for changed attributes. SSIS SCD wizard supports both Type 1 and Type 2.

Using MERGE instead of SCD wizard
Replacement of SCD wizard with MERGE is not a straightforward technique. If the SCD has both Type 1 and Type 2 types attributes, they need to be handled separately. Let’s create a simple table with following set of data, in order to see the way of handling them with MERGE.
table1 Following attributes in this table are Type 1:

  • FirstName
  • LastName

Following attributes are Type 2:

  • MaritalStatus
  • Country

Following script creates the Dimension table and inserts data;

CREATE TABLE dbo.DimCustomer
(
    CustomerKey int IDENTITY(1,1) PRIMARY KEY
    , SourceCode char(5) NOT NULL
    , FirstName varchar(50) NOT NULL -- Type 1
    , LastName varchar(50) NOT NULL -- Type 1
    , MaritalStatus char(1) NOT NULL -- Type 2
    , Country  varchar(50) NOT NULL -- Type 2
    , StartDate date NOT NULL
    , EndDate date NOT NULL
    , CONSTRAINT IX_DimCustomer_SourceCode UNIQUE (SourceCode)
)
GO
 
INSERT INTO dbo.DimCustomer
    (SourceCode, FirstName, LastName, MaritalStatus, Country
    , StartDate, EndDate)
VALUES
    ('AAAA1', 'Dinesh', 'Priyankara', 'M', 'Sri Lanka'
        , '01/01/2009', '12/31/9999')
    , ('AAAA2', 'Yeshan', 'Senthush', 'S', 'Sri Lanka'
        , '01/01/2009', '12/31/9999')
    , ('AAAA3', 'Jane', 'Alvarez', 'S', 'France'
        , '01/01/2009', '12/31/9999')
    , ('AAAA4', 'Shannon', 'Carlson', 'M', 'India'
        , '01/01/2009', '12/31/9999')
    
GO

When data loading of SCDs are handled with SSIS,what we usually do is, once the data set is captured, we send it through SCD component. The SCD component inserts new data, updates Type 1 data, and finally inserts Type 2 data. If we are to use MERGE, we need to send the captured data to a temporary table and then use Execute SQL Task for executing the MERGE statement. Assume that following table is the temporary table and it is loaded with captured data. 
table2

This code creates the table and populates data;

CREATE TABLE dbo.TempDimCustomer
(
    SourceCode char(5) NOT NULL
    , FirstName varchar(50) NOT NULL
    , LastName varchar(50) NOT NULL
    , MaritalStatus char(1) NOT NULL
    , Country  varchar(50) NOT NULL
)    
GO
 
INSERT INTO dbo.TempDimCustomer
    (SourceCode, FirstName, LastName, MaritalStatus, Country)
VALUES
    ('AAAA1', 'Dinesh', 'Priyankara', 'M', 'Sri Lanka')
    , ('AAAA2', 'Yeshan', 'Santhush', 'S', 'Sri Lanka')
    , ('AAAA3', 'Jane', 'Alvarez', 'M', 'France')
    , ('AAAA4', 'Shannon', 'Carlson', 'M', 'German')
    , ('AAAA5', 'Jon', 'Yang', 'M', 'Japan')
GO

Handling Type 1 attributes
The below codes shows the MERGE statement that updates Type 1 attributes. Note that it goes through all records and updates.

MERGE INTO dbo.DimCustomer c
USING dbo.TempDimCustomer tc
ON (c.SourceCode = tc.SourceCode)
WHEN MATCHED 
    AND 
        (c.FirstName != tc.FirstName
        OR c.LastName != tc.LastName)
    THEN UPDATE
        SET c.FirstName = tc.FirstName
            , c.LastName = tc.LastName;

Handling Type 2 attributes
Handling Type 2 is tricky. If a record is changed, we need to update the old record, setting the EndDate as current date, and then insert a new record for it, with StartDate as current date. See the code below;

INSERT INTO DimCustomer
    (SourceCode, FirstName, LastName, MaritalStatus, Country
    , StartDate, EndDate)
SELECT Modified.SourceCode, Modified.FirstName
    , Modified.LastName, Modified.MaritalStatus
    , Modified.Country, Modified.StartDate
    , Modified.EndDate
FROM
    (MERGE INTO dbo.DimCustomer c
    USING dbo.TempDImCustomer tc
    ON (c.SourceCode = tc.SourceCode)
    WHEN NOT MATCHED 
        THEN INSERT
            (SourceCode, FirstName, LastName, MaritalStatus, Country
            , StartDate, EndDate)
            VALUES
            (tc.SourceCode, tc.FirstName, tc.LastName, tc.MaritalStatus, tc.Country
            , GetDate(), '12/31/9999')
    WHEN MATCHED 
        AND c.EndDate = '12/31/9999'
        AND (c.MaritalStatus != tc.MaritalStatus
            OR c.Country != tc.Country)
        THEN UPDATE
            SET c.EndDate = getDate()
        OUTPUT 
            $Action Action, tc.SourceCode, tc.FirstName, tc.LastName
            , tc.MaritalStatus, tc.Country
            , '10/28/2010' StartDate, '12/31/9999' EndDate) AS Modified
WHERE Modified.Action = 'UPDATE';

See the inner MERGE first. With first NOT MATCHED section, it inserts new records. Next is for finding MACTHED and Type 2 changes. Note that the check is done only on latest records (EndDate = ‘12/31/9999’). If found, EndDate is updated and records are returned as a OUTPUT of the MERGE. This output is captured by outer INSERT statement and inserts them as new records. Done!

Please do test both approaches before deciding the technique. If the data set is large, it would be better to use MERGE instead of SSIS SCD component.