Thursday, April 30, 2015

Incremental ETL: How to identify changes for fact data when no support from the source

"Data warehouse is a read-only database": this is one of the characteristics of data warehouse and we used to say "History cannot be changed" for supporting this characteristic. Therefore we always try our best not to bring changes into the data warehouse.

This was thoroughly considered and applied with traditional data warehousing because the cost of processing OLAP models was very high. As I see it, with modern implementations this cost has come down for two reasons: Tabular Model implementations instead of Multi-Dimensional Models, and user-specific models with Self-Service BI. Therefore, if it is required, changes can be brought into the data warehouse to some extent, considering both the cost and the benefit of doing so.

However, when we try to bring changes into the data warehouse, one major issue we face is the lack of support from sources for finding changes in previous transactions that have already been loaded. Some sources maintain a timestamp indicating the changed date, or a flag indicating whether the record has been modified. If there is no such thing in the source, changes cannot be easily identified. In this scenario, one way of identifying changes is checking and comparing each and every record, each and every column, in the fact table and seeing whether they are different. If the dataset is very large and the data warehouse is large too, this is not very practical. Therefore we should either stop taking changes once data is loaded into the data warehouse, or take them under some agreed rules.

Here is one of the ways I have used. First of all, we need to set an acceptable time period for accepting changes from the source. The agreed period can be 1 week, 1 month or 3 months; it is based on factors such as the business logic involved with the transactions, their complexity and their volume. Once agreed, the next step is holding the loaded data in the staging environment for the same period. If we have to expect changes for the last three months, we need to make sure that the staging database holds the last three months of extracted data. In addition to that, we need to maintain an additional column for holding a checksum. That is what we use for comparing records between the source and staging, instead of comparing each and every column.

For example, assume that Sales data related to FactSales in SalesDataWarehouse is loaded via a staging table called Sales. Considering the above requirement, the staging table must be created as below:

-- Staging table for holding sales
CREATE TABLE dbo.Sales
(
 SalesOrderID int NOT NULL
 , OrderDate datetime NOT NULL
 , ShipDate datetime NOT NULL
 , SalesOrderNumber varchar(25) NOT NULL
 , Product varchar(100) NOT NULL
 , ListPrice money NOT NULL
 , UnitPrice money NOT NULL
 , OrderQty int NOT NULL
 , LineTotal money NOT NULL
 , CheckSumValue int NOT NULL -- Checksum value from above columns
 , RecordType tinyint NOT NULL -- 1-New, 2-Modified, 3-Loaded
);
GO

-- Creating clustered index
CREATE UNIQUE CLUSTERED INDEX IX_Sales_SalesOrderID ON dbo.Sales (SalesOrderID);
GO

-- Creating an index on the Checksum column
CREATE INDEX IX_Sales_CheckSumValue ON dbo.Sales (CheckSumValue);
GO

When we load new data, for example yesterday's data, we load it into the staging table with RecordType = 1.
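
The statement below is only a sketch of that initial load; the source table name SourceSalesExtract and the extraction filter are assumptions made for illustration.

-- Loading newly extracted records into the staging table (sketch)
INSERT INTO dbo.Sales
 (SalesOrderID, OrderDate, ShipDate, SalesOrderNumber
  , Product, ListPrice, UnitPrice, OrderQty, LineTotal
  , CheckSumValue, RecordType)
SELECT
 SalesOrderID, OrderDate, ShipDate, SalesOrderNumber
 , Product, ListPrice, UnitPrice, OrderQty, LineTotal
 , 0   -- placeholder; the checksum is generated in the next step
 , 1   -- 1-New
FROM dbo.SourceSalesExtract -- hypothetical extract table
WHERE OrderDate >= DATEADD(dd, -1, CONVERT(date, GETDATE()));
GO

Once inserted, the Checksum value must be generated for the newly inserted records: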

UPDATE dbo.Sales
 SET CheckSumValue = Checksum(SalesOrderID, OrderDate
      , ShipDate, SalesOrderNumber
      , Product, ListPrice, UnitPrice
      , OrderQty, LineTotal)
WHERE RecordType = 1;

After that, the ETL module can access the new records (RecordType = 1) and load them into the data warehouse. Once they are loaded, RecordType must be set to 3, marking them as already processed.
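
A simple update like the sketch below, which assumes that all RecordType = 1 records were successfully processed by the ETL module, can be used for that.

-- Marking processed records as loaded
UPDATE dbo.Sales
 SET RecordType = 3 -- 3-Loaded
WHERE RecordType = 1;
GO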

Now, how do we handle changes for the last three months? For that, we need to get the last three months of transactions from the source again and calculate the checksum for the extracted data. If SSIS is used for loading the staging table, we can push the data through a Data Flow and generate the new checksum there. If we use a method like bcp or T-SQL, it is better to load the extract into another temporary table.
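
A possible structure for that temporary table is sketched below; it is an assumption made for illustration and simply mirrors the source columns of the staging table. The new checksum can either be pre-calculated into it or computed inline during the comparison, as done in the update that follows.

-- Temporary table for the re-extracted last three months (sketch)
CREATE TABLE dbo.TempSales
(
 SalesOrderID int NOT NULL
 , OrderDate datetime NOT NULL
 , ShipDate datetime NOT NULL
 , SalesOrderNumber varchar(25) NOT NULL
 , Product varchar(100) NOT NULL
 , ListPrice money NOT NULL
 , UnitPrice money NOT NULL
 , OrderQty int NOT NULL
 , LineTotal money NOT NULL
);
GO

Once we have the re-extracted dataset, modified records can be identified by comparing the new checksum with the stored checksum, and the staging records can be updated as below: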

UPDATE s
 SET OrderDate = t.OrderDate
  , ShipDate = t.ShipDate
  , SalesOrderNumber = t.SalesOrderNumber
  , Product = t.Product
  , ListPrice = t.ListPrice
  , UnitPrice = t.UnitPrice 
  , OrderQty = t.OrderQty
  , LineTotal = t.LineTotal
  -- refreshing the stored checksum so the record is not detected as modified again in the next run
  , CheckSumValue = Checksum(t.SalesOrderID, t.OrderDate
      , t.ShipDate, t.SalesOrderNumber
      , t.Product, t.ListPrice, t.UnitPrice
      , t.OrderQty, t.LineTotal)
  , RecordType = 2
FROM dbo.Sales s
 INNER JOIN dbo.TempSales t
  ON s.SalesOrderID = t.SalesOrderID 
   AND s.CheckSumValue != Checksum(t.SalesOrderID, t.OrderDate
      , t.ShipDate, t.SalesOrderNumber
      , t.Product, t.ListPrice, t.UnitPrice
      , t.OrderQty, t.LineTotal);

This sets RecordType to 2, indicating that the record has been changed. Once updated, another ETL can access these records and update FactSales in the data warehouse.
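
That second ETL step could be as simple as the sketch below. Note that the structure of FactSales is not given in this post; the sketch assumes it keeps SalesOrderID as a degenerate dimension and that measures such as UnitPrice, OrderQty and LineTotal can be updated directly.

-- Updating the fact table with modified records (sketch; FactSales columns are assumptions)
UPDATE f
 SET UnitPrice = s.UnitPrice
  , OrderQty = s.OrderQty
  , LineTotal = s.LineTotal
FROM SalesDataWarehouse.dbo.FactSales f
 INNER JOIN dbo.Sales s
  ON f.SalesOrderID = s.SalesOrderID
WHERE s.RecordType = 2;
GO

-- Marking the modified records as loaded again
UPDATE dbo.Sales
 SET RecordType = 3
WHERE RecordType = 2;
GO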

This reduces the complexity of finding changes, as well as the time and resources required for processing. However, since we use Checksum, in certain situations the same value might be generated even though the record has changed. If you experience this, it is recommended to use HashBytes instead of Checksum.
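
The sketch below shows how the checksum generation could be written with HashBytes; it is an assumption-based example, not code from the original solution. HashBytes accepts a single character or binary value, so the columns are concatenated with a separator, and since SHA2_256 returns a 32-byte hash, the CheckSumValue column would have to be changed from int to varbinary(32). The same expression would also replace the Checksum call used in the comparison.

-- Generating the hash with HashBytes (sketch; CheckSumValue must be varbinary(32) for SHA2_256)
UPDATE dbo.Sales
 SET CheckSumValue = HASHBYTES('SHA2_256'
      , CONCAT(SalesOrderID, '|', OrderDate, '|', ShipDate, '|', SalesOrderNumber
      , '|', Product, '|', ListPrice, '|', UnitPrice, '|', OrderQty, '|', LineTotal))
WHERE RecordType = 1;
GO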

