Saturday, June 27, 2015

Ways of incrementally loading data to a data warehouse

I previously wrote a post on incremental ETL (Incremental ETL: How to identify changes for fact data when no support from the source) that discusses how to use Checksum for identifying changes before transferring extracted data from the source, specifically when the source does not support identifying changed records. It raised a question: what are the other ways of incrementally loading a data warehouse? This post discusses them.

There are multiple ways of loading incremental data into a data warehouse. One has to be picked based on many factors, such as performance, whether the data warehouse accepts updates, whether it maintains history, and whether a staging environment is maintained (Do we need a staging database for warehousing projects?). Here are the most common ways used:

CDC enabled source tables
If the source table is enabled with CDC (Change Data Capture), changes can be easily identified and the warehouse table, whether dimension or fact, can be updated appropriately. If the ETL is written with SSIS, the CDC-related tasks can be used for identifying changes, and splitters can be used for determining the type of each change (INSERT, UPDATE, or DELETE) so that the warehouse table is updated accordingly.
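As a rough illustration outside SSIS, CDC changes can also be read directly with T-SQL. The sketch below assumes a hypothetical dbo.Customer source table and its default capture instance, dbo_Customer:

    -- Enable CDC on the database, then on the source table
    EXEC sys.sp_cdc_enable_db;

    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'Customer',
        @role_name     = NULL;

    -- Read all changes captured between two LSNs;
    -- __$operation: 1 = delete, 2 = insert, 4 = update (after image)
    DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn(N'dbo_Customer');
    DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

    SELECT __$operation, *
    FROM cdc.fn_cdc_get_all_changes_dbo_Customer(@from_lsn, @to_lsn, N'all');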


The Slowly Changing Dimension transformation
This is the most commonly used approach for loading dimension tables (though not very large ones) as it supports identifying changes by comparing the source with the warehouse, performing inserts and updates, and, most importantly, handling Type 2 slowly changing dimensions. It does not require the source to maintain a high-water mark and can be easily implemented using the wizard that comes with SSIS.
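The transformation itself is configured through the SSIS designer rather than code, but the Type 2 handling it performs is roughly equivalent to the following T-SQL sketch (the tables and columns here, such as dbo.DimCustomer and dbo.StagingCustomer, are hypothetical):

    -- Expire the current row when a Type 2 attribute (City, in this sketch) has changed
    UPDATE d
    SET d.EndDate = GETDATE(), d.IsCurrent = 0
    FROM dbo.DimCustomer AS d
    JOIN dbo.StagingCustomer AS s ON s.CustomerCode = d.CustomerCode
    WHERE d.IsCurrent = 1 AND d.City <> s.City;

    -- Insert a new current version for new keys and for keys just expired above
    INSERT INTO dbo.DimCustomer (CustomerCode, Name, City, StartDate, EndDate, IsCurrent)
    SELECT s.CustomerCode, s.Name, s.City, GETDATE(), NULL, 1
    FROM dbo.StagingCustomer AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.DimCustomer AS d
                      WHERE d.CustomerCode = s.CustomerCode AND d.IsCurrent = 1);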


The MERGE statement
This T-SQL statement allows us to perform insert, update, and delete operations in a single statement combining the source and the destination (warehouse). Since the code has to be executed by the SQL Server engine, the SQL Server instance that hosts the warehouse must be able to access the source. This is possible when the source is a database on the same SQL Server instance, or when the source is accessible as a linked server. The most common practice is to extract data from the source and perform the necessary transformations with SSIS, loading the transformed data into a temporary table in the warehouse. A stored procedure in the warehouse then performs the merge using the MERGE statement.
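As a minimal sketch, assuming a hypothetical staging table dbo.StagingCustomer already loaded by SSIS and a dimension table dbo.DimCustomer, the core of such a stored procedure would look something like this:

    MERGE dbo.DimCustomer AS tgt
    USING dbo.StagingCustomer AS src
        ON tgt.CustomerCode = src.CustomerCode
    WHEN MATCHED AND (tgt.Name <> src.Name OR tgt.City <> src.City) THEN
        -- Type 1 style update: overwrite changed attributes in place
        UPDATE SET tgt.Name = src.Name, tgt.City = src.City
    WHEN NOT MATCHED BY TARGET THEN
        -- Brand new business key: insert it
        INSERT (CustomerCode, Name, City)
        VALUES (src.CustomerCode, src.Name, src.City);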


Using Checksum
This technique is used when there is no way of identifying changes at the source and the load is too heavy for the Slowly Changing Dimension transformation. It is usually implemented with a staging environment, and a checksum value is used for identifying whether a record has changed. Read more on this at: http://dinesql.blogspot.com/2015/04/incremental-etl-how-to-identify-changes.html
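A minimal sketch of the comparison, assuming the same hypothetical tables and a RowChecksum column stored in the dimension when each row was loaded (note that CHECKSUM can produce collisions, so HASHBYTES is a safer alternative for critical loads):

    -- Only rows whose computed checksum differs from the stored one have changed
    SELECT s.CustomerCode, s.Name, s.City
    FROM dbo.StagingCustomer AS s
    JOIN dbo.DimCustomer AS d ON d.CustomerCode = s.CustomerCode
    WHERE CHECKSUM(s.Name, s.City) <> d.RowChecksum;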

The Lookup transformation
This SSIS transformation is used for smaller dimensions. It is not specifically designed for this scenario, but it can be used for comparing source data with existing data in the warehouse and finding matching (and non-matching) records.
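In set-based terms, what the Lookup transformation does is roughly a left join against the dimension, with rows routed to the no-match output becoming inserts. A sketch using the same hypothetical tables:

    -- Rows with no matching dimension row (the Lookup's no-match output) are new inserts
    SELECT s.CustomerCode, s.Name, s.City
    FROM dbo.StagingCustomer AS s
    LEFT JOIN dbo.DimCustomer AS d ON d.CustomerCode = s.CustomerCode
    WHERE d.CustomerCode IS NULL;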


Considering all five above, my preferred ones are Checksum and MERGE because they offer flexibility, maintainability, and, most importantly, good performance.
