Thursday, July 28, 2016

SSIS - Fact Loading with Dimension Type II and Dimension Type I

The Dimensional Modeling in data warehousing, either using Star or Snowflake schema is still the most famous structure applied and widely used, though summarized tables using in-memory and columnstore concepts are slowly taking over the traditional method. It is very common to see that SSIS (or SQL Server Integration Services) is used for data loading, applying various patterns. Searches show different types of articles and posts but it is rare to find a complete post that talks about loading a Fact table using SSIS, looking up type I and II dimensions. That is the reason for this post, this talks about how to load a Fact Table, getting keys from both Type I and Type II dimensions.

Let's make a small data warehouse with Star Schema and populate some sample records. The following code creates a database called SalesDataWarehouse and then creates dimension tables named DimDate, DimCustomer, DimProduct, and DimEmployee. As you see, DimEmployee is a slowly changing dimension type II dimension.

  1. CREATE DATABASE SalesDataWarehouse;  
  2. GO  
  3.   
  4. -- Connect and create tables  
  5. USE SalesDataWarehouse;  
  6. GO  
  7.   
  8. CREATE TABLE dbo.DimDate  
  9. (  
  10.  DateKey int  primary key  
  11.  , [Datedate not null  
  12.  , [Monthvarchar(20) not null  
  13.  , MonthNumber smallint not null  
  14.  , QuarterName varchar(20) not null  
  15.  , QuarterNumber smallint not null  
  16.  , [Yearsmallint not null  
  17. );  
  18.   
  19. CREATE TABLE dbo.DimCustomer  
  20. (  
  21.  CustomerKey int identity(1,1) primary key  
  22.  , FullName varchar(100) not null  
  23.  , Town varchar(100) not null  
  24.  , SourceKey int not null  
  25. );  
  26.   
  27. CREATE TABLE dbo.DimProduct  
  28. (  
  29.  ProductKey int identity(1,1) primary key  
  30.  , Product varchar(100) not null  
  31.  , Brand varchar(100) not null  
  32.  , SourceKey int not null  
  33. );  
  34.   
  35. CREATE TABLE dbo.DimEmployee  
  36. (  
  37.  EmployeeKey int identity(1,1) primary key  
  38.  , Name varchar(100) not null  
  39.  , SalesTown varchar(100) not null  
  40.  , StartDate date not null  
  41.  , EndDate date  null  
  42.  , IsCurrent bit not null default (1)  
  43.  , SourceKey int not null  
  44. );  

The next code is for loading some sample records. Note that DimDate is loaded with WideWorldImportersDW database as its date dimension table contains records for the current year. However, this can be loaded using a simple script as well.

Have a look on records related to DimEmployee. Since it is a SCD Type II dimension, it holds history records. As you see, there are two records to James Phil with validity period.

  1. -- Inserting date records from AdventureWorks table  
  2. INSERT INTO dbo.DimDate  
  3.  (DateKey, [Date], [Month], MonthNumber, QuarterName, QuarterNumber, [Year])  
  4. SELECT CONVERT(intCONVERT(varchar(8), [Date], 112)), [Date], [Month], [Calendar Month Number]  
  5.  , 'Quarter ' + CONVERT(varchar(2), DATEPART(Q, [Date])), DATEPART(Q, [Date]), [Calendar Year]  
  6. FROM WideWorldImportersDW.Dimension.Date;  
  7.   
  8. -- Inserting records to Customer  
  9. INSERT INTO dbo.DimCustomer   
  10.  (FullName, Town, SourceKey)  
  11. VALUES  
  12.  ('Jane Knight'' Town 1', 100)  
  13.  , ('Jack walter''Town 2', 101);  
  14.   
  15. -- Inserting records to Product  
  16. INSERT INTO dbo.DimProduct   
  17.  (Product, Brand, SourceKey)  
  18. VALUES  
  19.  ('Product 1'' Brand 1', 22)  
  20.  , ('Product 2''Brand 3', 23);  
  21.   
  22. -- Inserting records to Customer  
  23. INSERT INTO dbo.DimEmployee  
  24.  (Name, SalesTown, StartDate, EndDate, IsCurrent, SourceKey)  
  25. VALUES  
  26.  ('Peter Kevin''Town 1''2016-06-01'null, 1, 200)  
  27.  , ('James Phil''Town 2''2016-06-01''2016-06-30', 0, 201)  
  28.  , ('James Phil''Town 3''2016-06-30'null, 1, 201);  


Next code is for creating the Fact table. Generally, Fact Table is a partitioned table. Therefore, this code creates a partition function and scheme for it. Note that, the best practice is, have different File Groups for partitions but let's just use PRIMARY (I will be discussing partition switching on fact loading with another post, extending the same code).

  1. -- creating the partition  
  2. CREATE PARTITION FUNCTION pf_SalesDate (int)  
  3. AS RANGE RIGHT  
  4. FOR VALUES (20160101, 20160201, 20160301, 20160401, 20160501, 20160601, 20160701)  
  5. GO  
  6.   
  7. -- creating the scheme  
  8. CREATE PARTITION SCHEME ps_SalesDate  
  9. AS PARTITION pf_SalesDate  
  10. TO ([PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY])  
  11. GO  
  12.   
  13. CREATE TABLE FactSales  
  14. (  
  15.  DateKey int NOT NULL INDEX IX_FactSales CLUSTERED,  
  16.  ProductKey int NOT NULL,  
  17.  CustomerKey int NOT NULL,  
  18.  EmployeeKey int NOT NULL,  
  19.  SalesAmount money NOT NULL,  
  20.  SalesQuantity smallint NOT NULL  
  21. )  
  22. ON ps_SalesDate (DateKey)  
  23. GO  

Let's insert some sample records into the Fact Table too.

  1. INSERT INTO FactSales (DateKey, ProductKey, CustomerKey, EmployeeKey, SalesAmount, SalesQuantity)  
  2. VALUES  
  3.  (20160605, 1, 1, 1, 500, 2)  
  4.  , (20160615, 2, 2, 1, 500, 2)  
  5.  , (20160622, 2, 2, 2, 500, 2)  
  6.  , (20160624, 2, 1, 1, 500, 2)  

Done, now we have to create a SSIS package for loading new records. As you noted, we have records related to June 2016, and we will be loading records related to July 2016. Let's use following recordset for loading the Fact Table.

  1. SELECT CONVERT(date'2016-07-01'As TransactionDate, 22 AS ProductId, 100 AS CustomerId, 200 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity  
  2. UNION ALL  
  3. SELECT CONVERT(date'2016-07-12'As TransactionDate, 23 AS ProductId, 101 AS CustomerId, 200 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity  
  4. UNION ALL  
  5. SELECT CONVERT(date'2016-07-14'As TransactionDate, 23 AS ProductId, 100 AS CustomerId, 201 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity  
  6. UNION ALL  
  7. SELECT CONVERT(date'2016-07-22'As TransactionDate, 22 AS ProductId, 100 AS CustomerId, 201 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity  


Let's create a SSIS package and add a connection to the data warehouse. I have a database named Sales and will be using it as the source even though we do not extract records from the source. As you see, I have a DataFlow Task in the Control Flow.


Since we do not get surrogate key values from the source, we need to have Lookups for dimension tables for getting surrogate key values. As you see with the image below, I have added OLEDB Source to connect with Sales and four of Lookups for getting surrogate key values related to Date, Customer, Employee and Product.


Let's have a look on source first. It is configured with Sales database and the query above is used for populating required records.


There are three Type I type dimensions. Let's have a look on one and see how it needs to be checked. In order to get DateKey from the DimDate dimension, we need to look up DimDate by using TransactionDate that comes with the source. We need to configure the lookup to DataWarehouse connection and select the lookup table as DimDate. Once the connection is set, we need to link TransactionDate coming from the source with Date in the dimension. We need to select DateKey as it is the one we need when we insert records to the Fact Table. The selected DateKey is added to the flow automatically.


The similar configuration is required for DimCustomer and DimProduct for getting surrogate key values. We need to link CustomerID of the source to SourceKey in the DimCustomer table and get the CustomerKey. We need to link ProductID of the source to SourceKey in the DimProduct table and get the ProductKey. However, getting EmployeeKey is different as it is as SCD Type II dimension. Let's have a look on the configuration of Employee Lookup.

I will take you through the steps required for configuring this. First we need to make sure that Cache mode is set to either Partial Cache or No Cache because we are going to parameterize the lookup query.


Next is setting up the connection. We need to connect it to DataWarehouse and use a query for loading dimension records. Note the way it has been written. I have used question marks for parameterizing the query as we need to pick the right record related to the TransactionDate.


Just like the other dimensions, we need to match two columns from the source and dimension to find the record. However, additional link is required for getting values for parameters, hence need to link TransactionDate with StartDate though that is not the exact link we need.


You will see the reason for joining TransactionDate and StartDate with next step. We need to adjust the lookup query now, need to do it with Advanced section. It is required to adjust the query like below.


Once adjusted, need to set parameters like below. It allows us to select TransactionDate as a parameter value because the above join. Note that parameters a 0-based indexed.


All done, finally it needs the destination. Set OLEDB Destination for loading the FactSales as the last step the of the DataFlow task. Once all configurations are done, you can run the package and see how the FactSales has been loaded.




No comments: