
Friday, July 29, 2016

Fact Loading: Switching a loaded temporary table into an empty partition of the fact table

Fact tables in data warehouses are usually partitioned, and in most solutions the partitioning is on a date key. Generally, we use SSIS for loading dimension and fact tables, and the ETL written for loading the fact table can be optimized by loading data into a separate table and switching it into an empty partition of the fact table. This is one of the best practices in data warehouse loading, and the technique is not complex.

Assume you have a fact table that is partitioned by month, for example, one partition for 201601, another for 201602, and so on, and data is loaded up to July 2016 (201607). When we load August data, we can load it into a new table that has the same structure and then switch that table into the 201608 partition of the fact table, optimizing the load and minimizing the impact on the fact table. The below diagram shows it;


Here is the way of doing it. I will take the same fact table created with this post: SSIS - Fact Loading with Dimension Type II and Dimension Type I. You can use the same code for creating the dimension and fact tables to try this out. In that post, I loaded data related to June and July 2016. Let's see how we can load August data using the technique discussed in this post.

1. Let's take this data set for loading August data.

SELECT CONVERT(date, '2016-08-01') As TransactionDate, 22 AS ProductId, 100 AS CustomerId, 200 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity
UNION ALL
SELECT CONVERT(date, '2016-08-12') As TransactionDate, 23 AS ProductId, 101 AS CustomerId, 200 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity
UNION ALL
SELECT CONVERT(date, '2016-08-14') As TransactionDate, 23 AS ProductId, 100 AS CustomerId, 201 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity
UNION ALL
SELECT CONVERT(date, '2016-08-22') As TransactionDate, 22 AS ProductId, 100 AS CustomerId, 201 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity

2. We need a different table with the same structure for loading processed data. The code below creates another table for holding data temporarily. Note that it has the same columns and is partitioned on the same scheme as the FactSales table; identical structure and aligned partitioning are requirements for switching partitions later.

CREATE TABLE FactSalesTemp
(
 DateKey int NOT NULL INDEX IX_FactSalesTemp CLUSTERED,
 ProductKey int NOT NULL,
 CustomerKey int NOT NULL,
 EmployeeKey int NOT NULL,
 SalesAmount money NOT NULL,
 SalesQuantity smallint NOT NULL
)
ON ps_SalesDate (DateKey)
GO

3. I will be using the same SSIS package used with my previous post. As you see, the destination is now set to the newly created table, FactSalesTemp.


4. Before loading data into the new table, let's add another partition to both tables for August data. All we need to do is designate the next used filegroup on the scheme and add a boundary value to the function. Note that the split is best done while the new range is still empty, as splitting a populated partition causes data movement;

-- Designate the filegroup to be used by the next partition
ALTER PARTITION SCHEME ps_SalesDate
NEXT USED [PRIMARY]
GO

-- splitting the last partition by adding another boundary value
ALTER PARTITION FUNCTION pf_SalesDate ()
SPLIT RANGE (20160801)
GO
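
To confirm that the new boundary has been added, the catalog views can be queried. This check is mine, not part of the original post; any equivalent query works:

-- Checking boundary values of the partition function
SELECT prv.boundary_id, prv.value
FROM sys.partition_range_values prv
 INNER JOIN sys.partition_functions pf
  ON pf.function_id = prv.function_id
WHERE pf.name = 'pf_SalesDate'
ORDER BY prv.boundary_id;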

5. Now we can execute the SSIS package and load data into the newly created table. Once the data set is loaded into FactSalesTemp, we can check both tables and see how the partitions are filled.
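
One way of checking partition usage is via sys.partitions (a query I am adding for convenience; it is not from the original post):

-- Row counts per partition for both tables
SELECT OBJECT_NAME(p.object_id) AS TableName, p.partition_number, p.rows
FROM sys.partitions p
WHERE p.object_id IN (OBJECT_ID('dbo.FactSales'), OBJECT_ID('dbo.FactSalesTemp'))
 AND p.index_id IN (0, 1) -- heap or clustered index only
ORDER BY TableName, p.partition_number;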

How partitions are filled;


How tables are filled;


6. As you see, data is loaded into the newly created table, and the rows are in partition 9. Now we need to switch the new table into the 9th partition of FactSales. Remember that switching is a metadata-only operation: both tables must have identical structures, the partitions involved must be on the same filegroup, and the target partition must be empty. Here is the way of doing it.

ALTER TABLE dbo.FactSalesTemp 
SWITCH PARTITION 9 TO dbo.FactSales PARTITION 9;

7. Now if you check the records, you will see that FactSales is loaded with the new data, and that it has gone into the correct partition.


You can have the scheme and function alterations in the Control Flow as an Execute SQL Task placed before the Data Flow Task, and the partition switch in another Execute SQL Task just after the Data Flow Task, completing the SSIS package. After the switch, FactSalesTemp is empty again and ready for the next load.

Thursday, July 28, 2016

SSIS - Fact Loading with Dimension Type II and Dimension Type I

Dimensional modeling in data warehousing, whether using a Star or Snowflake schema, is still the most widely applied structure, though summarized tables using in-memory and columnstore concepts are slowly taking over the traditional method. It is very common to see SSIS (SQL Server Integration Services) used for data loading, applying various patterns. Searches show different types of articles and posts, but it is rare to find a complete post that talks about loading a fact table using SSIS, looking up both Type I and Type II dimensions. That is the reason for this post; it talks about how to load a fact table, getting keys from both Type I and Type II dimensions.

Let's make a small data warehouse with a Star Schema and populate some sample records. The following code creates a database called SalesDataWarehouse and then creates dimension tables named DimDate, DimCustomer, DimProduct, and DimEmployee. As you see, DimEmployee is a Type II slowly changing dimension.

CREATE DATABASE SalesDataWarehouse;
GO

-- Connect and create tables
USE SalesDataWarehouse;
GO

CREATE TABLE dbo.DimDate
(
 DateKey int  primary key
 , [Date] date not null
 , [Month] varchar(20) not null
 , MonthNumber smallint not null
 , QuarterName varchar(20) not null
 , QuarterNumber smallint not null
 , [Year] smallint not null
);

CREATE TABLE dbo.DimCustomer
(
 CustomerKey int identity(1,1) primary key
 , FullName varchar(100) not null
 , Town varchar(100) not null
 , SourceKey int not null
);

CREATE TABLE dbo.DimProduct
(
 ProductKey int identity(1,1) primary key
 , Product varchar(100) not null
 , Brand varchar(100) not null
 , SourceKey int not null
);

CREATE TABLE dbo.DimEmployee
(
 EmployeeKey int identity(1,1) primary key
 , Name varchar(100) not null
 , SalesTown varchar(100) not null
 , StartDate date not null
 , EndDate date  null
 , IsCurrent bit not null default (1)
 , SourceKey int not null
);

The next code is for loading some sample records. Note that DimDate is loaded from the WideWorldImportersDW database, as its date dimension table contains records for the current year. However, this can be loaded using a simple script as well.
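
For example, the following script (my sketch, not from the original post) loads the 2016 dates without depending on WideWorldImportersDW; use it instead of the date insert below if that sample database is not available:

-- Loading DimDate with a recursive CTE (sketch: one row per day of 2016)
WITH Dates AS
(
 SELECT CONVERT(date, '2016-01-01') AS [Date]
 UNION ALL
 SELECT DATEADD(day, 1, [Date]) FROM Dates WHERE [Date] < '2016-12-31'
)
INSERT INTO dbo.DimDate
 (DateKey, [Date], [Month], MonthNumber, QuarterName, QuarterNumber, [Year])
SELECT CONVERT(int, CONVERT(varchar(8), [Date], 112)), [Date]
 , DATENAME(month, [Date]), MONTH([Date])
 , 'Quarter ' + CONVERT(varchar(2), DATEPART(Q, [Date])), DATEPART(Q, [Date]), YEAR([Date])
FROM Dates
OPTION (MAXRECURSION 366);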

Have a look at the records related to DimEmployee. Since it is an SCD Type II dimension, it holds history records. As you see, there are two records for James Phil, each with its own validity period.

-- Inserting date records from the WideWorldImportersDW date dimension
INSERT INTO dbo.DimDate
 (DateKey, [Date], [Month], MonthNumber, QuarterName, QuarterNumber, [Year])
SELECT CONVERT(int, CONVERT(varchar(8), [Date], 112)), [Date], [Month], [Calendar Month Number]
 , 'Quarter ' + CONVERT(varchar(2), DATEPART(Q, [Date])), DATEPART(Q, [Date]), [Calendar Year]
FROM WideWorldImportersDW.Dimension.Date;

-- Inserting records to Customer
INSERT INTO dbo.DimCustomer 
 (FullName, Town, SourceKey)
VALUES
 ('Jane Knight', 'Town 1', 100)
 , ('Jack Walter', 'Town 2', 101);

-- Inserting records to Product
INSERT INTO dbo.DimProduct 
 (Product, Brand, SourceKey)
VALUES
 ('Product 1', 'Brand 1', 22)
 , ('Product 2', 'Brand 3', 23);

-- Inserting records to Employee
INSERT INTO dbo.DimEmployee
 (Name, SalesTown, StartDate, EndDate, IsCurrent, SourceKey)
VALUES
 ('Peter Kevin', 'Town 1', '2016-06-01', null, 1, 200)
 , ('James Phil', 'Town 2', '2016-06-01', '2016-06-30', 0, 201)
 , ('James Phil', 'Town 3', '2016-06-30', null, 1, 201);


The next code is for creating the fact table. Generally, a fact table is a partitioned table; therefore, this code creates a partition function and scheme for it. Note that the best practice is to have different filegroups for the partitions, but let's just use PRIMARY (I will be discussing partition switching on fact loading in another post, extending the same code).

-- creating the partition
CREATE PARTITION FUNCTION pf_SalesDate (int)
AS RANGE RIGHT
FOR VALUES (20160101, 20160201, 20160301, 20160401, 20160501, 20160601, 20160701)
GO

-- creating the scheme
CREATE PARTITION SCHEME ps_SalesDate
AS PARTITION pf_SalesDate
TO ([PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY])
GO

CREATE TABLE FactSales
(
 DateKey int NOT NULL INDEX IX_FactSales CLUSTERED,
 ProductKey int NOT NULL,
 CustomerKey int NOT NULL,
 EmployeeKey int NOT NULL,
 SalesAmount money NOT NULL,
 SalesQuantity smallint NOT NULL
)
ON ps_SalesDate (DateKey)
GO
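
Since the function is defined as RANGE RIGHT, each boundary value becomes the first value of its partition: values from 20160601 up to (but not including) 20160701 fall into the 7th partition, and anything from 20160701 onwards falls into the 8th. If you want to verify the mapping, the $PARTITION function can be used (a quick check I am adding; it is not in the original post):

-- Checking which partitions given DateKey values map to
SELECT $PARTITION.pf_SalesDate(20160605) AS JuneKeyPartition -- returns 7
 , $PARTITION.pf_SalesDate(20160705) AS JulyKeyPartition; -- returns 8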

Let's insert some sample records into the Fact Table too.

INSERT INTO FactSales (DateKey, ProductKey, CustomerKey, EmployeeKey, SalesAmount, SalesQuantity)
VALUES
 (20160605, 1, 1, 1, 500, 2)
 , (20160615, 2, 2, 1, 500, 2)
 , (20160622, 2, 2, 2, 500, 2)
 , (20160624, 2, 1, 1, 500, 2)

Done. Now we have to create an SSIS package for loading new records. As you noted, we already have records related to June 2016, and we will be loading records related to July 2016. Let's use the following recordset for loading the fact table.

SELECT CONVERT(date, '2016-07-01') As TransactionDate, 22 AS ProductId, 100 AS CustomerId, 200 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity
UNION ALL
SELECT CONVERT(date, '2016-07-12') As TransactionDate, 23 AS ProductId, 101 AS CustomerId, 200 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity
UNION ALL
SELECT CONVERT(date, '2016-07-14') As TransactionDate, 23 AS ProductId, 100 AS CustomerId, 201 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity
UNION ALL
SELECT CONVERT(date, '2016-07-22') As TransactionDate, 22 AS ProductId, 100 AS CustomerId, 201 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity


Let's create an SSIS package and add a connection to the data warehouse. I have a database named Sales and will be using it as the source, even though we do not actually extract records from its tables. As you see, I have a Data Flow Task in the Control Flow.


Since we do not get surrogate key values from the source, we need Lookups against the dimension tables to get them. As you see in the image below, I have added an OLE DB Source to connect with Sales and four Lookups for getting the surrogate key values related to Date, Customer, Employee, and Product.


Let's have a look at the source first. It is configured with the Sales database, and the query above is used for populating the required records.


There are three Type I dimensions. Let's have a look at one and see how it needs to be configured. In order to get DateKey from the DimDate dimension, we need to look up DimDate using the TransactionDate that comes from the source. We need to point the Lookup to the DataWarehouse connection and select DimDate as the lookup table. Once the connection is set, we need to link TransactionDate coming from the source with Date in the dimension, and select DateKey as it is the value we need when inserting records into the fact table. The selected DateKey is added to the flow automatically.
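
Conceptually, this Lookup performs the equivalent of the following join (illustrative SQL only; SourceRecords is a hypothetical stand-in for the rows coming down the pipeline):

-- What the Date Lookup does, expressed as a join
SELECT s.*, d.DateKey
FROM SourceRecords s
 INNER JOIN dbo.DimDate d
  ON d.[Date] = s.TransactionDate;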


A similar configuration is required for DimCustomer and DimProduct. We need to link CustomerId from the source to SourceKey in the DimCustomer table and get CustomerKey, and link ProductId from the source to SourceKey in the DimProduct table and get ProductKey. However, getting EmployeeKey is different as DimEmployee is an SCD Type II dimension. Let's have a look at the configuration of the Employee Lookup.

I will take you through the steps required for configuring this. First, we need to make sure that the Cache mode is set to either Partial Cache or No Cache, because we are going to parameterize the lookup query.


Next is setting up the connection. We need to connect it to DataWarehouse and use a query for loading dimension records. Note the way it has been written; I have used question marks for parameterizing the query, as we need to pick the right record for the TransactionDate.


Just like the other dimensions, we need to match two columns from the source and the dimension to find the record. However, an additional link is required for supplying values to the parameters; hence, we need to link TransactionDate with StartDate, though that is not the exact relationship we want.


You will see the reason for joining TransactionDate and StartDate in the next step. We now need to adjust the lookup query, which is done in the Advanced section. The query needs to be adjusted as shown below.
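
The screenshot aside, the adjusted query is along these lines (my reconstruction from the column names used earlier; the exact text in the post's image may differ), picking the employee record whose validity period covers the transaction date:

-- Parameterized lookup query for the Type II dimension (reconstruction)
SELECT EmployeeKey, SourceKey, StartDate
FROM dbo.DimEmployee
WHERE SourceKey = ?
 AND StartDate <= ?
 AND (EndDate IS NULL OR EndDate > ?);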


Once the query is adjusted, we need to set the parameters as below. The join made above is what allows us to select TransactionDate as a parameter value. Note that parameters are 0-based indexed.


All done; finally it needs the destination. Set an OLE DB Destination for loading FactSales as the last step of the Data Flow Task. Once all configurations are done, you can run the package and see how FactSales has been loaded.




Saturday, May 23, 2015

Fact Table: Is a clustered index on the date key enough, or should partitioning be done as well?

A fact table in a data warehouse (or a large table in an OLTP database) always holds millions of records, and we keep adding records in large numbers. Generally, it is discouraged to add indexes and constraints to fact tables, but with the modern way of implementing data warehouse / BI solutions, where the table is open to business users for analysis, having indexes on fact tables would be beneficial. However, whether it is a traditional or a modern implementation, it is still recommended to have a clustered index on the key most frequently used for analysis and loading, which is the date key.

While discussing design considerations for fact tables, a question came up: is a clustered index on the date key enough, or should we partition the table on the date key too? Do we get additional performance benefits by partitioning on the same key?

In terms of performance, if the table is not significantly large, I doubt that having both a clustered index on the date key and partitioning on the same key will give extra benefits. But if it is a very large table, the performance of data retrieval will surely improve. Does this mean that partitioning is not required for fact tables?

Let's analyze with a reasonably large table. The code below shows the structure of an InternetSales table that contains 7 million records. It has no indexes, not even a clustered index.

-- Table structure
CREATE TABLE dbo.InternetSales(
 OrderDateKey int NOT NULL,
 ProductKey int NOT NULL,
 DueDateKey int NOT NULL,
 ShipDateKey int NOT NULL,
 CustomerKey int NOT NULL,
 PromotionKey int NOT NULL,
 CurrencyKey int NOT NULL,
 SalesTerritoryKey int NOT NULL,
 SalesOrderNumber nvarchar(20) NOT NULL,
 SalesOrderLineNumber tinyint NOT NULL,
 RevisionNumber tinyint NOT NULL,
 OrderQuantity smallint NOT NULL,
 UnitPrice money NOT NULL,
 ExtendedAmount money NOT NULL,
 UnitPriceDiscountPct float NOT NULL,
 DiscountAmount float NOT NULL,
 ProductStandardCost money NOT NULL,
 TotalProductCost money NOT NULL,
 SalesAmount money NOT NULL,
 TaxAmt money NOT NULL,
 Freight money NOT NULL,
 CarrierTrackingNumber nvarchar(25) NULL,
 CustomerPONumber nvarchar(25) NULL,
 OrderDate datetime NULL,
 DueDate datetime NULL,
 ShipDate datetime NULL
) ON [PRIMARY];
GO
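
The post does not show how the 7 million records were produced. One way (my assumption, not from the post) is to insert the rows of AdventureWorksDW's FactInternetSales, whose column list matches the structure above, repeatedly using the SSMS batch repeater:

-- Populating ~7M rows by repeating the ~60K-row FactInternetSales
-- (assumes AdventureWorksDW2014 is available; GO 120 is the SSMS batch repeater)
INSERT INTO dbo.InternetSales
SELECT OrderDateKey, ProductKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey
 , CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber
 , RevisionNumber, OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct
 , DiscountAmount, ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt
 , Freight, CarrierTrackingNumber, CustomerPONumber, OrderDate, DueDate, ShipDate
FROM AdventureWorksDW2014.dbo.FactInternetSales;
GO 120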

If a business user runs a query like the one below against the table, you can see the number of pages (IO) that must be read to satisfy the query.

SET STATISTICS IO ON
GO

SELECT 
 DATENAME(m,[OrderDate])
 , p.EnglishProductName Product
  , SUM(SalesAmount) AS TotalSales
  , SUM(TaxAmt) AS TotalTaxAmount
FROM dbo.InternetSales f
 INNER JOIN dbo.DimProduct p
  ON f.ProductKey = p.ProductKey
 INNER JOIN dbo.DimDate d
  ON f.OrderDateKey = d.DateKey
WHERE d.CalendarYear = 2013 
 AND d.EnglishDayNameOfWeek = 'Monday'
GROUP BY DATENAME(m,[OrderDate])
 , p.EnglishProductName;


If we add a clustered index on the date key and then run the same query;

CREATE CLUSTERED INDEX ix_InternetSales
ON InternetSales (OrderDateKey);
GO

SELECT 
 DATENAME(m,[OrderDate])
 , p.EnglishProductName Product
  , SUM(SalesAmount) AS TotalSales
  , SUM(TaxAmt) AS TotalTaxAmount
FROM dbo.InternetSales f
 INNER JOIN dbo.DimProduct p
  ON f.ProductKey = p.ProductKey
 INNER JOIN dbo.DimDate d
  ON f.OrderDateKey = d.DateKey
WHERE d.CalendarYear = 2013 
 AND d.EnglishDayNameOfWeek = 'Monday'
GROUP BY DATENAME(m,[OrderDate])
 , p.EnglishProductName;

We will see a significant difference between the number of pages read previously and the number of pages read now.



Now the question is whether partitioning this table will give more benefits. Let's partition it and see.
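
Note that the scheme below places partitions on two filegroups, HistoryFileGroup and CurrentFileGroup, which must exist beforehand. If they do not, something like this creates them (a sketch; the database name, file names, and paths are placeholders to adjust):

-- Creating the filegroups referenced by the scheme (placeholder names/paths)
ALTER DATABASE YourDatabase ADD FILEGROUP HistoryFileGroup;
ALTER DATABASE YourDatabase ADD FILE
 (NAME = 'HistoryData', FILENAME = 'D:\Data\HistoryData.ndf', SIZE = 100MB)
 TO FILEGROUP HistoryFileGroup;
ALTER DATABASE YourDatabase ADD FILEGROUP CurrentFileGroup;
ALTER DATABASE YourDatabase ADD FILE
 (NAME = 'CurrentData', FILENAME = 'D:\Data\CurrentData.ndf', SIZE = 100MB)
 TO FILEGROUP CurrentFileGroup;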

-- Create a partition function
CREATE PARTITION FUNCTION pf_InternetSalesDate (int)
AS RANGE RIGHT
FOR VALUES (20100101, 20110101, 20120101, 20130101, 20140101)
GO

-- Create the scheme
CREATE PARTITION SCHEME ps_InternetSalesDate
AS PARTITION pf_InternetSalesDate
TO (HistoryFileGroup, HistoryFileGroup, HistoryFileGroup, HistoryFileGroup, HistoryFileGroup, CurrentFileGroup)
GO

-- Dropping clustered index for partitioning
DROP INDEX ix_InternetSales ON InternetSales;
GO

-- Recreating with Partitioning scheme
CREATE CLUSTERED INDEX ix_InternetSales
ON InternetSales (OrderDateKey)
ON ps_InternetSalesDate (OrderDateKey);
GO

-- And executing the same Business user's query
SELECT 
 DATENAME(m,[OrderDate])
 , p.EnglishProductName Product
  , SUM(SalesAmount) AS TotalSales
  , SUM(TaxAmt) AS TotalTaxAmount
FROM dbo.InternetSales f
 INNER JOIN dbo.DimProduct p
  ON f.ProductKey = p.ProductKey
 INNER JOIN dbo.DimDate d
  ON f.OrderDateKey = d.DateKey
WHERE d.CalendarYear = 2013 
 AND d.EnglishDayNameOfWeek = 'Monday'
GROUP BY DATENAME(m,[OrderDate])
 , p.EnglishProductName;

Here is the result of it;


This result does not show much improvement compared with the previous result; hence, one could conclude that partitioning is not required when a clustered index is created on the date key. But even with this table we see a slight improvement, which means a truly large table will definitely benefit from partitioning.

Okay, now what about this table? Should we keep partitioning or not? I prefer to have both implementations, because partitioning gives a few more benefits in addition to the performance benefit it offers. It gives more granular manageability, letting us perform operations such as indexing, compression, and backing up at the partition level instead of on the whole table. It improves data loading, since loading can be managed at the partition level. Considering all of these, should we not create the clustered index on the date key as well as partition the table on the date key?
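
As an example of that granular manageability (my illustration, not from the post), maintenance operations can target a single partition:

-- Rebuild only the latest partition of the clustered index
ALTER INDEX ix_InternetSales ON dbo.InternetSales
REBUILD PARTITION = 6;

-- Compress only the oldest partition
ALTER TABLE dbo.InternetSales
REBUILD PARTITION = 1 WITH (DATA_COMPRESSION = PAGE);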