Dinesh's Blog :::: Being Compiled ::::: Dimensional Modeling

Showing posts with label Dimensional Modeling. Show all posts

Thursday, July 28, 2016

SSIS - Fact Loading with Dimension Type II and Dimension Type I

The Dimensional Modeling in data warehousing, either using Star or Snowflake schema is still the most famous structure applied and widely used, though summarized tables using in-memory and columnstore concepts are slowly taking over the traditional method. It is very common to see that SSIS (or SQL Server Integration Services) is used for data loading, applying various patterns. Searches show different types of articles and posts but it is rare to find a complete post that talks about loading a Fact table using SSIS, looking up type I and II dimensions. That is the reason for this post, this talks about how to load a Fact Table, getting keys from both Type I and Type II dimensions.

Let's make a small data warehouse with Star Schema and populate some sample records. The following code creates a database called SalesDataWarehouse and then creates dimension tables named DimDate, DimCustomer, DimProduct, and DimEmployee. As you see, DimEmployee is a slowly changing dimension type II dimension.

CREATE DATABASE SalesDataWarehouse;
GO

-- Connect and create tables
USE SalesDataWarehouse;
GO

CREATE TABLE dbo.DimDate
(
 DateKey int  primary key
 , [Date] date not null
 , [Month] varchar(20) not null
 , MonthNumber smallint not null
 , QuarterName varchar(20) not null
 , QuarterNumber smallint not null
 , [Year] smallint not null
);

CREATE TABLE dbo.DimCustomer
(
 CustomerKey int identity(1,1) primary key
 , FullName varchar(100) not null
 , Town varchar(100) not null
 , SourceKey int not null
);

CREATE TABLE dbo.DimProduct
(
 ProductKey int identity(1,1) primary key
 , Product varchar(100) not null
 , Brand varchar(100) not null
 , SourceKey int not null
);

CREATE TABLE dbo.DimEmployee
(
 EmployeeKey int identity(1,1) primary key
 , Name varchar(100) not null
 , SalesTown varchar(100) not null
 , StartDate date not null
 , EndDate date  null
 , IsCurrent bit not null default (1)
 , SourceKey int not null
);

The next code is for loading some sample records. Note that DimDate is loaded with WideWorldImportersDW database as its date dimension table contains records for the current year. However, this can be loaded using a simple script as well.

Have a look on records related to DimEmployee. Since it is a SCD Type II dimension, it holds history records. As you see, there are two records to James Phil with validity period.

-- Inserting date records from AdventureWorks table
INSERT INTO dbo.DimDate
 (DateKey, [Date], [Month], MonthNumber, QuarterName, QuarterNumber, [Year])
SELECT CONVERT(int, CONVERT(varchar(8), [Date], 112)), [Date], [Month], [Calendar Month Number]
 , 'Quarter ' + CONVERT(varchar(2), DATEPART(Q, [Date])), DATEPART(Q, [Date]), [Calendar Year]
FROM WideWorldImportersDW.Dimension.Date;

-- Inserting records to Customer
INSERT INTO dbo.DimCustomer 
 (FullName, Town, SourceKey)
VALUES
 ('Jane Knight', ' Town 1', 100)
 , ('Jack walter', 'Town 2', 101);

-- Inserting records to Product
INSERT INTO dbo.DimProduct 
 (Product, Brand, SourceKey)
VALUES
 ('Product 1', ' Brand 1', 22)
 , ('Product 2', 'Brand 3', 23);

-- Inserting records to Customer
INSERT INTO dbo.DimEmployee
 (Name, SalesTown, StartDate, EndDate, IsCurrent, SourceKey)
VALUES
 ('Peter Kevin', 'Town 1', '2016-06-01', null, 1, 200)
 , ('James Phil', 'Town 2', '2016-06-01', '2016-06-30', 0, 201)
 , ('James Phil', 'Town 3', '2016-06-30', null, 1, 201);

Next code is for creating the Fact table. Generally, Fact Table is a partitioned table. Therefore, this code creates a partition function and scheme for it. Note that, the best practice is, have different File Groups for partitions but let's just use PRIMARY (I will be discussing partition switching on fact loading with another post, extending the same code).

-- creating the partition
CREATE PARTITION FUNCTION pf_SalesDate (int)
AS RANGE RIGHT
FOR VALUES (20160101, 20160201, 20160301, 20160401, 20160501, 20160601, 20160701)
GO

-- creating the scheme
CREATE PARTITION SCHEME ps_SalesDate
AS PARTITION pf_SalesDate
TO ([PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY])
GO

CREATE TABLE FactSales
(
 DateKey int NOT NULL INDEX IX_FactSales CLUSTERED,
 ProductKey int NOT NULL,
 CustomerKey int NOT NULL,
 EmployeeKey int NOT NULL,
 SalesAmount money NOT NULL,
 SalesQuantity smallint NOT NULL
)
ON ps_SalesDate (DateKey)
GO

Let's insert some sample records into the Fact Table too.

INSERT INTO FactSales (DateKey, ProductKey, CustomerKey, EmployeeKey, SalesAmount, SalesQuantity)
VALUES
 (20160605, 1, 1, 1, 500, 2)
 , (20160615, 2, 2, 1, 500, 2)
 , (20160622, 2, 2, 2, 500, 2)
 , (20160624, 2, 1, 1, 500, 2)

Done, now we have to create a SSIS package for loading new records. As you noted, we have records related to June 2016, and we will be loading records related to July 2016. Let's use following recordset for loading the Fact Table.

SELECT CONVERT(date, '2016-07-01') As TransactionDate, 22 AS ProductId, 100 AS CustomerId, 200 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity
UNION ALL
SELECT CONVERT(date, '2016-07-12') As TransactionDate, 23 AS ProductId, 101 AS CustomerId, 200 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity
UNION ALL
SELECT CONVERT(date, '2016-07-14') As TransactionDate, 23 AS ProductId, 100 AS CustomerId, 201 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity
UNION ALL
SELECT CONVERT(date, '2016-07-22') As TransactionDate, 22 AS ProductId, 100 AS CustomerId, 201 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity

Let's create a SSIS package and add a connection to the data warehouse. I have a database named Sales and will be using it as the source even though we do not extract records from the source. As you see, I have a DataFlow Task in the Control Flow.

Since we do not get surrogate key values from the source, we need to have Lookups for dimension tables for getting surrogate key values. As you see with the image below, I have added OLEDB Source to connect with Sales and four of Lookups for getting surrogate key values related to Date, Customer, Employee and Product.

Let's have a look on source first. It is configured with Sales database and the query above is used for populating required records.

There are three Type I type dimensions. Let's have a look on one and see how it needs to be checked. In order to get DateKey from the DimDate dimension, we need to look up DimDate by using TransactionDate that comes with the source. We need to configure the lookup to DataWarehouse connection and select the lookup table as DimDate. Once the connection is set, we need to link TransactionDate coming from the source with Date in the dimension. We need to select DateKey as it is the one we need when we insert records to the Fact Table. The selected DateKey is added to the flow automatically.

The similar configuration is required for DimCustomer and DimProduct for getting surrogate key values. We need to link CustomerID of the source to SourceKey in the DimCustomer table and get the CustomerKey. We need to link ProductID of the source to SourceKey in the DimProduct table and get the ProductKey. However, getting EmployeeKey is different as it is as SCD Type II dimension. Let's have a look on the configuration of Employee Lookup.

I will take you through the steps required for configuring this. First we need to make sure that Cache mode is set to either Partial Cache or No Cache because we are going to parameterize the lookup query.

Next is setting up the connection. We need to connect it to DataWarehouse and use a query for loading dimension records. Note the way it has been written. I have used question marks for parameterizing the query as we need to pick the right record related to the TransactionDate.

Just like the other dimensions, we need to match two columns from the source and dimension to find the record. However, additional link is required for getting values for parameters, hence need to link TransactionDate with StartDate though that is not the exact link we need.

You will see the reason for joining TransactionDate and StartDate with next step. We need to adjust the lookup query now, need to do it with Advanced section. It is required to adjust the query like below.

Once adjusted, need to set parameters like below. It allows us to select TransactionDate as a parameter value because the above join. Note that parameters a 0-based indexed.

All done, finally it needs the destination. Set OLEDB Destination for loading the FactSales as the last step the of the DataFlow task. Once all configurations are done, you can run the package and see how the FactSales has been loaded.

Sunday, April 6, 2014

Reasons for adding a surrogate key for dimension tables

As a best practice, we always add a new key, known as a surrogate key to dimension tables for identifying the records uniquely. However, generally, the unique business key becomes the primary key of relational tables related to business entities but once the surrogate key is added to dimension tables, we make it as the primary key without reusing the business key. Here is an example from AdventureWorksDW data warehouse.

Surrogate key is an integer that is assigned sequentially when records are added. There are many other names for this such as artificial key, synthetic key, meaningless key, integer key or non-natural key. Do we really need this? Does it seem inappropriate using the business keys for primary keys? It may seem sensible for reusing, but here are some valid reasons for not using business keys but using surrogate keys;

Dimension is populated from multiple sources
In many cases, dimensions are populated from multiple sources. This always introduces incompatibility between business keys’ data types originating from sources. Not only that, the uniqueness of the business key is not always guaranteed too.
Business key is reassigned when an object becomes obsolete
OLTP systems always focus on current operations giving less importance to the history. If an object becomes inactive or obsolete (Example: Organization decides to discontinue a product), there is a high probability to get the business key used assigned to a new object. If the business key is used as the key of the dimension, this becomes an issue because dimension already holds an object with the same business key.
Business key is a complex string or GUID
Business keys used in source systems could be formed combining numbers and different characters or with GUIDs. In such scenario, there is no harm of using the business key as the primary key of the dimension table as long as the uniqueness is guaranteed. However, in terms of performance, it makes more effective to have an integer surrogate key than a complex string.
Integer improves the performance
As mentioned with reason 3, it is always better to have an integer key as the key of the dimension even though the business key is a simple string. This improves the efficiency of the table and speeds up the data retrieval when joining with facts.
Objects without business keys
When populating dimension records, there could be a situation that object needs to be populated without a business key. Good example on this is, holding a newly introduced product in the Product dimension, before extracting it from the source, for supporting a transaction exist in the fact table. Another example is, holding an anonymous customer in the Customer dimension for supporting promotion events recorded in the fact table. It is become possible for holding these objects in dimension tables as “Unknown”, “Not applicable”, or “Anonymous” when surrogate key is used.
Objects are versioned in dimension tables
In order to maintain the history in dimension tables, same objects are versioned by adding multiple records (These dimensions are called as Slowly Changing Dimension Type 2). Example: Customer is recorded with his country Sri Lanka. Later customer becomes a citizen of Australia. In OLTP systems, it is just a change for country column. But in data warehousing, for maintaining the history for events recorded (facts), a new version (new record) of the customer is added with Australia for the country. This is not possible if the business key of the customer is used as the key of the dimension.

Of course, the maintenance cost gets increased and a mechanism has to be introduced for generating surrogate keys and linking them with foreign keys in fact tables. But it is not as difficult as many think because there are many built-in facilities available in DBMS and ETL systems.

Thursday, March 13, 2014

Who invented dimensional modeling?

Who introduced or invented Dimensional Modeling? Was it invented by Ralph Kimball? Since he completely aligns with Dimensional Modeling in his business intelligence strategies, many believe that it was from him. Unfortunately, it is one of the misperceptions persist in the industry related to Dimensional Modeling.

What is Dimensional Modeling?
Dimensional Modeling is a technique that has been used for decades for structuring databases in simple manner, supporting business users’ requirements such as understanding data, navigating through them easily, and running both simple and complex data analysis queries. Dimensional model designed with relational database management systems is called as Start-Schemas and dimensional model designed with multi-dimensional databases is called as OLAP cubes.

Who invented this?
History is limited on this, however, as per the record exist;

The terms “Facts” and “Dimension” were introduced with a joint project developed by General Mills and Dartmouth University in 1960’s.
During 1970’s, Nielsan Marketing Research team used these techniques for grocery and drug store data.
In 1980’s Nielsen Marketing Research and IRI used grocery and drug store scanner data to link with customer’s internal shipment data.

As per the history, no single person holds the credit of introducing the concept but it is something that have been evolved with contributions of many.

Read more about Dimensional Modeling: http://en.wikipedia.org/wiki/Dimensional_modeling

Friday, January 24, 2014

Business Intelligence: Dimensional Modeling - Do we need Operational Codes as Dimension Attributes?

Business User: I need the product code as product name, whenever I use product dimension, system should default to the code, not the name, we all are familiar with codes, not names.

Business Analyst: Will it be an issue for other users? Specifically the top management? I am sure that they do not want to see the revenue by code but revenue by product name.

Business User: I do not think that it is an issue, besides, we will be using the system not them, they will check main KPIs occasionally.

Business Analyst: BI solution is for all, not for one level, we will have the code for product dimension but let’s default to the name not the code, users who are familiar with codes can still go through codes, system facilitates …………

Business User: I think you don’t get what I say, isn't your duty managing users’ requirements……….

Okay, this could be a sort of conversation you will be having during the requirement gathering process or you might already have had something like this. How do you manage it? How did you manage it? What should be the best way of implementing this?

Dimension attributes are the keys of business intelligence solutions. The usability and understandability of attributes (same goes to dimensions too) measures the success of business intelligence solutions. If business users cannot understand or they cannot use the existing attributes for their analysis, the rejection rate goes up, making the solution less gravitative.

Here is an example: General Date dimension contains Year, Quarter, Month and Date. Some solution requires Week too. Once implemented, if bi-weekly analysis appears as another requirement, how it should be addressed? One solution is, have the week as the filter and filter out weeks for making its appearance as bi-weekly attribute. Though we provide a solution, business user sees it as technical solution, not as a business solution. The demand for bi-weekly attribute comes as no surprise. In the absence of business-user-aligned attributes, in the absence of business user’s most friendly attributes, solution becomes less useful.

Let’s go back to the conversation. Business Users insists the codes as the default attribute for product dimension. In most cases, the reason is, she/he comes from operational world. However, one user’s or one department’s requirement does not represent the entire organization’s requirement. Business Intelligence solution is not for a single person, not for one department. Arguing on this is not the way but explaining the advantages of usage of user-friendly textual, descriptive words for attributes will sort the matter. Explaining how these codes are inevitable for unnecessary inconsistencies will sort the matter. This does not mean that attributes such as product codes should completely ignored. This could be useful when preparing specific reports to operational business processes, this could be useful when communicating back to operational sources. Therefore make sure all attributes related to dimension are implemented and make most meaningful attribute as the default attribute. Not only that, make sure you spend more time on understanding attributes, naming them with accurate business terminology, and make them more robust.

Do not forget that neither measure or KPI can be analyzed without dimensions attributes. That is the starting point of all analysis, means that the success of a Business Intelligence solution is determined by dimensions’ attributes.