
Saturday, April 2, 2022

Azure Synapse Analytics - Part III - Serverless SQL Pool



Azure Synapse Serverless SQL Pool facilitates data exploration, transformation and data warehousing, allowing us to work with data using SQL. This video discusses how it works, what we can do with it, and demos a few use cases. This is the content:

  • Introduction to Serverless SQL Pool
  • How useful it is
  • Is it the same as Dedicated SQL Pool?
  • Demo
    • Data exploration (a small query sketch follows this list)
    • Data transformation
    • Making a logical data warehouse and accessing it using Power BI
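
As a small taste of the data exploration demo, a query of the following form reads files directly from storage using the Serverless SQL Pool. This is only a minimal sketch; the storage URL is a placeholder, not one from the video.

-- Exploring raw files in the data lake without loading them first
-- The storage path below is illustrative; point it at your own account
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/data/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales;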

Read more at:
Serverless SQL Pool Architecture
Serverless SQL Pool - Microsoft Learn

Tuesday, April 2, 2019

Azure Databricks - What, Why & How - YouTube Video

Here is the first video on Azure Databricks. It talks about:

  • Why we need Azure Databricks for processing data
  • What is Azure Databricks
  • Demo: How to provision an Azure Databricks workspace, a cluster, a database and a table.



Thursday, January 4, 2018

Introduction to Azure Data Lake Store - Does it replace the Data Warehouse?

I have made a video on Azure Data Lake and Azure Data Lake Store and published it on YouTube. This video explains what Azure Data Lake is, its characteristics, its components, how it works, its usage, a comparison between Data Lake Store and Blob Storage, modern data warehousing, and how we can use Azure Data Lake with modern data warehousing.



Have a look at it and let me know your thoughts or any questions you have.

Monday, September 11, 2017

SQL Server - Merging Partitions - Does it physically move data from a data file to another?

Partitioning is a common practice when maintaining a large number of records in tables, regardless of the database type: whether the database is an OLTP database or a data warehouse based on dimensional modeling. Partitioning requires a Partition Function that describes the boundary values and the number of partitions required, and a Partition Scheme that assigns File Groups to partitions.

Here are a few posts I have made on partitioning:

Once a table is partitioned, based on the workload, we add new boundary values to introduce new partitions to the table, and we remove boundary values to combine (or remove) partitions. These two operations are done with two partition-related functions: SPLIT and MERGE. Now the question is, when we remove a partition using the MERGE function and the partitions are distributed across different file groups (meaning different data files), does SQL Server move data from one file to another? Do we really have to consider it?

Let's make a database and do a simple test: create a database called AdventureWorksTest and add multiple file groups and data files.

-- Creating the test database
USE master;
GO

CREATE DATABASE AdventureWorksTest
GO

-- Add multiple file groups and files
-- files are distributed
-- Add filegroup 1
ALTER DATABASE AdventureWorksTest
ADD FILEGROUP FileGroup1;
GO
ALTER DATABASE AdventureWorksTest
ADD FILE 
(
    NAME = AdventureWorksTest_Data_01,
    FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\AdventureWorksTest_Data_01.ndf',
    SIZE = 5MB,
    MAXSIZE = 100MB,
    FILEGROWTH = 5MB
)
TO FILEGROUP FileGroup1;

-- Add filegroup 2
ALTER DATABASE AdventureWorksTest
ADD FILEGROUP FileGroup2;
GO
ALTER DATABASE AdventureWorksTest
ADD FILE 
(
    NAME = AdventureWorksTest_Data_02,
    FILENAME = 'C:\Databases\AdventureWorksTest_Data_02.ndf',
    SIZE = 5MB,
    MAXSIZE = 100MB,
    FILEGROWTH = 5MB
)
TO FILEGROUP FileGroup2;

-- Add filegroup 3
ALTER DATABASE AdventureWorksTest
ADD FILEGROUP FileGroup3;
GO
ALTER DATABASE AdventureWorksTest
ADD FILE 
(
    NAME = AdventureWorksTest_Data_03,
    FILENAME = 'D:\Databases\AdventureWorksTest_Data_03.ndf',
    SIZE = 5MB,
    MAXSIZE = 100MB,
    FILEGROWTH = 5MB
)
TO FILEGROUP FileGroup3;

-- Add filegroup 4
ALTER DATABASE AdventureWorksTest
ADD FILEGROUP FileGroup4;
GO
ALTER DATABASE AdventureWorksTest
ADD FILE 
(
    NAME = AdventureWorksTest_Data_04,
    FILENAME = 'E:\Databases\AdventureWorksTest_Data_04.ndf',
    SIZE = 5MB,
    MAXSIZE = 100MB,
    FILEGROWTH = 5MB
)
TO FILEGROUP FileGroup4;
GO

As you see, we have five file groups now.


Let's create a partitioned table and load it using the AdventureWorks2014 Sales.SalesOrderHeader table. Note that the below code creates a Partition Function with 2014, 2015, 2016 and 2017 as boundary values, and it creates the Partition Scheme assigning multiple file groups.

-- Creating partition function and scheme
USE AdventureWorksTest;
GO

CREATE PARTITION FUNCTION pf_Year (datetime) 
AS RANGE RIGHT
FOR VALUES ('2014-01-01 00:00:00.000', '2015-01-01 00:00:00.000',  '2016-01-01 00:00:00.000',  '2017-01-01 00:00:00.000');
GO

CREATE PARTITION SCHEME ps_Year 
AS PARTITION pf_Year 
TO ([PRIMARY], FileGroup1, FileGroup2, FileGroup3, FileGroup4);
GO



-- Creating a partitioned table and loading from AdventureWorks2014 database
-- Note that sales dates of the AdventureWorks2014 database have been updated to years 2014 to 2017
CREATE TABLE Sales
(
 SalesOrderID int NOT NULL,
 RevisionNumber tinyint NOT NULL DEFAULT (0),
 OrderDate datetime NOT NULL DEFAULT GETDATE(),
 DueDate datetime NOT NULL,
 ShipDate datetime NULL,
 [Status] tinyint NOT NULL DEFAULT(1),
 SubTotal money NOT NULL DEFAULT(0.00),
 TaxAmt money NOT NULL DEFAULT (0.00),
 Freight money NOT NULL DEFAULT (0.00),
 TotalDue money,
 Comment nvarchar(128) NULL,
 rowguid uniqueidentifier ROWGUIDCOL NOT NULL DEFAULT NEWID(),
 ModifiedDate datetime NOT NULL DEFAULT GETDATE()
) 
ON ps_Year(OrderDate);
GO

INSERT INTO Sales
  (SalesOrderID, RevisionNumber, OrderDate, DueDate, ShipDate, [Status]
  , SubTotal, TaxAmt, Freight, TotalDue, Comment, rowguid, ModifiedDate)
SELECT SalesOrderID, RevisionNumber, OrderDate, DueDate, ShipDate, [Status]
  , SubTotal, TaxAmt, Freight, TotalDue, Comment, rowguid, ModifiedDate
FROM AdventureWorks2014.Sales.SalesOrderHeader;

-- Checking records;
SELECT  $PARTITION.pf_Year(OrderDate) PartitionNo, *
FROM Sales;

Here is the result of the query.


Here is another query to see how data is distributed across the file groups.

-- Get the record count for each partition/file group
SELECT t.name AS TableName,
  p.partition_number PartitionNumber,
  COALESCE(f.name, d.name) FileGroupName, 
  SUM(p.rows) NumberOfRecords,
  SUM(a.total_pages) NumberOfDataPages
FROM sys.tables AS t
 INNER JOIN sys.indexes i 
  ON i.object_id = t.object_id
 INNER JOIN sys.partitions p 
  ON p.object_id = t.object_id 
   AND p.index_id = i.index_id
 INNER JOIN sys.allocation_units a 
  ON a.container_id = p.partition_id
 LEFT OUTER JOIN sys.filegroups f 
  ON f.data_space_id = i.data_space_id
 LEFT OUTER JOIN sys.destination_data_spaces AS dds 
  ON dds.partition_scheme_id = i.data_space_id 
   AND dds.destination_id = p.partition_number
 LEFT OUTER JOIN sys.filegroups d 
  ON d.data_space_id = dds.data_space_id
WHERE t.[type] = 'U' AND i.index_id IN (0, 1) AND t.name LIKE 'Sales'
GROUP BY t.name, p.partition_number, COALESCE(f.name, d.name) 
ORDER BY t.name, p.partition_number;


If I illustrate the same in a different manner:


This is how the data is distributed in my system now. If I try to delete a data file, for example AdventureWorksTest_Data_03, which belongs to FileGroup3, SQL Server will not allow me to do so as it holds data.


If I decide to combine 2015 and 2016, in other words, if I want to remove the 4th partition that holds 2016 data, what will happen? Let's do it and see. The below code merges the partition:

ALTER PARTITION FUNCTION pf_Year()
MERGE RANGE ('2016-01-01 00:00:00.000');

Once merged, let me run the same query to see how data is partitioned now.


As you see, FileGroup3 is no longer used by the table and the number of records in FileGroup2 has increased. The table that had 5 partitions now has 4. This means that the data that was maintained in FileGroup3 (or AdventureWorksTest_Data_03) has been moved to FileGroup2 (or AdventureWorksTest_Data_02). This is what has happened:


This clearly shows that data is moved when partitions are merged. Okay, do we really have to consider this? Yes, we do. With a small table we do not see any performance issue, but if the table is large, moving data from one file to another will surely take time. Hence, think twice before merging, specifically when partitions are distributed across multiple files.

** Now, if you need, you can delete the AdventureWorksTest_Data_03 file, and SQL Server will allow you to delete it as it is empty.
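
For completeness, removing the emptied file can be done as below; a small sketch against the test database created above.

-- Removing the now-empty data file
ALTER DATABASE AdventureWorksTest
REMOVE FILE AdventureWorksTest_Data_03;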

Thursday, September 7, 2017

Dimension with just one attribute - data warehouse design

Have you ever come across a situation where you have to design a dimension that has no attributes except the one that explains the dimension itself? I am sure that you have experienced such dimensions, and you might have created a table in the relational data warehouse for maintaining it. But is it the best way of managing such dimensions?

Assume that you have a fact table that maintains Issues and it is linked with multiple dimensions such as Date, Warehouse and Customer. And you have identified Issue Number as another dimension as there is a requirement for analyzing claims by issue number. You may design your data warehouse as below.


As you see, there are no other identified attributes for Issue Number dimension. Do we have an alternate way of designing it?

This is where we use a dimension type called a Degenerate Dimension. Whenever we see such a dimension and we know for sure that it has only one attribute and there will always be a unique value per transaction in the fact, it is better to add it to the fact table itself rather than maintaining another table. A dimension that is maintained in the fact table is called a Degenerate Dimension.
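
A quick sketch of this design; the table and column names here are illustrative, not from the original diagram.

-- The issue number is kept in the fact table itself as a degenerate dimension,
-- instead of a foreign key referencing a one-column DimIssueNumber table
CREATE TABLE dbo.FactIssue
(
 DateKey int NOT NULL,
 WarehouseKey int NOT NULL,
 CustomerKey int NOT NULL,
 IssueNumber varchar(20) NOT NULL, -- degenerate dimension
 IssueQuantity int NOT NULL
);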


Monday, August 14, 2017

Azure SQL Data Warehouse - Part II - Analyzing Created Databases

I started writing on this a long time back but had no way of completing it. My first post on this was Azure SQL Data Warehouse - Part I - Creating and Connecting, and here is the second part of it.

The first post discussed Azure SQL Data Warehouse and how to create a server and database using the portal. As the next step, let's talk about its architecture a bit and then see how Azure SQL Data Warehouse maintains and processes data.

Control Node

Azure SQL Data Warehouse is a distributed database, meaning that data is distributed across multiple locations. However, once the data warehouse is created, we connect to a component called the Control Node. It is not exactly a SQL Server database, but when connecting to it, it looks and feels like connecting to a SQL Server database. The Control Node handles all communication and computation. When we make a request to the data warehouse, the Control Node accepts it, determines how it should be distributed based on a divide-and-conquer approach, gets it processed and finally sends the result to us.

Compute Node

The Control Node gets the data processed in parallel using Compute Nodes. They are SQL Server databases and store all our records. Based on the number of DWUs configured, the data warehouse is set up with one or more Compute Nodes.

Distribution

Data related to the data warehouse is stored in Azure Blob Storage and distributed across multiple locations. Storage is independent from the Compute Nodes, hence they can be operated and adjusted independently. These locations are called Distributions. The number of distributions for an Azure SQL Data Warehouse is fixed at sixty (60). These distributions are assigned dynamically to Compute Nodes, and when a query is executed, each distribution processes the data related to it. This is how parallel execution happens.

If you need more compute power, you can increase the number of Compute Nodes by increasing DWUs. When the number of Compute Nodes changes, the number of distributions per Compute Node changes as well.

Architecture

This image shows the architecture when you create an Azure SQL Data Warehouse with 100 DWU.


This image shows the architecture when you create an Azure SQL Data Warehouse with 400 DWU.


Let's create two databases and clarify this configuration.

I have discussed all the steps related to server creation and database creation in my first post, Azure SQL Data Warehouse - Part I - Creating and Connecting, hence I am not going to repeat them here. Note the image below; that is how I have created two databases for checking the configurations.


As you see, the first data warehouse is created using 100 DWUs and the second one with 400 DWUs. Let's see how nodes have been created for these two databases. We can simply use the sys.dm_pdw_nodes DMV for getting this information.

SELECT * FROM sys.dm_pdw_nodes;

SELECT type, Count(*) NofNodes
FROM sys.dm_pdw_nodes
GROUP BY type;

Result with the first data warehouse that is created using 100 DWUs.


Result with the second data warehouse that is created using 400 DWUs.


Note the second data warehouse. Since we used more DWUs, it has been created with four Compute Nodes, which gives better performance than the first one. Since this is a sample database and it has tables, we can check one of the tables and see how data is distributed across distributions.

The following code shows the distributions created for one data warehouse. As mentioned above, it is always 60.
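
A query along these lines shows it; sys.pdw_distributions lists the distributions of the data warehouse.

-- This always returns 60 for an Azure SQL Data Warehouse
SELECT COUNT(*) AS NumberOfDistributions
FROM sys.pdw_distributions;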


Here is the code for seeing how the rows of a table are distributed across distributions in the second data warehouse created. Note how distributions are assigned to Compute Nodes.
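
As a sketch, DBCC PDW_SHOWSPACEUSED reports rows and space per distribution for a given table; FactInternetSales here is assumed to be one of the tables in the sample database.

-- Rows and space used per distribution of the table
DBCC PDW_SHOWSPACEUSED ("dbo.FactInternetSales");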


Records are distributed based on the design of the table. Azure SQL Data Warehouse uses two types of distributions: Round Robin and Hash. Let's talk about them in the next post.

Friday, August 4, 2017

bcp error - Invalid column type from bcp client for colid 1 - Solution

You might experience the following error with bcp when loading text data to a SQL Server table:


It is possible to see this error when there is a data type mismatch between a column in the destination table and a field coming from the file. However, there can also be a situation where you have used a view for the bcp and the underlying table has been changed. If so, you may see the same error.

In my case, my destination table has an identity column, hence I created a view for getting data in via bcp. I started seeing this error after making some modifications to the table. Once the view was dropped and recreated, it worked without any issue.

Thursday, August 3, 2017

Script for creating DimTime Dimension Table and populating data

The posts I made on the date dimension table (Script for populating Date Dimension with Financial Year and Loading Data Script for Snowflake type Date Dimension) had a good response, and a few asked about the same for a time-related table. This is one of the old scripts I used for populating a Time Dimension. It maintains time up to the second, but if you need more granularity, the script can be easily changed.

Here is the code for creating the table and populating data.

CREATE TABLE dbo.Time
(
 TimeKey int identity(1,1) 
 , Constraint pk_Time primary key (TimeKey)
 , [Hour] smallint not null
 , [24Hour] smallint not null
 , [Minute] smallint not null
 , [Second] smallint not null
 , [Time] char(10) not null
 , [24Time] char(8) not null
 , HourBucket char(9)
);
GO

DECLARE @Hour smallint = 0
DECLARE @Minute smallint = 0
DECLARE @Second smallint = 0

WHILE (@Hour < 24)
BEGIN

 WHILE (@Minute < 60)
 BEGIN

  WHILE (@Second < 60)
  BEGIN

   INSERT INTO dbo.[Time]
    ([Hour], [24Hour], [Minute], [Second], [Time], [24Time], HourBucket)
   VALUES
    (CASE WHEN @Hour > 12 THEN @Hour - 12 ELSE @Hour END
    , @Hour, @Minute, @Second
    , REPLICATE('0', 2 - LEN(CASE WHEN @Hour > 12 THEN @Hour - 12 ELSE @Hour END)) + CONVERT(varchar(2), CASE WHEN @Hour > 12 THEN @Hour - 12 ELSE @Hour END) 
     + ':' + REPLICATE('0', 2 - LEN(@Minute)) + CONVERT(varchar(2), @Minute) 
     + ':' + REPLICATE('0', 2 - LEN(@Second)) + CONVERT(varchar(2), @Second) 
     + CASE WHEN @Hour >= 12 THEN 'PM' ELSE 'AM' END
    , REPLICATE('0', 2 - LEN(@Hour)) + CONVERT(varchar(2), @Hour) 
     + ':' + REPLICATE('0', 2 - LEN(@Minute)) + CONVERT(varchar(2), @Minute) 
     + ':' + REPLICATE('0', 2 - LEN(@Second)) + CONVERT(varchar(2), @Second) 
    , '[' + REPLICATE('0', 2 - LEN(@Hour)) + CONVERT(varchar(2), @Hour) 
     + ' - ' 
     + REPLICATE('0', 2 - LEN(@Hour+1)) + CONVERT(varchar(2), @Hour+1) + ']')
   SET @Second += 1;
  END
  SET @Second = 0;
  SET @Minute += 1;
 END
 SET @Minute = 0;
 SET @Hour += 1;
END
GO

SELECT * FROM dbo.[Time];


Tuesday, August 1, 2017

Script for populating Date Dimension with Financial Year

I once published the code I used for creating date dimension tables and populating data, but it did not handle financial year related date elements. Generally, we handle financial year elements with a Date table in a Star Schema, not with a Snowflake Schema. And in most cases (in the old days), we usually used an OLAP Data Warehouse as the source for reporting, and Analysis Services can easily handle financial dates with a Star Schema implemented in the Relational Data Warehouse. However, in the modern world, we try to build the solution using just the Relational Data Warehouse without Multi-dimensional models (or using Tabular Models), so I thought of making another script for handling financial dates with a Snowflake Schema.

With this script, four tables are created: Year, Quarter, Month and Date. If you need to see elements related to the financial year, the link between Year and Quarter should be set with Year.YearKey and Quarter.FinancialYearKey. For calendar date elements, the link should be between Year.YearKey and Quarter.YearKey.

This is how you make a search for financial dates:
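
A sketch of such a search, assuming the table definitions given below.

-- Financial-year search: DimQuarter joins DimYear via FinancialYearKey
SELECT y.[Year] AS FinancialYear, q.FinancialYearQuarter, d.Date
FROM dbo.DimYear y
 INNER JOIN dbo.DimQuarter q ON q.FinancialYearKey = y.YearKey
 INNER JOIN dbo.DimMonth m ON m.QuarterKey = q.QuarterKey
 INNER JOIN dbo.DimDate d ON d.MonthKey = m.MonthKey
WHERE y.[Year] = 2015;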


This is how you make a search for calendar dates:
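
And a matching sketch for calendar dates; only the join column changes.

-- Calendar search: DimQuarter joins DimYear via YearKey
SELECT y.[Year], q.YearQuater, d.Date
FROM dbo.DimYear y
 INNER JOIN dbo.DimQuarter q ON q.YearKey = y.YearKey
 INNER JOIN dbo.DimMonth m ON m.QuarterKey = q.QuarterKey
 INNER JOIN dbo.DimDate d ON d.MonthKey = m.MonthKey
WHERE y.[Year] = 2015;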


Here is the script for creating tables;

CREATE TABLE dbo.DimYear
(
 YearKey smallint 
 , Year smallint not null
 , Constraint pk_DimYear Primary Key (YearKey)
);
GO

CREATE TABLE dbo.DimQuarter
(
 QuarterKey smallint 
 , YearKey smallint not null
 , FinancialYearKey smallint not null
 , Quarter smallint not null
 , YearQuater varchar(20) not null
 , QuarterName varchar(20) not null
 , FinancialQuarter smallint not null
 , FinancialYearQuarter varchar(20) not null
 , Constraint pk_DimQuarter Primary Key (QuarterKey)
 , Constraint fk_DimQuarter_DimYear Foreign Key (YearKey)
  References dbo.DimYear (YearKey)
 , Constraint fk_DimQuarter_DimYear_Financial Foreign Key (FinancialYearKey)
  References dbo.DimYear (YearKey)
);
GO

CREATE TABLE dbo.DimMonth
(
 MonthKey int 
 , QuarterKey smallint not null
 , MonthNumber smallint not null
 , MonthName varchar(20) not null
 , YearMonth varchar(20) not null
 , MonthShortName char(3) not null
 , FinancialMonthNumber smallint not null
 , Constraint pk_DimMonth Primary Key (MonthKey)
 , Constraint fk_DimMonth_DimQuarter Foreign Key (QuarterKey)
  References dbo.DimQuarter (QuarterKey)
);
GO

CREATE TABLE dbo.DimDate
(
 DateKey int 
 , MonthKey int not null
 , Date date not null
 , WeekDayNumber smallint not null
 , WeekDayName varchar(20) not null
 , DayOfMonth smallint not null
 , DayOfYear smallint not null
 , IsWeekend bit not null
 , IsHoliday bit not null
 , WeekNumberOfYear smallint not null
 , Constraint pk_DimDate Primary Key (DateKey)
 , Constraint fk_DimDate_DimMonth Foreign Key (MonthKey)
  References dbo.DimMonth (MonthKey)
);
GO

Here is the script for populating dates;

SET DATEFIRST 1;

DECLARE @StartDate date = '1990-01-01';
DECLARE @EndDate date = '2019-12-31';
DECLARE @FinancialYearStartingQuarter smallint = 3; -- Starting from July; if it starts from April, set 2

DECLARE @YearKey smallint;
DECLARE @Quarter smallint, @QuarterKey smallint;
DECLARE @Month int, @MonthKey int;
DECLARE @Date int, @DateKey int;

DECLARE @FinancialStartingMonth smallint;
SET @FinancialStartingMonth = CASE @FinancialYearStartingQuarter WHEN 1 THEN 1 WHEN 2 THEN 4 WHEN 3 THEN 7 WHEN 4 THEN 10 END

INSERT INTO dbo.DimYear 
 (YearKey, [Year]) 
VALUES 
 (YEAR(@StartDate) - 1, YEAR(@StartDate) - 1);


WHILE (@StartDate <= @EndDate)
BEGIN

 -- Inserting years
 SET @YearKey = YEAR(@StartDate);
 IF NOT EXISTS (SELECT * FROM dbo.DimYear WHERE YearKey = @YearKey)
  INSERT INTO dbo.DimYear (YearKey, [Year]) 
  VALUES 
  (@YearKey, @YearKey);


 -- Inserting quarters
 SET @QuarterKey = Convert(smallint, Convert(varchar(4), YEAR(@StartDate)) + Convert(varchar(1), DATEPART(q, @StartDate)))
 SET @Quarter = DATEPART(q, @StartDate);

 IF NOT EXISTS (SELECT * FROM dbo.DimQuarter WHERE QuarterKey = @QuarterKey)
 INSERT INTO dbo.DimQuarter 
  (QuarterKey, YearKey
  , FinancialYearKey
  , [Quarter], YearQuater, QuarterName
  , FinancialQuarter, FinancialYearQuarter) 
 VALUES (@QuarterKey, @YearKey
  , CASE WHEN @Quarter < @FinancialYearStartingQuarter THEN @YearKey -1 ELSE @YearKey END
  , @Quarter, Convert(varchar(4), YEAR(@StartDate)) + 'Q' + Convert(varchar(4), @Quarter)
  , CASE @Quarter WHEN 1 THEN 'First Quarter' WHEN 2 THEN 'Second Quarter' WHEN 3 THEN 'Third Quarter' ELSE 'Fourth Quarter' END
  , CASE @Quarter WHEN 1 THEN 3 WHEN 2 THEN 4 WHEN 3 THEN 1 WHEN 4 THEN 2 END 
  , CASE @Quarter WHEN 1 THEN Convert(varchar(4), YEAR(@StartDate) - 1) + 'Q3'
      WHEN 2 THEN Convert(varchar(4), YEAR(@StartDate) - 1) + 'Q4'
      WHEN 3 THEN Convert(varchar(4), YEAR(@StartDate)) + 'Q1'
      WHEN 4 THEN Convert(varchar(4), YEAR(@StartDate)) + 'Q2'END
  );
  

 ---- Inserting months
 SET @MonthKey = Convert(int, Convert(varchar(4), YEAR(@StartDate)) + (CASE LEN(MONTH(@StartDate)) WHEN 1 THEN '0' ELSE '' END) + Convert(varchar(2), MONTH(@StartDate)));
 SET @Month = MONTH(@StartDate)
 IF NOT EXISTS (SELECT * FROM dbo.DimMonth WHERE MonthKey = @MonthKey)
  INSERT INTO dbo.DimMonth 
   (MonthKey, QuarterKey, MonthNumber, MonthName, YearMonth, MonthShortName
   , FinancialMonthNumber) 
  VALUES 
   (@MonthKey, @QuarterKey, @Month, DATENAME(MONTH, @StartDate)
   , Convert(varchar(4), YEAR(@StartDate)) + ' ' + LEFT(DATENAME(MONTH, @StartDate), 3)
   , LEFT(DATENAME(MONTH, @StartDate), 3)
   , CASE 
     WHEN @FinancialStartingMonth = 1 THEN @Month
     WHEN @FinancialStartingMonth = 4 AND @Month  < @FinancialStartingMonth THEN @Month + 9
     WHEN @FinancialStartingMonth = 4 AND @Month  >= @FinancialStartingMonth THEN @Month - 3
     WHEN @FinancialStartingMonth = 7 AND @Month  < @FinancialStartingMonth THEN @Month + 6
     WHEN @FinancialStartingMonth = 7 AND @Month  >= @FinancialStartingMonth THEN @Month - 6
     WHEN @FinancialStartingMonth = 10 AND @Month  < @FinancialStartingMonth THEN @Month + 3
     WHEN @FinancialStartingMonth = 10 AND @Month  >= @FinancialStartingMonth THEN @Month - 9
    END
   );
  
  ---- Inserting dates
  SET @DateKey = Convert(int, Convert(varchar(8), @StartDate, 112))
  IF NOT EXISTS (SELECT * FROM dbo.DimDate WHERE DateKey = @DateKey)
   INSERT INTO dbo.DimDate (DateKey, MonthKey, Date, WeekDayNumber, WeekDayName, DayOfMonth
    , DayOfYear, IsWeekend, IsHoliday, WeekNumberOfYear) 
   VALUES (@DateKey, @MonthKey, @StartDate, DatePart(WEEKDAY, @StartDate), DateName(WEEKDAY, @StartDate), DatePart(DAY, @StartDate)
    , DatePart(DAYOFYEAR, @StartDate), CASE WHEN DatePart(WEEKDAY, @StartDate) IN (6, 7) THEN 1 ELSE 0 END, 0
    , DatePart(WEEK, @StartDate));

  SET @StartDate = DATEADD(dd, 1, @StartDate);
END;
GO

Saturday, March 25, 2017

Loading Data Script for Snowflake type Date Dimension

Update: If you need the script for the same with financial year elements, read the new post.

The Date dimension is the most common dimension in data warehousing and it is generally designed as a single table. However, if it needs to be used with multiple Fact Tables at different levels, the Date dimension table should be broken into multiple tables, which creates a Snowflake Schema. If the table is broken into multiple tables such as Year, Quarter, Month and Date, loading them is a bit difficult. A script for loading a single table is available on many sites, but it is hard to find a script for loading multiple tables. I had the same requirement and could not find a proper script, hence wrote one.

Here is the script if you need the same. First, here is the script for creating the tables:
CREATE TABLE dbo.DimYear
(
 YearKey smallint 
 , Year smallint not null
 , Constraint pk_DimYear Primary Key (YearKey)
);
GO

CREATE TABLE dbo.DimQuarter
(
 QuarterKey smallint 
 , YearKey smallint not null
 , Quarter smallint not null
 , YearQuater varchar(20) not null
 , QuarterName varchar(20) not null
 , Constraint pk_DimQuarter Primary Key (QuarterKey)
 , Constraint fk_DimQuarter_DimYear Foreign Key (YearKey)
  References dbo.DimYear (YearKey)
);
GO

CREATE TABLE dbo.DimMonth
(
 MonthKey int 
 , QuarterKey smallint not null
 , MonthNumber smallint not null
 , MonthName varchar(20) not null
 , YearMonth varchar(20) not null
 , MonthShortName char(3) not null
 , Constraint pk_DimMonth Primary Key (MonthKey)
 , Constraint fk_DimMonth_DimQuarter Foreign Key (QuarterKey)
  References dbo.DimQuarter (QuarterKey)
);
GO

CREATE TABLE dbo.DimDate
(
 DateKey int 
 , MonthKey int not null
 , Date date not null
 , WeekDayNumber smallint not null
 , WeekDayName varchar(20) not null
 , DayOfMonth smallint not null
 , DayOfYear smallint not null
 , IsWeekend bit not null
 , IsHoliday bit not null
 , WeekNumberOfYear smallint not null
 , Constraint pk_DimDate Primary Key (DateKey)
 , Constraint fk_DimDate_DimMonth Foreign Key (MonthKey)
  References dbo.DimMonth (MonthKey)
);
GO


And here is the script for loading all tables:
SET DATEFIRST 1;

DECLARE @StartDate date = '2010-01-01';
DECLARE @EndDate date = '2019-12-31';
DECLARE @YearKey smallint;
DECLARE @Quarter smallint, @QuarterKey smallint;
DECLARE @Month int, @MonthKey int;
DECLARE @Date int, @DateKey int;

WHILE (@StartDate <= @EndDate)
BEGIN

 -- Inserting years
 SET @YearKey = YEAR(@StartDate);
 IF NOT EXISTS (SELECT * FROM dbo.DimYear WHERE YearKey = @YearKey)
  INSERT INTO dbo.DimYear (YearKey, [Year]) VALUES (@YearKey, @YearKey);


 -- Inserting quarters
 SET @QuarterKey = Convert(smallint, Convert(varchar(4), YEAR(@StartDate)) + Convert(varchar(1), DATEPART(q, @StartDate)))
 SET @Quarter = DATEPART(q, @StartDate);

 IF NOT EXISTS (SELECT * FROM dbo.DimQuarter WHERE QuarterKey = @QuarterKey)
  INSERT INTO dbo.DimQuarter (QuarterKey, YearKey, [Quarter], YearQuater, QuarterName) 
  VALUES (@QuarterKey, @YearKey, @Quarter, Convert(varchar(4), YEAR(@StartDate)) + 'Q' + Convert(varchar(4), @Quarter)
   , CASE @Quarter WHEN 1 THEN 'First Quarter' WHEN 2 THEN 'Second Quarter' WHEN 3 THEN 'Third Quarter' ELSE 'Fourth Quarter' END);
  

 ---- Inserting months
 SET @MonthKey = Convert(int, Convert(varchar(4), YEAR(@StartDate)) + (CASE LEN(MONTH(@StartDate)) WHEN 1 THEN '0' ELSE '' END) + Convert(varchar(2), MONTH(@StartDate)));
 SET @Month = MONTH(@StartDate)
 IF NOT EXISTS (SELECT * FROM dbo.DimMonth WHERE MonthKey = @MonthKey)
  INSERT INTO dbo.DimMonth (MonthKey, QuarterKey, MonthNumber, MonthName, YearMonth, MonthShortName) 
  VALUES (@MonthKey, @QuarterKey, @Month, DATENAME(MONTH, @StartDate)
  , Convert(varchar(4), YEAR(@StartDate)) + ' ' + LEFT(DATENAME(MONTH, @StartDate), 3)
  , LEFT(DATENAME(MONTH, @StartDate), 3));
  
 ---- Inserting dates
 SET @DateKey = Convert(int, Convert(varchar(8), @StartDate, 112))
 IF NOT EXISTS (SELECT * FROM dbo.DimDate WHERE DateKey = @DateKey)
  INSERT INTO dbo.DimDate (DateKey, MonthKey, Date, WeekDayNumber, WeekDayName, DayOfMonth
   , DayOfYear, IsWeekend, IsHoliday, WeekNumberOfYear) 
  VALUES (@DateKey, @MonthKey, @StartDate, DatePart(WEEKDAY, @StartDate), DateName(WEEKDAY, @StartDate), DatePart(DAY, @StartDate)
   , DatePart(DAYOFYEAR, @StartDate), CASE WHEN DatePart(WEEKDAY, @StartDate) IN (6, 7) THEN 1 ELSE 0 END, 0
   , DatePart(WEEK, @StartDate));

 SET @StartDate = DATEADD(dd, 1, @StartDate);
END;
GO


Tuesday, December 6, 2016

Azure SQL Data Warehouse - Part I - Creating and Connecting

We have been implementing on-premises data warehouses for years, and everyone is well aware of the difficulties faced in the first phase of a project: requesting hardware. In most cases, the project is stopped or postponed by the management or the client saying there is no budget. But now we have a solution that addresses this initial issue. Not only that, this solution gives you many benefits in terms of performance as well. It is Azure SQL Data Warehouse.

Azure SQL Data Warehouse is a cloud-based SQL Data Warehouse which can be easily scaled with cloud-based resources on a pay-as-it-grows basis. It is a fully managed service that covers maintenance, software patches and backups. It can store and process both relational and non-relational data, and many Azure features such as Azure Blob Storage and Azure Data Factory can be used for implementing it.

Let's talk about some benefits in detail.

Scalability and Cost
This is one of the major benefits. You can scale Azure SQL Data Warehouse, increasing resources when required, and cost is calculated based on it. Since there is no upfront cost involved, you can simply start with a small number of resources and adjust when you want. You are charged for two resource usage types: Storage and Compute.

The storage configured for Azure SQL Data Warehouse is automatically re-scaled as you add and remove data, hence you pay for what you use.

The value set for Compute is the measurement of execution performance in Azure SQL Data Warehouse. It is measured in Data Warehouse Units (DWU), and you can start with 100 DWU (USD 2 per hour) and go up to 6000 DWU (USD 123 per hour) as per the requirements. This value can be changed at any time, and the cost of usage is calculated as per the value set.


Performance
Since this can be scaled out as you need, data and queries can be distributed, increasing performance. It uses a Massively Parallel Processing (MPP) architecture, hence the workload can be spread across many nodes using complex statistics calculated, increasing performance significantly.

It uses Columnstore Indexing, which can be up to 10 times faster than Rowstore Indexing. Since it supports up to 5 times more compression than standard Rowstore Indexes, reducing IOs, it improves performance further.

Hybrid Cloud
It is a hybrid cloud environment. It supports seamless integration with on-premises data, cloud data and, of course, unstructured data. With PolyBase, different types of sources can be queried using T-SQL as if the data were stored locally, making sure that everything required is visible in a single database.

Not only that, copying data from other sources can be easily done via Azure Data Factory.

Okay, now let's see how we can create one and start using it. First of all, you need an Azure account. Once the account is created, you need to create a SQL Server, which is a logical server for holding your data warehouse. If you have already created a SQL Server, then it can be used; else follow the steps below for creating a logical server.

1. Login to the Azure Portal and open SQL Servers Blade.
2. Set Server name, Server admin login, Password, Resource Group and Location. Note that you can add database-specific logins later.


3. Click on Create button to get it created. If you want to see the server in Azure Dashboard, make sure that Pin to dashboard is checked.

Now the server is ready. Let's create a data warehouse.

1. Open the SQL Data Warehouses blade.
2. Enter the Database Name. With this test, we will be using AdventureWorks (we will be creating different databases later), hence set the name as AdventureWorksDW.
3. Select the appropriate Resource Group. Make sure that you select the same one selected for the server.
4. Select Sample for Select source.
5. Select AdventureWorksDW as the sample database.
6. Select the server created with the above steps and enter the admin account and its password.
7. Select DWU as you need. Let's initially set this as 200.


8. Select Pin to dashboard and click Create to get the data warehouse created.

The database is created. Remember, you cannot connect to the server until your IP address is permitted to connect. Initially, you can configure your IP address on the server using the portal. Later, you can connect to the server using Management Studio and configure IP addresses for others using T-SQL, either at server level or database level. Here are the steps.

1. Open the portal and open SQL Servers blade.
2. Click on the server created (or the one used for adding the data warehouse) to get its properties.
3. You can add your IP address by clicking Add client IP button and clicking Save button. Or else, you can manually set it using the given grid.


4. Add it and save.

Now the firewall rule is configured for you. Let's connect and see.

1. Open Management Studio and connect with the server account as below.


2. Once connected, open a new Query Window and type the below code to see the firewall rules set.
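
A query along these lines lists them; sys.firewall_rules is the server-level catalog view, queried against the master database.

-- Server-level firewall rules (run against the master database)
SELECT * FROM sys.firewall_rules;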


3. If you need to add more rules for others, use sp_set_firewall_rule for server level rules and sp_set_database_firewall_rule for adding database level rules.

-- Adding server level rules
EXECUTE sp_set_firewall_rule @name = N'dinesqlserverAdminRule',
     @start_ip_address = '192.168.10.100', @end_ip_address = '192.168.10.100'

-- Use this for removing it
-- EXECUTE sp_delete_firewall_rule @name = N'dinesqlserverAdminRule'



-- Adding database level rules
-- *** Connect with database and execute
EXEC sp_set_database_firewall_rule @name = N'dinesqlserverDevelopersRule', 
     @start_ip_address = '192.168.10.220', @end_ip_address = '192.168.10.225'

EXEC sp_delete_database_firewall_rule @name = N'dinesqlserverDevelopersRule'

4. Query your database now. You should be able to query your tables without any issue.


With the next post, let's see how we can use other tools like Visual Studio and Power BI for connecting with the data warehouse created.



Monday, October 24, 2016

Disable PRIMARY KEY before data loading

In order to improve loading performance, specifically when a data warehouse or a large table is loaded, it is recommended to disable constraints such as PRIMARY KEY, UNIQUE KEY and CHECK, as well as indexes, if the data is already cleansed and validated. Although we heavily practice this in data warehousing, we have some limitations with OLTP databases because the way we structure tables is different from the way we do it with data warehouses.

Can we disable the primary key of a table and continue with data loading? The answer is both Yes and No.

Remember, when you mark one of your table columns as the primary key, it creates a Clustered Index on the column you picked. If a Clustered Index already exists (created before the primary key), then it creates a Non-Clustered Index. If the primary key is created with a Clustered Index and you disable it, the table cannot be accessed because the main structure is disabled. However, if the primary key is created with a Non-Clustered Index, you can disable it and still work with the table.

USE tempdb;
GO

-- create a table with a primary key which is clustered
CREATE TABLE dbo.Customer
(
 CustomerID int identity(1,1) not null
 , FirstName varchar(100) not null
 , Constraint PK_Customer PRIMARY KEY CLUSTERED (CustomerID)
);

-- create a table with a primary key which is nonclustered
CREATE TABLE dbo.Employee
(
 EmployeeID int identity(1,1) not null
 , FirstName varchar(100) not null
 , Constraint PK_Employee PRIMARY KEY NONCLUSTERED (EmployeeID)
);

-- Insert some records
INSERT INTO dbo.Customer
 (FirstName) VALUES ('Dinesh'), ('Yeshan');
INSERT INTO dbo.Employee
 (FirstName) VALUES ('Jane'), ('Jack');

-- checking records
-- This should return all records we inserted
SELECT * FROM dbo.Customer;
SELECT * FROM dbo.Employee;

-- Disable the primary key of the Customer
-- by disabling associated index
ALTER INDEX PK_Customer ON dbo.Customer DISABLE;

-- Now following statements are not possible
INSERT INTO dbo.Customer
 (FirstName) VALUES ('Kate'), ('Mihen');
SELECT * FROM dbo.Customer;

-- Disable the primary key of the Employee
-- by disabling the associated index
ALTER INDEX PK_Employee ON dbo.Employee DISABLE;

-- These statements work without any issue
INSERT INTO dbo.Employee
 (FirstName) VALUES ('Nazir'), ('Daniel');
SELECT * FROM dbo.Employee;

-- Enabling both
ALTER INDEX PK_Customer ON dbo.Customer REBUILD;
ALTER INDEX PK_Employee ON dbo.Employee REBUILD;

The same goes for Unique Constraints as well.

Note that this is possible with a data warehouse because, as a best practice, we do not (always; there are exceptions) make the primary key column the clustered key. For example, we may create the primary key using all the foreign key columns but create the clustered index on the DateKey column.
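
A small sketch of that practice; the table and column names are illustrative.

-- Primary key on the foreign key columns, nonclustered,
-- with the clustered index created on DateKey instead
CREATE TABLE dbo.FactSalesExample
(
 DateKey int NOT NULL,
 ProductKey int NOT NULL,
 CustomerKey int NOT NULL,
 SalesAmount money NOT NULL,
 CONSTRAINT PK_FactSalesExample PRIMARY KEY NONCLUSTERED (DateKey, ProductKey, CustomerKey)
);
GO
CREATE CLUSTERED INDEX IX_FactSalesExample_DateKey ON dbo.FactSalesExample (DateKey);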

Friday, July 29, 2016

Fact Loading: Switching a loaded temporary table into an empty partition of the fact table

Fact tables in data warehouses are always partitioned, and with most solutions, they are partitioned on a date key. Generally, we use SSIS for loading dimension and fact tables, and ETLs written for loading the Fact Table can be optimized by loading data into another table and switching it into an empty partition of the fact table. This is one of the best practices in data warehouse data loading, and the technique for doing it is not complex.

Assume you have a Fact Table that is partitioned by months, for example, one partition for 201601, another for 201602, and so on, and data is loaded till July 2016 (201607). Then, when we load August data, we can load it into a new table that has the same structure and switch that table into the 201608 partition of the fact table, optimizing the data loading and minimizing the impact on the fact table. The below diagram shows it:


Here is the way of doing it. I will take the same Fact Table created in this post: SSIS - Fact Loading with Dimension Type II and Dimension Type I. You can use the same code for creating the dimension and fact tables for trying this out. In that post, I loaded data related to June and July 2016. Let's see how we can load August data using the technique discussed in this post.

1. Let's take this data set for loading August data.

SELECT CONVERT(date, '2016-08-01') As TransactionDate, 22 AS ProductId, 100 AS CustomerId, 200 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity
UNION ALL
SELECT CONVERT(date, '2016-08-12') As TransactionDate, 23 AS ProductId, 101 AS CustomerId, 200 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity
UNION ALL
SELECT CONVERT(date, '2016-08-14') As TransactionDate, 23 AS ProductId, 100 AS CustomerId, 201 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity
UNION ALL
SELECT CONVERT(date, '2016-08-22') As TransactionDate, 22 AS ProductId, 100 AS CustomerId, 201 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity

2. We need a different table with the same structure for loading processed data. The below code creates another table for holding data temporarily. Note that it has the same columns and is partitioned the same as the FactSales table.

CREATE TABLE FactSalesTemp
(
 DateKey int NOT NULL INDEX IX_FactSalesTemp CLUSTERED,
 ProductKey int NOT NULL,
 CustomerKey int NOT NULL,
 EmployeeKey int NOT NULL,
 SalesAmount money NOT NULL,
 SalesQuantity smallint NOT NULL
)
ON ps_SalesDate (DateKey)
GO

3. I will be using the same SSIS package used with my previous post. As you see, now the destination is set to the newly created table, FactSalesTemp.


4. Before loading data into the new table, let's add another partition to both tables for August data. All we need to do is add a file group to the scheme and a boundary value to the function:

-- Add another file group to the scheme 
ALTER PARTITION SCHEME ps_SalesDate
NEXT USED [PRIMARY]
GO

-- splitting the last partition by adding another boundary value
ALTER PARTITION FUNCTION pf_SalesDate ()
SPLIT RANGE (20160801)
GO

5. Now we can execute the SSIS package and load data into the newly created table. Once the data set is loaded into FactSalesTemp, we can check both tables and see how the partitions are filled.

How partitions are filled;


How tables are filled;


6. As you see, data is loaded into the newly created table, into partition 9. Now we need to switch the new table into the 9th partition of FactSales. Here is the way of doing it.

ALTER TABLE dbo.FactSalesTemp 
SWITCH PARTITION 9 TO dbo.FactSales PARTITION 9;

7. Now if you check the records in the table, you will see that FactSales is loaded with the new data, and it is loaded into the correct partition.


You can have the scheme and function alterations in the Control Flow as an Execute SQL Task before the Data Flow Task, and the partition switch in another Execute SQL Task just after the Data Flow Task, completing the SSIS package.

Thursday, July 28, 2016

SSIS - Fact Loading with Dimension Type II and Dimension Type I

Dimensional Modeling in data warehousing, using either a Star or Snowflake schema, is still the most famous structure applied and widely used, though summarized tables using in-memory and columnstore concepts are slowly taking over the traditional method. It is very common to see SSIS (SQL Server Integration Services) used for data loading, applying various patterns. Searches show different types of articles and posts, but it is rare to find a complete post that talks about loading a Fact table using SSIS, looking up both Type I and Type II dimensions. That is the reason for this post; it talks about how to load a Fact Table, getting keys from both Type I and Type II dimensions.

Let's make a small data warehouse with a Star Schema and populate some sample records. The following code creates a database called SalesDataWarehouse and then creates dimension tables named DimDate, DimCustomer, DimProduct, and DimEmployee. As you see, DimEmployee is a slowly changing dimension of Type II.

CREATE DATABASE SalesDataWarehouse;
GO

-- Connect and create tables
USE SalesDataWarehouse;
GO

CREATE TABLE dbo.DimDate
(
 DateKey int  primary key
 , [Date] date not null
 , [Month] varchar(20) not null
 , MonthNumber smallint not null
 , QuarterName varchar(20) not null
 , QuarterNumber smallint not null
 , [Year] smallint not null
);

CREATE TABLE dbo.DimCustomer
(
 CustomerKey int identity(1,1) primary key
 , FullName varchar(100) not null
 , Town varchar(100) not null
 , SourceKey int not null
);

CREATE TABLE dbo.DimProduct
(
 ProductKey int identity(1,1) primary key
 , Product varchar(100) not null
 , Brand varchar(100) not null
 , SourceKey int not null
);

CREATE TABLE dbo.DimEmployee
(
 EmployeeKey int identity(1,1) primary key
 , Name varchar(100) not null
 , SalesTown varchar(100) not null
 , StartDate date not null
 , EndDate date  null
 , IsCurrent bit not null default (1)
 , SourceKey int not null
);

The next code is for loading some sample records. Note that DimDate is loaded from the WideWorldImportersDW database, as its date dimension table contains records for the current year. However, this can be loaded using a simple script as well.

Have a look at the records related to DimEmployee. Since it is an SCD Type II dimension, it holds history records. As you see, there are two records for James Phil, with validity periods.

-- Inserting date records from the WideWorldImportersDW table
INSERT INTO dbo.DimDate
 (DateKey, [Date], [Month], MonthNumber, QuarterName, QuarterNumber, [Year])
SELECT CONVERT(int, CONVERT(varchar(8), [Date], 112)), [Date], [Month], [Calendar Month Number]
 , 'Quarter ' + CONVERT(varchar(2), DATEPART(Q, [Date])), DATEPART(Q, [Date]), [Calendar Year]
FROM WideWorldImportersDW.Dimension.Date;

-- Inserting records to Customer
INSERT INTO dbo.DimCustomer 
 (FullName, Town, SourceKey)
VALUES
 ('Jane Knight', ' Town 1', 100)
 , ('Jack walter', 'Town 2', 101);

-- Inserting records to Product
INSERT INTO dbo.DimProduct 
 (Product, Brand, SourceKey)
VALUES
 ('Product 1', ' Brand 1', 22)
 , ('Product 2', 'Brand 3', 23);

-- Inserting records to Customer
INSERT INTO dbo.DimEmployee
 (Name, SalesTown, StartDate, EndDate, IsCurrent, SourceKey)
VALUES
 ('Peter Kevin', 'Town 1', '2016-06-01', null, 1, 200)
 , ('James Phil', 'Town 2', '2016-06-01', '2016-06-30', 0, 201)
 , ('James Phil', 'Town 3', '2016-06-30', null, 1, 201);


The next code is for creating the Fact table. Generally, the Fact Table is partitioned, therefore this code creates a partition function and scheme for it. Note that the best practice is to have different File Groups for partitions, but let's just use PRIMARY (I will be discussing partition switching on fact loading in another post, extending the same code).

-- creating the partition
CREATE PARTITION FUNCTION pf_SalesDate (int)
AS RANGE RIGHT
FOR VALUES (20160101, 20160201, 20160301, 20160401, 20160501, 20160601, 20160701)
GO

-- creating the scheme
CREATE PARTITION SCHEME ps_SalesDate
AS PARTITION pf_SalesDate
TO ([PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY])
GO

CREATE TABLE FactSales
(
 DateKey int NOT NULL INDEX IX_FactSales CLUSTERED,
 ProductKey int NOT NULL,
 CustomerKey int NOT NULL,
 EmployeeKey int NOT NULL,
 SalesAmount money NOT NULL,
 SalesQuantity smallint NOT NULL
)
ON ps_SalesDate (DateKey)
GO

Let's insert some sample records into the Fact Table too.

INSERT INTO FactSales (DateKey, ProductKey, CustomerKey, EmployeeKey, SalesAmount, SalesQuantity)
VALUES
 (20160605, 1, 1, 1, 500, 2)
 , (20160615, 2, 2, 1, 500, 2)
 , (20160622, 2, 2, 2, 500, 2)
 , (20160624, 2, 1, 1, 500, 2)

Done. Now we have to create an SSIS package for loading new records. As you noted, we have records related to June 2016, and we will be loading records related to July 2016. Let's use the following recordset for loading the Fact Table.

SELECT CONVERT(date, '2016-07-01') As TransactionDate, 22 AS ProductId, 100 AS CustomerId, 200 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity
UNION ALL
SELECT CONVERT(date, '2016-07-12') As TransactionDate, 23 AS ProductId, 101 AS CustomerId, 200 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity
UNION ALL
SELECT CONVERT(date, '2016-07-14') As TransactionDate, 23 AS ProductId, 100 AS CustomerId, 201 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity
UNION ALL
SELECT CONVERT(date, '2016-07-22') As TransactionDate, 22 AS ProductId, 100 AS CustomerId, 201 AS EmployeeId, 500 AS SalesAmount, 2 AS SalesQuantity


Let's create an SSIS package and add a connection to the data warehouse. I have a database named Sales and will be using it as the source even though we do not extract records from the source. As you see, I have a Data Flow Task in the Control Flow.


Since we do not get surrogate key values from the source, we need Lookups on the dimension tables for getting the surrogate key values. As you see in the image below, I have added an OLEDB Source to connect with Sales and four Lookups for getting surrogate key values related to Date, Customer, Employee and Product.


Let's have a look at the source first. It is configured with the Sales database, and the query above is used for populating the required records.


There are three Type I dimensions. Let's have a look at one and see how it needs to be configured. In order to get DateKey from the DimDate dimension, we need to look up DimDate using the TransactionDate that comes from the source. We need to configure the lookup with the DataWarehouse connection and select DimDate as the lookup table. Once the connection is set, we need to link the TransactionDate coming from the source with Date in the dimension. We need to select DateKey, as it is the one we need when we insert records into the Fact Table. The selected DateKey is added to the flow automatically.


Similar configuration is required for DimCustomer and DimProduct for getting their surrogate key values. We need to link CustomerID from the source to SourceKey in the DimCustomer table and get the CustomerKey, and link ProductID from the source to SourceKey in the DimProduct table and get the ProductKey. However, getting EmployeeKey is different, as it is an SCD Type II dimension. Let's have a look at the configuration of the Employee Lookup.

I will take you through the steps required for configuring this. First, we need to make sure that the Cache mode is set to either Partial Cache or No Cache, because we are going to parameterize the lookup query.


Next is setting up the connection. We need to connect it to DataWarehouse and use a query for loading dimension records. Note the way it has been written; I have used question marks for parameterizing the query, as we need to pick the right record related to the TransactionDate.


Just like with the other dimensions, we need to match two columns from the source and the dimension to find the record. However, an additional link is required for getting values for the parameters, hence we need to link TransactionDate with StartDate, though that is not the exact link we need.


You will see the reason for joining TransactionDate and StartDate in the next step. We now need to adjust the lookup query in the Advanced section, like below.
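
A sketch of the adjusted query, assuming the DimEmployee structure created above; the question marks are the parameters mapped in the next step.

-- Picks the DimEmployee record whose validity period covers the TransactionDate
SELECT EmployeeKey, SourceKey, StartDate
FROM dbo.DimEmployee
WHERE SourceKey = ?
 AND ? >= StartDate
 AND (? < EndDate OR EndDate IS NULL);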


Once adjusted, we need to set the parameters like below. It allows us to select TransactionDate as a parameter value because of the above join. Note that parameters are 0-based indexed.


All done; finally it needs the destination. Set an OLEDB Destination for loading FactSales as the last step of the Data Flow task. Once all configurations are done, you can run the package and see how FactSales has been loaded.




Friday, July 8, 2016

Conformed dimensions and Stovepipe data marts

A long time back, I wrote a small note on Stovepipe data marts; I thought I would make a blog post on it again as it is still unknown to many. The simplest way of remembering what a Stovepipe Data Mart is: a data mart that cannot be either extended to or integrated into the Data Warehouse. This happens if your data mart does not contain Conformed Dimensions.


How do we know whether we have designed conformed dimensions?

Most of the entities identified during the requirement gathering process are common to the entire organization; for example, Product, Employee and Customer are some of these entities. Although these entities have many attributes, only a few attributes are taken into account during the design of data marts. Not only that, some attributes may not be correctly identified and may be designed wrongly, though the wrong design satisfies the current requirement to some extent. This does not lead you to conformed dimensions.

In order to make sure that a dimension is a conformed dimension, make sure that all related and known attributes are correctly added. There can be a situation where you need only a few attributes for the current data mart, but if you have identified many attributes, make sure everything is added. When you add them, make sure that the correct business logic related to them is implemented. This makes the dimension a conformed dimension. For example, if you have designed a dimension called Product for the Sales data mart and you have added all possible attributes and implemented all possible business logic, then it is a conformed dimension and you will be able to use it when you design the Marketing or Finance data marts.

Coming back to Stovepipe data marts: simply, if you do not have conformed dimensions, the data mart can be considered a Stovepipe data mart. You will not be able to integrate the data mart with the data warehouse without doing modifications to it, which will be costly, or even impossible considering factors like resource availability, system usage and budget constraints. Therefore, make sure that you are not heading towards a Stovepipe data mart.