
Sunday, June 24, 2018

Azure Data Factory V2 - Lookup and ForEach Activities

Here is my third video on Azure Data Factory. This video discusses and demonstrates:

  • Why we need iteration in ETLs
  • Lookup Activity
  • ForEach Activity
  • Demo: Iterating over a data collection using ForEach Activity, with data loaded from Lookup Activity

Thursday, February 1, 2018

Azure Data Factory V2 - Creating a Pipeline using Author-Tab

Here is another video on Azure Data Factory V2. It explains how to create a Pipeline manually using the Author-Tab, and discusses and demonstrates:
  • How to create Linked Services manually
  • How to create Datasets manually
  • How to create a Pipeline and add Activities
  • And how to schedule the Pipeline

Thursday, January 25, 2018

Azure Data Factory V2 - Copying On-Premise SQL Server data to Azure Data Lake Store

Azure Data Factory has been enhanced significantly with V2, and its support for cloud ETL and ELT is excellent now. The first version was not very user-friendly, but with V2 it is no longer difficult to implement ETL solutions.

This video discusses:
  • Why we need Azure Data Factory
  • What is Azure Data Factory
  • How it works and how it should be used
  • Integration Runtime - installing it for connecting with sources
  • Copying On-Premise SQL Server Data to Azure Data Lake using Data Factory

Here is the video:

Here are some sample screens:

Saturday, March 25, 2017

Loading Data Script for Snowflake type Date Dimension

Update: If you need the same script with financial year elements, read the new post.

The Date dimension is the most common dimension in data warehousing, and it is generally designed as a single table. However, if it needs to be used with multiple Fact Tables at different levels, the Date dimension should be broken into multiple tables, which creates a Snowflake Schema. If the table is broken into multiple tables such as Year, Quarter, Month and Date, loading them is a bit difficult. A script for loading a single table is available on many sites, but it is quite difficult to find a script for loading multiple tables. I had the same requirement and could not find a proper script, hence wrote one.

Here are the scripts if you need them.

Here is the script for creating the tables:
CREATE TABLE dbo.DimYear
(
 YearKey smallint 
 , Year smallint not null
 , Constraint pk_DimYear Primary Key (YearKey)
);
GO

CREATE TABLE dbo.DimQuarter
(
 QuarterKey smallint 
 , YearKey smallint not null
 , Quarter smallint not null
 , YearQuarter varchar(20) not null
 , QuarterName varchar(20) not null
 , Constraint pk_DimQuarter Primary Key (QuarterKey)
 , Constraint fk_DimQuarter_DimYear Foreign Key (YearKey)
  References dbo.DimYear (YearKey)
);
GO

CREATE TABLE dbo.DimMonth
(
 MonthKey int 
 , QuarterKey smallint not null
 , MonthNumber smallint not null
 , MonthName varchar(20) not null
 , YearMonth varchar(20) not null
 , MonthShortName char(3) not null
 , Constraint pk_DimMonth Primary Key (MonthKey)
 , Constraint fk_DimMonth_DimQuarter Foreign Key (QuarterKey)
  References dbo.DimQuarter (QuarterKey)
);
GO

CREATE TABLE dbo.DimDate
(
 DateKey int 
 , MonthKey int not null
 , Date date not null
 , WeekDayNumber smallint not null
 , WeekDayName varchar(20) not null
 , DayOfMonth smallint not null
 , DayOfYear smallint not null
 , IsWeekend bit not null
 , IsHoliday bit not null
 , WeekNumberOfYear smallint not null
 , Constraint pk_DimDate Primary Key (DateKey)
 , Constraint fk_DimDate_DimMonth Foreign Key (MonthKey)
  References dbo.DimMonth (MonthKey)
);
GO


And here is the script for loading all the tables:
SET DATEFIRST 1;

DECLARE @StartDate date = '2010-01-01';
DECLARE @EndDate date = '2019-12-31';
DECLARE @YearKey smallint;
DECLARE @Quarter smallint, @QuarterKey smallint;
DECLARE @Month int, @MonthKey int;
DECLARE @Date int, @DateKey int;

WHILE (@StartDate <= @EndDate)
BEGIN

 -- Inserting years
 SET @YearKey = YEAR(@StartDate);
 IF NOT EXISTS (SELECT * FROM dbo.DimYear WHERE YearKey = @YearKey)
  INSERT INTO dbo.DimYear (YearKey, [Year]) VALUES (@YearKey, @YearKey);


 -- Inserting quarters
 SET @QuarterKey = Convert(smallint, Convert(varchar(4), YEAR(@StartDate)) + Convert(varchar(1), DATEPART(q, @StartDate)))
 SET @Quarter = DATEPART(q, @StartDate);

 IF NOT EXISTS (SELECT * FROM dbo.DimQuarter WHERE QuarterKey = @QuarterKey)
  INSERT INTO dbo.DimQuarter (QuarterKey, YearKey, [Quarter], YearQuarter, QuarterName) 
  VALUES (@QuarterKey, @YearKey, @Quarter, Convert(varchar(4), YEAR(@StartDate)) + 'Q' + Convert(varchar(4), @Quarter)
   , CASE @Quarter WHEN 1 THEN 'First Quarter' WHEN 2 THEN 'Second Quarter' WHEN 3 THEN 'Third Quarter' ELSE 'Fourth Quarter' END);
  

 ---- Inserting months
 SET @MonthKey = Convert(int, Convert(varchar(4), YEAR(@StartDate)) + (CASE LEN(MONTH(@StartDate)) WHEN 1 THEN '0' ELSE '' END) + Convert(varchar(2), MONTH(@StartDate)));
 SET @Month = MONTH(@StartDate)
 IF NOT EXISTS (SELECT * FROM dbo.DimMonth WHERE MonthKey = @MonthKey)
  INSERT INTO dbo.DimMonth (MonthKey, QuarterKey, MonthNumber, MonthName, YearMonth, MonthShortName) 
  VALUES (@MonthKey, @QuarterKey, @Month, DATENAME(MONTH, @StartDate)
  , Convert(varchar(4), YEAR(@StartDate)) + ' ' + LEFT(DATENAME(MONTH, @StartDate), 3)
  , LEFT(DATENAME(MONTH, @StartDate), 3));
  
 ---- Inserting dates
 SET @DateKey = Convert(int, Convert(varchar(8), @StartDate, 112))
 IF NOT EXISTS (SELECT * FROM dbo.DimDate WHERE DateKey = @DateKey)
  INSERT INTO dbo.DimDate (DateKey, MonthKey, Date, WeekDayNumber, WeekDayName, DayOfMonth
   , DayOfYear, IsWeekend, IsHoliday, WeekNumberOfYear) 
  VALUES (@DateKey, @MonthKey, @StartDate, DatePart(WEEKDAY, @StartDate), DateName(WEEKDAY, @StartDate), DatePart(DAY, @StartDate)
   , DatePart(DAYOFYEAR, @StartDate), CASE WHEN DatePart(WEEKDAY, @StartDate) IN (6, 7) THEN 1 ELSE 0 END, 0
   , DatePart(WEEK, @StartDate));

 SET @StartDate = DATEADD(dd, 1, @StartDate);
END;
GO


Thursday, December 17, 2015

What is Data Wrangling? Is it same as ETLing?

While I was reading a newsletter received from Hortonworks, I noticed an article titled Data Wrangling. In my experience, across dozens of business intelligence implementations, though I have worked with heterogeneous data sets, I had never used this term, even when discussing ETLing. I asked a few people and it was unknown to many of them, hence this post, discussing how I see Data Wrangling and how it differs from the well-known ETLing.

ETL (Extract, Transform and Load) is a common technique we use with Data Integration (DI), Data Warehousing and Business Intelligence. It deals mostly with structured data from well-known data sources and is largely automated with tools. It extracts data from various, scattered systems, prepares the data so it is rich and consumable, and loads it into the destination, typically a data warehouse. Data Wrangling does the same, but with a few differences.

[image taken from: http://www.datawatch.com/what-is-data-wrangling/]

Data Wrangling works more on unorganized, unstructured, large data sets rather than a set of structured data. It describes a largely manual process that converts data from one raw form into another format that is more readable and organized for analysis. As per the articles I read, the term was introduced by Trifacta, which offers a tool to help with this process. Further, the person who does this process is called a Data Wrangler.


Saturday, June 27, 2015

Ways of incrementally loading data to a data warehouse

I once wrote a post on Incremental ETL (Incremental ETL: How to identify changes for fact data when no support from the source) that discusses how to use Checksum for identifying changes before transferring the extracted data from the source, specifically when the source does not support identifying changed records. This raised a question: what are the other ways of incrementally loading a data warehouse? This post speaks about them.

There are multiple ways of loading incremental data into a data warehouse. One has to be picked based on factors such as performance, whether the data warehouse accepts updates, whether it maintains history and whether a staging environment is maintained (Do we need a staging database for warehousing projects?). Here are the most commonly used ways:

CDC enabled source tables
If the source table is enabled with CDC (Change Data Capture), changes can be easily identified and the warehouse table, either dimension or fact, can be updated appropriately. If the ETL is written with SSIS, the CDC-related tasks can be used for identifying changes, and the CDC Splitter can be used for identifying the type of change, such as INSERT, UPDATE or DELETE, and then updating the warehouse table accordingly.
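As a minimal sketch (not from the original post), here is how CDC could be enabled on a hypothetical source table dbo.Sales; the database and table names are assumptions:

-- Enable CDC at the database level (requires sysadmin; SQL Server Agent must be
-- running for the capture and cleanup jobs). Names are illustrative only.
USE SourceDB;
GO
EXEC sys.sp_cdc_enable_db;
GO

-- Enable CDC for the source table; a capture instance named dbo_Sales is created.
EXEC sys.sp_cdc_enable_table
 @source_schema = N'dbo'
 , @source_name = N'Sales'
 , @role_name = NULL; -- no gating role for this sketch
GO

-- Changes between two LSNs can then be read with the generated function
-- cdc.fn_cdc_get_all_changes_dbo_Sales(@from_lsn, @to_lsn, N'all'), where the
-- __$operation column indicates insert (2), update (4) or delete (1).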


The Slowly Changing Dimension transformation
This is the most commonly used one for loading dimension tables (not very large dimensions) as it supports identifying changes by comparing the source and the warehouse, performing inserts and updates, and most importantly handling Type 2 slowly changing dimensions. It does not require the source to maintain a high water mark and can be easily implemented using the wizard provided with SSIS.


The MERGE statement
This T-SQL statement allows us to perform insert, update and delete operations in a single statement that combines the source and the destination (warehouse). Since the code is executed by the SQL Server engine, the SQL Server instance that hosts the warehouse should be able to access the source. This is possible when the source is one of the databases in the same SQL Server instance, or when the source is added as a linked server. The most common practice is to load data from the source, perform the necessary transformations using SSIS, and then load the transformed data into a temporary table in the warehouse. A stored procedure in the warehouse then performs the merge using the MERGE statement, as sketched below.
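As a minimal sketch (the table and column names are assumptions, not from the original post), such a stored procedure could merge a transformed temporary table into a dimension like this:

-- Merging transformed staging data into a dimension (Type 1 updates only);
-- dbo.DimProduct and dbo.TempProduct are illustrative names.
MERGE dbo.DimProduct AS t
USING dbo.TempProduct AS s
 ON t.ProductCode = s.ProductCode
WHEN MATCHED AND (t.ProductName <> s.ProductName OR t.ListPrice <> s.ListPrice) THEN
 UPDATE SET t.ProductName = s.ProductName
  , t.ListPrice = s.ListPrice
WHEN NOT MATCHED BY TARGET THEN
 INSERT (ProductCode, ProductName, ListPrice)
 VALUES (s.ProductCode, s.ProductName, s.ListPrice);
-- A WHEN NOT MATCHED BY SOURCE THEN DELETE clause can be added if rows removed
-- at the source should also be removed from the warehouse.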


Using Checksum
This technique is used when there is no way of identifying changes at the source and the load is too heavy for the Slowly Changing Dimension transformation. It is usually implemented with a staging environment, and a checksum value is used for identifying whether a record has changed or not. Read more on this at: http://dinesql.blogspot.com/2015/04/incremental-etl-how-to-identify-changes.html

The Lookup transformation
This SSIS transformation is used for smaller dimensions. It is not specifically provided for handling this scenario, but it can be used for comparing source data with existing data in the warehouse and finding matching records.


Considering all five above, my preferred ones are Checksum and MERGE because they offer flexibility, maintainability and, most importantly, good performance.

Monday, June 15, 2015

ETL Data Flow Architectures

ETLing is the way of getting data from one or more sources into your reporting environment, or into a repository that holds enterprise-wide data for all reporting and analytical requirements. It is a common and well-known term in data warehousing because a data warehouse solution cannot exist without an ETL implementation.

ETL stands for extract, transform and load, describing the data flow. Generally we extract data from sources, validate and transform the extracted data, and load the destination, which most of the time is a data warehouse. If the ETL solution is very small and not complex, the data flow is simply from sources to destination without any components in between. If it is not, the data flow architecture is different and includes additional components. The architecture of the data flow can be determined by considering the following:
  • The number of data sources to be accessed.
  • The amount of data to be accessed and transferred in a single execution.
  • The type of loading, whether it is complete refresh or incremental loading.
  • Complexity of validations to be applied (transformations).
  • Data generation frequency at sources.
  • The extraction windows.
  • The accessibility of data sources.
Considering the above factors, there are three possible ETL data flow architectures:

Single-stage ETL architecture
The single-stage ETL architecture is suitable for small and less complex data flows. With this architecture, transformation is done at the extraction and in-flight. 


Combining First Name, Middle Name and Last Name into a Full Name at extraction, and finding duplicates of the Customer entity, are some examples of transformations performed with this architecture.
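For instance, the name-combining transformation can be pushed into the extraction query itself; here is a minimal sketch assuming a hypothetical source table dbo.Customer:

-- Transformation at extraction: forming Full Name in the source query
-- (table and column names are assumptions for illustration)
SELECT CustomerID
 , FirstName + ' ' + ISNULL(MiddleName + ' ', '') + LastName AS FullName
 , City
FROM dbo.Customer;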


Two-stage ETL architecture
This data flow architecture maintains an additional environment called Staging. It can be a database (in most cases), a set of text files or even Excel files. The staging is used as a temporary container for extracted data, and it is implemented based on a few factors such as different data acquisition windows, identifying modifications done to previously loaded data, and auditing purposes. For more info on staging, refer to my post: Do we need a staging database for warehousing projects?.


Transformation of extracted data can be done at several points: at extraction from the sources and from staging, in-flight between the sources and staging, and in-flight between staging and the destination.

Three-stage ETL architecture
This extends the two-stage architecture by introducing another layer called the Landing Zone.


This helps to reduce the workload (including transformation at extraction) on the sources. We can extract data as it is in the source without adding any complexity to the extraction queries, minimizing the time it takes to complete the extraction. Possible transformations can then be applied at the Landing Zone or at later stages. Sometimes sources push data out (for example via reports) without letting us connect to them. In a scenario like that, the Landing Zone is the place where all source files are placed.


Saturday, June 13, 2015

What is "High Water Mark" in ETLing?

When extracting data from one database to another in a continuous fashion, two common techniques are used: delete the destination and reload it with the extraction, or load the destination incrementally. The first technique is straightforward, but for the second, changes have to be identified; this is commonly done with a datetime column in the source that holds either the created date or the last modified date.

What is the link between this and the "High Water Mark"? A high water mark indicates the highest water level reached by a tide or flood. The same idea can be applied to our scenario, considering the maximum value already extracted from the datetime column in the source as the High Water Mark.
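As a simple illustration (the table, column and variable names are assumptions, not from the original post), an incremental extraction driven by a high water mark could look like this:

-- The high water mark recorded by the previous run, usually kept in a small control table
DECLARE @LastModifiedDate datetime = '2015-06-12';

-- Extract only rows created or modified after the high water mark
SELECT SalesOrderID, OrderDate, LineTotal, ModifiedDate
FROM dbo.SalesOrder
WHERE ModifiedDate > @LastModifiedDate;

-- After a successful load, MAX(ModifiedDate) of the extracted set is stored
-- as the new high water mark for the next execution.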


Thursday, April 30, 2015

Incremental ETL: How to identify changes for fact data when no support from the source

"Data warehouse is a read-only database": this is one of the characteristics of data warehouse and we used to say "History cannot be changed" for supporting this characteristic. Therefore we always try our best not to bring changes into the data warehouse.

This was thoroughly considered and applied with traditional data warehousing because the cost of processing OLAP models was very high. As I see it, with modern implementations this cost has gone down for two reasons: Tabular Model implementations instead of Multi-Dimensional Models, and user-specific models with Self-Service BI. Therefore, if required, changes can be brought into the data warehouse to some extent, considering both the cost and the benefits of doing so.

However, when we try to bring changes into the data warehouse, one major issue we face is poor support from sources for finding changes in previous transactions that have already been loaded. Some sources maintain a timestamp indicating the changed date, or a flag indicating whether the record has been modified. If there is no such thing in the source, changes cannot be easily identified. In this scenario, one way of identifying changes is checking each and every record, comparing each and every column in the fact table, and seeing whether they are different. If the dataset is very large and the data warehouse is large too, this is not very practical. Therefore we should either stop taking changes once data is loaded into the data warehouse, or take them under some agreed rules.

Here is one of the ways I have used. First of all, we need to set an acceptable time period for accepting changes from the source. The agreed period can be 1 week, 1 month or 3 months; this is based on certain factors such as the business logic involved with the transactions, complexity and volume. Once agreed, the next step is holding loaded data in the staging environment for the same period. If we have to expect changes for the last three months, we need to make sure the staging database has the last three months of extracted data. In addition to that, we need to maintain an additional column for holding a checksum. That is what we use for comparing records between the source and staging, instead of comparing each and every column.

For example, assume that Sales data related to FactSales in SalesDataWarehouse is loaded via a staging table called Sales. Considering the above requirement, the structure of the staging table must be created as below:

-- Staging table for holding sales
CREATE TABLE dbo.Sales
(
 SalesOrderID int NOT NULL
 , OrderDate datetime NOT NULL
 , ShipDate datetime NOT NULL
 , SalesOrderNumber varchar(25) NOT NULL
 , Product varchar(100) NOT NULL
 , ListPrice money NOT NULL
 , UnitPrice money NOT NULL
 , OrderQty int NOT NULL
 , LineTotal money NOT NULL
 , CheckSumValue int NOT NULL -- Checksum value from above columns
 , RecordType tinyint NOT NULL -- 1-New, 2-Modified, 3-Loaded
);
GO

-- Creating clustered index
CREATE UNIQUE CLUSTERED INDEX IX_Sales_SalesOrderID ON dbo.Sales (SalesOrderID);
GO

-- Creating an index on the Checksum column
CREATE INDEX IX_Sales_CheckSumValue ON dbo.Sales (CheckSumValue);
GO

When we load new data, for example yesterday's data, we load it into the staging table with RecordType = 1. Once inserted, the Checksum value must be generated for the newly inserted records.

UPDATE dbo.Sales
 SET CheckSumValue = Checksum(SalesOrderID, OrderDate
      , ShipDate, SalesOrderNumber
      , Product, ListPrice, UnitPrice
      , OrderQty, LineTotal)
WHERE RecordType = 1;

After that, the ETL module can access the new records for loading the data warehouse. Once done, RecordType must be set to 3. Now, how do we handle changes for the last three months?

For that, we need to get the last three months' transactions from the source again and calculate the checksum for the extracted data. If SSIS is used for loading the staging, we can bring the data through a Data Flow and generate the new checksum there. If a method like bcp or T-SQL is used, it is better to load the rows into another temporary table. Once we have the dataset with the new checksum, changed records can be identified by comparing the new checksum with the old one, and the staging records can be updated as shown after the sketch below.
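The UPDATE statement below joins a table named TempSales that is not defined in the post; as a minimal sketch, assuming it simply mirrors the source columns of dbo.Sales, it could be created and loaded like this:

-- Hypothetical temporary table holding the re-extracted last three months of source data
-- (no CheckSumValue or RecordType columns; the checksum is computed during the comparison)
CREATE TABLE dbo.TempSales
(
 SalesOrderID int NOT NULL
 , OrderDate datetime NOT NULL
 , ShipDate datetime NOT NULL
 , SalesOrderNumber varchar(25) NOT NULL
 , Product varchar(100) NOT NULL
 , ListPrice money NOT NULL
 , UnitPrice money NOT NULL
 , OrderQty int NOT NULL
 , LineTotal money NOT NULL
);
GO
-- dbo.TempSales is then populated with the last three months of transactions,
-- for example via bcp or an INSERT...SELECT from the source.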

UPDATE dbo.Sales
 SET OrderDate = t.OrderDate
  , ShipDate = t.ShipDate
  , SalesOrderNumber = t.SalesOrderNumber
  , Product = t.Product
  , ListPrice = t.ListPrice
  , UnitPrice = t.UnitPrice 
  , OrderQty = t.OrderQty
  , LineTotal = t.LineTotal
  , RecordType = 2
FROM dbo.Sales s
 INNER JOIN TempSales t
  ON s.SalesOrderID = t.SalesOrderID 
   AND s.CheckSumValue != Checksum(t.SalesOrderID, t.OrderDate
      , t.ShipDate, t.SalesOrderNumber
      , t.Product, t.ListPrice, t.UnitPrice
      , t.OrderQty, t.LineTotal);

This sets RecordType to 2, indicating that the record has changed. Once updated, another ETL can access these records and update FactSales in the data warehouse, as sketched below.
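A minimal sketch of that second step, assuming FactSales carries SalesOrderID and measures with these illustrative column names, and that both databases are reachable from the same instance:

-- Hypothetical follow-up ETL step: push the changed measures into the warehouse
-- and mark the staging rows as loaded (all FactSales column names are assumptions).
UPDATE f
 SET f.OrderQty = s.OrderQty
  , f.UnitPrice = s.UnitPrice
  , f.LineTotal = s.LineTotal
FROM SalesDataWarehouse.dbo.FactSales f
 INNER JOIN dbo.Sales s
  ON f.SalesOrderID = s.SalesOrderID
WHERE s.RecordType = 2;

UPDATE dbo.Sales
 SET RecordType = 3 -- 3-Loaded
WHERE RecordType = 2;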

This reduces the complexity related to finding changes, as well as the time and resources required for processing. However, since we use CHECKSUM, in certain situations the same value might be generated even though there is a change. If you experience this, it is recommended to use HASHBYTES instead of CHECKSUM.
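As a minimal sketch (the HashValue column is an assumption; it would replace CheckSumValue and needs to be varbinary(32) for SHA2_256), the hash generation could look like this:

-- HASHBYTES takes a single string/binary input, so the columns are concatenated
-- with a separator; the result is varbinary, unlike the int returned by CHECKSUM.
UPDATE dbo.Sales
 SET HashValue = HASHBYTES('SHA2_256'
  , CONCAT(SalesOrderID, '|', OrderDate, '|', ShipDate, '|', SalesOrderNumber, '|'
   , Product, '|', ListPrice, '|', UnitPrice, '|', OrderQty, '|', LineTotal))
WHERE RecordType = 1;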


Tuesday, April 28, 2015

ETL or ELT? Conversation I had with a participant

Everyone is familiar with the term "ETL" and everyone knows about the processes related to it. The concept refers to extracting data from one or more heterogeneous sources, transforming it into a structure the destination can accept while applying all possible validations and business logic, and loading the destination. Is this always ETL? Can I refer to this as ELT too? I referred to this process as ELT in one of my recent presentations while describing Big Data and Microsoft APS (Analytics Platform System), and of course I was questioned by one of the participants. Here is the conversation:

Participant: Hey Dinesh, it is a great session, learnt a lot from it, I need a clarification on an area you were discussing, have a minute?

Me: Thanks, sure, what is the area?

Participant: I have been implementing BI solutions for the last 4 years and have done many different implementations on the ETL side. During your presentation, you referred to the ETL term as ELT, not once but twice, so, is ELT a known term or did you just come up with it?

Me: Yes, I referred to it, and to your first question, whether it is a known and standard term: I am 100% sure that it is a known term, but I do not know whether it is a standard term. I believe the process can be referred to as ELT based on the way it has been implemented.

Participant: Have you seen or heard this term before? Or did you just create it :) ?

Me: :), I did not; it is being referred to by many experts, and I have seen it in videos and articles too. It is not an uncommon term, but I think it is a bit new to the industry; or perhaps it is not, but no one bothered to refer to the process as ELT.

Participant: What exactly does this ELT refer to? Is it all about transforming loaded data, violating the traditional rule?

Me: Yes it is; it refers to applying transformations after loading the destination, but it does not mean that we are violating any rules or standards. Think about loading data into a staging environment. We load the data, then do some validation and apply some business logic to the loaded data. If we isolate the "loading of staging", considering staging as the destination, what we have done is ELT, not ETL.

Participant: Yes, I can understand, I have done the same too. How does this relate to Big Data?

Me: I was explaining Microsoft Analytics Platform System. With PDW (Parallel Data Warehouse) or with a standard Hadoop environment, we distribute the process across multiple nodes to get it done quickly. So, if the data warehouse is implemented with PDW, meaning millions of records, we can load data into a distributed table in the PDW database as the first step and then process it, applying complex business logic, rather than doing that during the loading phase. This speeds up the whole process. This is not extract, transform and load; this is extract, load and transform. And we can simply refer to this as ELT.

Participant: Great, good explanation. Now it is clear. Thanks Dinesh 

Me: Thank you too, I always prefer to hear questions, it improves my knowledge and encourages me to do more exploration.


Thursday, April 23, 2015

Do we need a staging database for warehousing projects?

Deciding on an architecture for a data warehouse project and applying it is a challenge. The designer, or the architect, has to decide many things: what is to be included, what is to be avoided, and what, when and where processes have to be implemented. Among the many decisions to be taken, staging is another one, and this post speaks about it.

Do we always need a staging environment? Is it a must that has to be implemented as part of a data warehouse solution? Or can we simply ignore it?


Staging is not something to be treated as use-it-if-you-like or ignore-it-if-you-don't-like. The necessity of staging depends on a few factors. Here are some factors that can be used for determining whether you need a staging environment for your solution.

  • Source data availability during non-processing hours
    Usually we schedule the ETL process during off-peak hours. However, if the acquisition window for data extraction from the source is during peak hours, there is no option: data has to be extracted and stored in a temporary environment until the process starts. In a situation like this, it is better to use a staging database for holding the data temporarily.

  • Different partial-data acquisition windows for data extraction.
    There are instances where you need a dataset for running an ETL that loads some facts, but the full dataset is not available for one-time extraction. Assume that the source processes some data and loads the result into another table. If this happens every two hours and the process flushes the existing data before loading new data, and this is what we need for loading the facts, we need to repeatedly access the source table (that contains the processed data) before the next run and hold the rows until we have the complete set. This requires a staging environment for holding the extracted data until the dataset is ready for running the ETL.
  • Source pushes data, we do not pull.
    Some sources do not support establishing connections to them and instead push the required data out. Most of the time, such sources dump data as either CSV or XML files, but they might write data directly to a given destination. To support this, a staging environment is needed because we never allow sources to write data directly into the data warehouse.
  • Multiple data acquisition windows for data extraction.
    Multiple data acquisition windows are common with single-source extraction as well as multi-source extractions. It is common to extract from multiple sources (or multiple entities in the same source) and process them together as part of the ETL operation. Since data extraction happens in multiple acquisition windows and everything needs to be processed at a scheduled time, the extracted data has to be held until the ETL starts. Because of this, a staging environment is required for holding the data.
  • Simple transformations.
    Generally transformation happens as part of the standard ETL process, but some transformations, including validation, can be done at a different level, either at the source or at an intermediate level before starting the standard ETL process. A good example is combining all name columns such as first name, middle name and last name into a new column named Full Name. If this type of transformation cannot be done at the source level, the best place for it is staging. Converting string-typed dates to the date type, removing NULLs and correcting numeric values are common here.
  • Auditing purposes.
    What if we need to keep a record of what we extract from the sources for auditing purposes? Although the data warehouse contains everything extracted, it cannot be considered the exact set extracted because transformations make changes to it. Therefore, the best option is keeping the extracted data in the staging environment for a certain period.
  • Identifying data changes in sources.
    It is always better to extract only the changes from the sources, in order to reduce the load, and process only those. But what if the source does not offer any way to identify changes and forces us to extract everything? A situation like this is complex, but we often use a staging database to handle it to some extent. For example, staging can help us hold at least three months of data for facts. The ETL then extracts the last three months of data from the source and compares the extracted dataset with the data in staging, rather than with the data in the data warehouse, for identifying changes. Once identified, the filtered dataset from staging can be used as the source for continuing the ETL operation.
One more thing to remember: staging does not necessarily need to be a database. It can be a CSV file, an XML file or an SSIS Raw File too.

Tuesday, April 21, 2015

What is a bespoke provider?

Have you heard about this provider before? When we need to perform data extraction from another source, we need a provider for connecting with the source. Most of the time we use either ODBC or OLE DB. But some data sources use proprietary storage (or custom storage) which needs custom code or a custom module for connecting with the source. This is called a bespoke provider.

The term was initially used for custom-tailored clothes, and it is pronounced bee-SPOHK.

In many cases, if the source or the application supports extracting data as a report and saving it as either a CSV file or an XML file, we tend to use that method for data extraction rather than going through a bespoke provider. However, if the provider given is smart and flexible, it is better to use it.