Saturday, May 23, 2015

Should I consider Collation when creating a temporary table?

We always create temporary tables for manipulating data in temporary processes and reports. However, being unaware of collation-related issues with tempdb can introduce unexpected results in those processes and reports. Should we really consider it?

What is collation? Collation refers to a set of rules that describe how data is stored and compared. These rules include the ordering of characters, their case-sensitivity, accent-sensitivity, kana character types and character width. SQL Server supports setting collation at instance level, database level and even at column level.

Now, why do we need to consider tempdb collation? Have a look at the following example. It creates a database; note that its collation is SQL_Latin1_General_CP1_CS_AS (Case Sensitive). Then a table is created and two sample records are added. Assume that the business requirement is that the last SELECT must be case-sensitive and return only one record.

USE master;
GO

-- creating database with case-sensitive collation
CREATE DATABASE TestDatabase100 COLLATE SQL_Latin1_General_CP1_CS_AS;
GO

USE TestDatabase100;
GO

-- create a table
CREATE TABLE dbo.TestTable
(
 Id int PRIMARY KEY
 , Name varchar(100)
);
GO

-- insert two records
INSERT INTO dbo.TestTable 
VALUES
(1, 'abc'), (2, 'ABC');

-- Looking only for the second record.
-- This returns ONLY the second one
-- because the database is case-sensitive
-- (column collation is derived from the database).
SELECT * FROM dbo.TestTable WHERE Name = 'ABC';

Assume that the same logic has to be implemented in another piece of code using a temporary table, as a part of the newly created database.

USE TestDatabase100;
GO

-- create a temporary table
CREATE TABLE #TestTable
(
 Id int PRIMARY KEY
 , Name varchar(100)
);
GO

-- Load data from original table
INSERT INTO #TestTable 
SELECT Id, Name FROM TestTable;

-- Check for result
SELECT * FROM #TestTable WHERE Name = 'ABC';

Here is the result.



It returns both records instead of just the last record. The reason for this behavior is the tempdb collation. Since the temporary table is created inside tempdb, the collation of tempdb has been applied instead of the collation of the user-defined database. This is why we need to consider the collation of tempdb.

Below code shows how to check the collation of tempdb and how to create the temporary table properly.

-- Checking collation of tempdb
SELECT name, collation_name
FROM sys.databases
WHERE name = 'tempdb'
GO

-- create the temporary table properly
CREATE TABLE #TestTable
(
 Id int PRIMARY KEY
 , Name varchar(100) COLLATE SQL_Latin1_General_CP1_CS_AS
);
GO
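
An alternative, if you do not want to hard-code a collation name, is the COLLATE DATABASE_DEFAULT clause. Here is a minimal sketch assuming the table is created while TestDatabase100 is the current database; the column then follows the collation of the current database instead of tempdb's.

-- create the temporary table using the current database's collation
CREATE TABLE #TestTable
(
 Id int PRIMARY KEY
 , Name varchar(100) COLLATE DATABASE_DEFAULT
);
GO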


How to decide the number of data files required for tempdb?

If your tempdb is used heavily and you experience slow performance with it, adding multiple data files might help to improve the overall performance of tempdb. Having multiple data files addresses issues related to IO restrictions and avoids latch contention during Page Free Space (PFS) scans when temporary objects are created and dropped. But how do we decide the number of files that should be added to tempdb?

There is no specific formula to calculate it, only a recommendation. Too many files are not recommended at all, as that might degrade performance rather than improve it. The general rule is;

1 file per core, or around 0.25 files per core as the number of cores increases.

The optimal configuration can only be identified by doing some tests during peak hours, but it is better to follow the guidelines below when configuring your tempdb (a quick check is shown after them);

If the system has <= 8 cores, then number of data files = number of cores
If the system has > 8 cores, then number of data files = 8.
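
As a quick check, a sketch like the one below compares the number of logical cores SQL Server sees with the number of data files currently allocated to tempdb; both come from standard system views.

-- Number of logical cores visible to SQL Server
SELECT cpu_count FROM sys.dm_os_sys_info;

-- Current tempdb data files (type_desc = 'ROWS' excludes the log file)
SELECT name, physical_name, size * 8 / 1024 AS SizeMB
FROM tempdb.sys.database_files
WHERE type_desc = 'ROWS';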


Monday, May 18, 2015

Analysis Services, Integration Services and Reporting Services menu items in the Connect menu are disabled, what is the reason?

If you need to connect to Business Intelligence services via Management Studio, you use the Connect menu in the Object Explorer. Generally, this is what you see;



However, what if you see them disabled and cannot use them for connecting to services like SSAS, SSIS and SSRS? There can be a few reasons, but the most obvious one is that the required components were not installed with the SQL Server installation. How do we know whether we have installed them or not?

For that, you need to read the Configuration.ini file in the log folder. Navigate to the C:\Program Files\Microsoft SQL Server\120 (your instance version)\Setup Bootstrap\Log\ folder and open the log folder related to the installation. You should see the Configuration.ini file in the folder.



If you open the file and look for FEATURES, you should see all installed client components. If you do not see AS, RS and IS, it means you did not install them during the client components installation.
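
For illustration, the FEATURES entry might look something like the following (the actual values depend on what was selected during setup); AS, RS and IS are the keywords to look for.

FEATURES=SQLENGINE,CONN,SSMS,ADV_SSMS,AS,RS,IS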



If you need them, the installation has to be run again for the client components; make sure you select Management Tools - Basic and Management Tools - Complete.



Sunday, May 17, 2015

SQL Server Brain Basher of the Week #012 - Partitions

While I was discussing partitioning in one of my sessions this week, one enthusiast asked how many partitions can be created per table. Before answering, I asked the same question from the audience, and thought to make it the SQL Server Brain Basher of the Week.

What is the maximum number of partitions that can be added to a SQL Server table?

There are two answers for this because it is based on the installation: whether it is 32-bit or 64-bit. You can create 15,000 partitions if it is a 64-bit version of SQL Server. If it is a 32-bit version of SQL Server, only 1,000 partitions can be created.

Note that Partitioning is available only with Enterprise Edition and Developer Edition.

For more info related to Maximum Capacity Specification, refer: https://technet.microsoft.com/en-us/library/ms143432(v=sql.120).aspx.

Saturday, May 16, 2015

What is a Junk Dimension?

A dimension represents an entity that will be used for analyzing facts in the data warehouse. There are different types of dimensions such as Regular, Slowly Changing, Self-Referencing and Junk. Many know about all of these except Junk.

A junk dimension is a dimension that combines multiple dimensions whose attributes have low cardinality and only a few discrete values. This keeps the model clean and reduces the complexity of the dimensional model.

Here is an example. Assume that three dimensions have been identified during the requirement gathering process and they hold a small number of values;


Rather than maintaining three tables for these three dimensions, one dimension table can be introduced combining all values;


It represents all possible combinations of the three dimensions and can be used with fact tables without any issue. This dimension is called a Junk Dimension. Here is a sketch of how such a dimension could be formed;
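
This is a minimal sketch with hypothetical attribute names: three low-cardinality indicators are combined into a single dimension, and every possible combination is generated with CROSS JOIN.

-- A junk dimension combining three low-cardinality attributes
CREATE TABLE dbo.DimOrderIndicator
(
 OrderIndicatorKey int IDENTITY(1,1) PRIMARY KEY
 , PaymentMethod varchar(20) NOT NULL
 , OrderType varchar(20) NOT NULL
 , DeliveryType varchar(20) NOT NULL
);
GO

-- Populating it with all possible combinations
INSERT INTO dbo.DimOrderIndicator (PaymentMethod, OrderType, DeliveryType)
SELECT p.Value, o.Value, d.Value
FROM (VALUES ('Cash'), ('Credit Card')) AS p(Value)
 CROSS JOIN (VALUES ('Online'), ('In-Store')) AS o(Value)
 CROSS JOIN (VALUES ('Standard'), ('Express')) AS d(Value);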

Thursday, May 14, 2015

What are the reasons for introducing a surrogate key to a dimension table?

When designing a dimension table such as Product or Employee in a data warehouse, a surrogate key is usually introduced as a unique column for identifying each row in the entity. Even though we see this in almost all data warehouses, implementing a surrogate key is not a must. If the key of the source table related to the entity can be used as the key column of the dimension table, there is no necessity to introduce another column. How do we decide? What are the factors for deciding whether we need a new surrogate key for the dimension table or whether we can use the source key as the key column in the dimension table?


Here are some factors that can be used for deciding whether you need a new surrogate key (or artificial key) for your dimension table;

  1. Key of the source is a wide string such as CUS01X5679. It is not recommended to use a string data type as the dimension key unless it is fixed and small.
  2. Key of the source is a GUID. This is a 16-byte value and it is too lengthy for the dimension key. In addition, if the GUIDs in the source are not sequentially generated, it is not good for the key at all.
  3. Dimension table has to be loaded from multiple data sources. If the table has to be populated from multiple sources, there is a chance of receiving the same key value from two different sources for two different records. This makes it impossible to use the source key as the dimension key because two records cannot be added with the same key.
  4. Dimension table is an SCD type 2 dimension. If the dimension table is a Slowly Changing Dimension type 2, it needs to maintain the same record with multiple versions, addressing changes and maintaining history. If the source key is used as the dimension key, it is not possible to create another version of the same record with the same key.
  5. Source reuses the key for new items. Some systems allow deleting records from entities and inserting new records with the same keys that were used by the deleted records. Since the old record has already been taken into the dimension table, it is not possible to insert the new record with the same key if the source key is the dimension key.
If you do not see any of the reasons mentioned above, no harm, you can use the source key as the dimension key. If you do, a surrogate key similar to the sketch below can be introduced.
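
Here is a minimal sketch (with illustrative names) of a dimension table that uses an IDENTITY column as the surrogate key while keeping the source key as a business key, so rows can still be matched back to the source.

-- Surrogate key as an IDENTITY column; source key kept as the business key
CREATE TABLE dbo.DimCustomer
(
 CustomerKey int IDENTITY(1,1) NOT NULL PRIMARY KEY -- surrogate key used by facts
 , CustomerSourceCode varchar(20) NOT NULL          -- source (business) key, e.g. CUS01X5679
 , CustomerName varchar(100) NOT NULL
 , StartDate date NOT NULL                          -- supports SCD type 2 versioning
 , EndDate date NULL
);
GO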

Tuesday, May 12, 2015

How to take a copy of database files while they are being used by SQL Server

I am sure that you have tried this at least once and have faced the below issue.


Generally, we cannot copy database files or log files when they are being used by SQL Server, simply when the SQL Server service is running. If we need to get a copy of MDFs and LDFs, what we normally do is stop the service and copy the files. However, this is not possible with production databases, specifically when databases are used 24x7. What is the solution for this?

There are many third-party applications and tools available for getting this done. Some are paid tools and some are free. I have used one with SQL Server 2005 and 2008; unfortunately, I cannot remember it now. While reading blogs, I came across this post: Backup/Copy Files that are "In Use" or "Locked" in Windows (Command Line), which is written on copying files. It explains how to use a tool called HoboCopy, which is specifically for backing up / copying files, but it can be used for copying our SQL Server files while they are being used.

In order to use HoboCopy, download the relevant file from: https://github.com/candera/hobocopy/downloads.

Once downloaded, open Command Prompt as an Administrator and type the command as follows;

HoboCopy "FileFolderPath" "DestinationFolderPath" FileName.

For example, if you want to get the AdventureWorks2014 database file copied, you would type;

HoboCopy "C:\Program Files\Microsoft SQL Server\MSSQL12,SQL2014_86\MSSQL\DATA" "C;\Test" AdventureWorks2014_Data.mdf




Note that, in order to run this command, you need to make sure Volume Shadow Copy and Microsoft Software Shadow Copy Provider services are running.

Monday, May 11, 2015

I want an execution plan optimized for my values, not for the values passed to parameters with the first execution.

Since I got a few questions on the posts written on stored procedures in the last two weeks, I thought to make a few more posts based on the questions asked. One question was: can we make sure that the stored procedure is optimized for one range of values, not for exceptions? If we need to get the stored procedure optimized for one set of values, then the instruction has to be passed during its creation. We should only do this if we know that there are different best plans for different values.

I have written a post called How to determine whether RECOMPILE is required? If so, at the creation or execution? and it shows the way of seeing whether different best plans exist for different values. Have a look at it if you want to know the way.

Let's see how we can optimize a procedure for our values, not for the values passed with the first execution. The code below creates a stored procedure and an index on OrderDate.

-- Clearing the cache  
DBCC FREEPROCCACHE  
GO  
  
USE AdventureWorks2014;  
GO  
  
-- Drop if it is exist  
IF OBJECT_ID('dbo.GetOrderDetails', 'P') IS NOT NULL  
 DROP PROC dbo.GetOrderDetails;  
GO  
  
-- Create the procedure  
CREATE PROC dbo.GetOrderDetails @FromDate datetime, @ToDate datetime  
AS  
    SELECT h.* 
 FROM Sales.SalesOrderHeader h
 WHERE h.OrderDate BETWEEN @FromDate AND @ToDate
 ORDER BY h.OrderDate
GO  

-- Create an index on OrderDate
CREATE INDEX IX_SalesOrderHeader_OrderDate ON Sales.SalesOrderHeader (OrderDate DESC)
GO

The next code executes the stored procedure with three different values. As you see, WITH RECOMPILE has been added to see whether the same plan or different plans are used for the executions.

EXEC dbo.GetOrderDetails '2014-06-29T00:00:00', '2014-06-30T00:00:00' WITH RECOMPILE;  
EXEC dbo.GetOrderDetails '2014-06-26T00:00:00', '2014-06-30T00:00:00' WITH RECOMPILE;  
EXEC dbo.GetOrderDetails '2014-06-01T00:00:00', '2014-06-30T00:00:00' WITH RECOMPILE;  
GO

Here are the plans generated.



As you see, the plan is the same for the first two executions, and a different plan is used for the third execution. The reason is that the first and second executions look for orders within a few days while the third one covers many days; hence the newly added index is used with a seek operation for the first two, and the clustered index is scanned for the third.

Assume that most requests made on this procedure are based on a small number of days and we need to optimize it for that. If so, we need to alter the procedure, instructing it to create a plan optimized for a sample range that covers a small number of days. If we do not do it and the first execution is based on a large number of days, the plan will be created for it, and consecutive executions will reuse that plan even for a small number of days, which is not the optimal plan for them.

Okay, how can we instruct it then? Here is the way.

-- Alter the procedure  
ALTER PROC dbo.GetOrderDetails @FromDate datetime, @ToDate datetime  
AS  
    SELECT h.* 
 FROM Sales.SalesOrderHeader h
 WHERE h.OrderDate BETWEEN @FromDate AND @ToDate
 ORDER BY h.OrderDate
 OPTION (OPTIMIZE FOR (@FromDate = '2014-06-29T00:00:00', @ToDate = '2014-06-30T00:00:00'))
GO 

-- Execute with a large range and see
EXEC dbo.GetOrderDetails '2014-01-01T00:00:00', '2014-06-30T00:00:00';  
GO

Here is the plan;


I have used the OPTIMIZE FOR hint, passing two values for the parameters. Note that the values are set for a small range; this instructs the engine to create the plan based on these values without considering the values submitted with executions. As you see, even when the procedure is executed with a wide range, the engine still uses the plan optimized for the small range.

Read more on this hint at: https://msdn.microsoft.com/en-us/library/ms181714.aspx.

Sunday, May 10, 2015

SQL Server Brain Basher of the Week #011 - DBCC

An interview question.

What does DBCC stand for?

When classes related to Administration are being conducted, I always ask this question and, as expected, I always get the same answer, which is wrong. Most think that it stands for Database Consistency Checker, but that is the oldest way of describing it.

The DBCC command family originally supported consistency-related commands. Version upgrades added more and more commands to this family, and they are not only for checking the consistency of our databases. Therefore, since SQL Server 2000, it has been referred to as Database Console Commands.
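
A few well-known members of the family illustrate the range; only the first one is about consistency checking.

DBCC CHECKDB ('AdventureWorks2014'); -- consistency checking
DBCC SQLPERF (LOGSPACE);             -- log space usage per database
DBCC FREEPROCCACHE;                  -- clears the plan cache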


Friday, May 8, 2015

How to determine whether RECOMPILE is required? If so, at the creation or execution?

When we know that the parameters we pass to a stored procedure vary widely, we can instruct the engine to recompile the procedure. But it does not mean we have to recompile just because the values vary; we need to make sure that there are different best plans for these values, because there can be an instance where a single plan is best for all values. Therefore we need a way of seeing whether there are different best plans for different values.

What is the simplest way? The easiest way is to execute the stored procedure multiple times with different values using the RECOMPILE option, with Actual Execution Plan enabled. This tells us whether the engine makes different plans for different executions, indicating whether we need the RECOMPILE option.

Here is the way;

-- Clearing the cache
DBCC FREEPROCCACHE
GO

USE AdventureWorks2014;
GO

-- Drop if it is exist
IF OBJECT_ID('dbo.GetOrderDetails', 'P') IS NOT NULL
 DROP PROC dbo.GetOrderDetails;
GO

-- Create the procedure
CREATE PROC dbo.GetOrderDetails @Number int
AS
    SELECT * FROM Sales.SalesOrderDetail 
    WHERE SalesOrderDetailID <= @Number
GO

-- Enable "Include Actual Execution Plan"
-- and execute all together
EXEC dbo.GetOrderDetails 1 WITH RECOMPILE;
EXEC dbo.GetOrderDetails 10000 WITH RECOMPILE;
EXEC dbo.GetOrderDetails 10 WITH RECOMPILE;
EXEC dbo.GetOrderDetails 1000 WITH RECOMPILE;
EXEC dbo.GetOrderDetails 100 WITH RECOMPILE;
GO

Let's have a look on plans generated;


Since this is a simple stored procedure, we do not see many different plans for the values passed. But if you see many different plans and the procedure is always executed with values that vary, it indicates that we need to add RECOMPILE at the creation.

If you see a common plan for 80-90% of sample executions and one or two different plans for only a few executions, it indicates that RECOMPILE is required at execution only when passing values similar to those executions.
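
For reference, this is how the creation-level option mentioned above would look for the sample procedure; WITH RECOMPILE on the procedure itself forces a fresh plan for every execution and no plan is cached for it.

-- Adding RECOMPILE at the creation (shown with ALTER as the procedure already exists)
ALTER PROC dbo.GetOrderDetails @Number int
WITH RECOMPILE
AS
    SELECT * FROM Sales.SalesOrderDetail 
    WHERE SalesOrderDetailID <= @Number
GO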


Thursday, May 7, 2015

EXECUTE....... WITH RECOMPILE: Does it discard the previously cached plan?

Not an advanced theory: everyone knows that the first execution of a stored procedure creates the best plan for that execution and keeps it in memory for the next execution. The plan is created based on the values passed to parameters with its first execution. This behavior is called Parameter Sniffing and it is considered a good thing as long as the cached plan is still good for the parameter values passed with consecutive executions. This behavior becomes bad when a completely odd value is passed for a parameter or the parameter value varies greatly from the initial value. In a situation like this, the cached plan is not the best for the execution, hence we force the SQL Server engine to recompile the plan with the new values.

Forcing is done with the RECOMPILE option. This option can be added to the stored procedure at its creation, resulting in recompilation with every execution, or can be added to the execute statement, resulting in recompilation only for the current execution. Now the question is, when we force the stored procedure to recompile and generate a new plan by adding the RECOMPILE option, does it discard the previously cached plan? Does it keep the new plan for the next execution?

Let's test this. The following code adds a stored procedure and executes it three times. Then it checks the cache.

-- Clearing the cache
DBCC FREEPROCCACHE
GO

USE AdventureWorks2014;
GO

-- Drop if it is exist
IF OBJECT_ID('dbo.GetOrderDetails', 'P') IS NOT NULL
 DROP PROC dbo.GetOrderDetails;
GO

-- Create the procedure
CREATE PROC dbo.GetOrderDetails @Number int
AS
    SELECT * FROM Sales.SalesOrderDetail 
    WHERE SalesOrderDetailID <= @Number
GO

-- Execute three times
EXEC dbo.GetOrderDetails 1;
GO 3

-- Seeing the cache
SELECT p.usecounts, p.cacheobjtype, p.objtype, t.text
FROM sys.dm_exec_cached_plans p
CROSS APPLY sys.dm_exec_sql_text(p.plan_handle) t
WHERE p.usecounts > 0 
 AND TEXT LIKE '%Sales.SalesOrderDetail%'
 AND TEXT NOT LIKE '%dm_exec_cached_plans%'
ORDER BY p.usecounts DESC;



As you see, the plan has been created with the first execution and it has been used three times. Now let's try to use a different plan for executing the procedure. Let's force the engine to recompile and check the cache again.

-- Execute with RECOMPILE
EXEC dbo.GetOrderDetails 1000 WITH RECOMPILE;
GO 

-- Checking the cache
SELECT p.usecounts, p.cacheobjtype, p.objtype, t.text
FROM sys.dm_exec_cached_plans p
CROSS APPLY sys.dm_exec_sql_text(p.plan_handle) t
WHERE p.usecounts > 0 
 AND TEXT LIKE '%Sales.SalesOrderDetail%'
 AND TEXT NOT LIKE '%dm_exec_cached_plans%'
ORDER BY p.usecounts DESC;


This shows the same plan and no new plan. What does it mean? If a new plan was created, why can we not see it? If no new plan was created, was the old plan used? If so, why do we not see an increase in usecounts for the plan? The answer is simple. RECOMPILE creates a new plan based on the passed values, but it does not cache it and it does not remove the old one from the cache. The newly created plan is only for this execution.
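
If the goal is to actually invalidate the cached plan rather than bypassing it for a single execution, one option (a sketch) is sp_recompile, which marks the procedure so that a new plan is compiled and cached at its next execution.

-- Mark the procedure for recompilation; the next execution compiles and caches a new plan
EXEC sp_recompile N'dbo.GetOrderDetails';
GO

EXEC dbo.GetOrderDetails 1000;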

Wednesday, May 6, 2015

How to check parameters values used for compiling the stored procedure?

It is a known fact that the initial plan of a stored procedure is created with its first execution. The SQL Server engine creates the plan based on the values passed for parameters with the stored procedure's first execution and places the plan in the cache. This plan will be reused with consecutive executions regardless of the values passed to parameters. This behavior is called Parameter Sniffing, and it can improve the overall performance of the stored procedure's execution as well as degrade it. For more info on this, read one of my old posts: Do we need to recompile Stored Procedures when an index is added or parameter value is vary greatly?

With consecutive executions, if you want to see the values that were used for compiling the procedure, or in other words, the parameter values used for the plan in the cache, what is the easiest way of seeing them? The simplest way is to get the Properties of the last component in the plan and look at the Parameter List property. Here is the way of doing it;

First, let's create a procedure and execute once.

USE AdventureWorks2014;
GO

-- Drop if it is exist
IF OBJECT_ID('dbo.GetOrderDetails', 'P') IS NOT NULL
 DROP PROC dbo.GetOrderDetails;
GO

-- Create the procedure
CREATE PROC dbo.GetOrderDetails @Number int
AS
    SELECT * FROM Sales.SalesOrderDetail 
    WHERE SalesOrderDetailID <= @Number
GO

-- Execute once. This adds the plan which is optimized for 
-- parameter value 1.
EXEC dbo.GetOrderDetails 1;

Now let's execute it with a different value enabling Include Actual Execution Plan.


Open the Execution Plan tab and click on SELECT.



The Properties window can be opened by either pressing F4 or right-clicking and selecting Properties. Once it is opened, you will see a property called Parameter List, and it holds the values used for compiling the stored procedure and the values used for the current execution.


Is this useful? Yes, if you are planning to recompile or want to know whether the plan is optimized for new executions, this holds some of the required information you need.
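
If you prefer a query over the Properties window, a sketch like the one below reads the same information from the cached plan XML; it assumes the standard showplan namespace and the parameter name @Number used above.

-- Reading the compiled parameter value from the cached plan
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT t.text
 , q.query_plan.value('(//ParameterList/ColumnReference[@Column="@Number"]/@ParameterCompiledValue)[1]', 'varchar(50)') AS ParameterCompiledValue
FROM sys.dm_exec_cached_plans p
 CROSS APPLY sys.dm_exec_query_plan(p.plan_handle) q
 CROSS APPLY sys.dm_exec_sql_text(p.plan_handle) t
WHERE t.text LIKE '%GetOrderDetails%'
 AND p.objtype = 'Proc';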


Tuesday, May 5, 2015

SQL Server exams are getting retired within the next 12 months

New exams are getting added and older exams are either getting revised or retired. Microsoft Learning has decided to retire some exams, including a few SQL Server exams, and has planned dates for the retirement. Here are the SQL Server exams under this retirement plan;




Note that, if one of the above exams is part of your certification, this does not mean that your certification is getting invalidated; it is still valid. If you have done one of these exams, do not worry, it will still display in your transcript even after the retirement. If you are planning to take one of these, do it before the retirement date.

For more info (to see other exams under this plan), visit: https://www.microsoft.com/learning/en-us/retired-certification-exams.aspx.

Sunday, May 3, 2015

SQL Server Brain Basher of the Week #010 - Best way to send datetime value

Exchanging data between systems is not an uncommon scenario and it has become a part of most businesses. When exchanging data, the structure and format of the output is sometimes determined before exchanging, and sometimes the destination just accepts it as it comes. Transferring date/time values is always involved with these exchanges and, in a way, it is the most common attribute we frequently troubleshoot as it introduces issues during the transfer; hence I thought to add something on it as this week's Brain Basher.

What is the best way of sending a datetime value to another system?

There are many ways of transferring datetime values. Transferring it as a string value, as a numeric value based on a predetermined format, or as a string value based on a predetermined format are some common methods. However, though we try to manage it our best, there is still a chance of the datetime value being interpreted incorrectly when transferring from one system to another. What is the solution for this? What if we can use a format that can be recognized by any database management system?

There is a way of passing datetime values formatted with a standard that can be recognized by almost all DBMSs. It is called the ISO 8601 standard. Almost all systems support functionality for retrieving stored date/time values under this standard. Once the value is received by the destination, the destination system can recognize the pattern without misinterpreting the value. This is how it has to be formatted: yyyy-MM-ddTHH:mm:ss(.mmm), for example 2015-05-03T10:50:00.000.



Here is a simple code that shows how to convert a datetime value to ISO 8601 and how to accept an ISO 8601 formatted value.

-- Converting to ISO 8601
SELECT CONVERT(char(23), GETDATE(), 126);

-- Retrieving ISO 8601 formatted value
DECLARE @StringValue varchar(30) = '2015-05-03T10:50:00.000'

-- This returns "May" recognizing middle part as the month
SELECT DateName(m, Convert(datetime, @StringValue));

I once discussed this in one of the sessions hosted at SS SLUG. Here is the post I made on it: http://dinesql.blogspot.com/2013/12/best-way-to-pass-datetime-values-to-sql.html.

Thursday, April 30, 2015

Incremental ETL: How to identify changes for fact data when no support from the source

"Data warehouse is a read-only database": this is one of the characteristics of data warehouse and we used to say "History cannot be changed" for supporting this characteristic. Therefore we always try our best not to bring changes into the data warehouse.

This was thoroughly considered and applied with traditional data warehousing because the cost of processing OLAP models was very high. As I see it, with modern implementations, this cost has gone down for two reasons: Tabular Model implementations instead of Multi-Dimensional Models, and user-specific models with Self-Service BI. Therefore, if it is required, changes can be brought into the data warehouse to some extent, considering both the cost and the benefits of it.

However, when we try to bring changes into the data warehouse, one major issue we face is less support from sources for finding changes in previous transactions that have already been loaded. Some sources maintain a timestamp indicating the changed date or a flag indicating whether the record has been modified. If there is no such thing in the source, changes cannot be easily identified. In this scenario, one way of identifying changes is checking and comparing each and every record, each and every column in the fact table, and seeing whether they are different. If the dataset is very large and the data warehouse is large too, this is not very practical. Therefore either we should stop taking changes once data is loaded to the data warehouse, or we should take them based on some agreed rules.

Here is one of the ways I have used. First of all, we need to set an acceptable time period for accepting changes from the source. The agreed period can be 1 week, 1 month or 3 months. However, this is based on certain factors such as the business logic involved with the transactions, complexity and volume. Once agreed, the next step is holding loaded data in the staging environment for the same period. If we have to expect changes for the last three months, we need to make sure that the staging database has the last three months of extracted data. In addition to that, we need to maintain an additional column for holding a checksum. That is what we use for comparing records between the source and staging, instead of comparing each and every column.

For example, assume that sales data related to FactSales in SalesDataWarehouse is loaded via a staging table called Sales. Considering the above requirement, the structure of the staging table must be created as below;

-- Staging table for holding sales
CREATE TABLE dbo.Sales
(
 SalesOrderID int NOT NULL
 , OrderDate datetime NOT NULL
 , ShipDate datetime NOT NULL
 , SalesOrderNumber varchar(25) NOT NULL
 , Product varchar(100) NOT NULL
 , ListPrice money NOT NULL
 , UnitPrice money NOT NULL
 , OrderQty int NOT NULL
 , LineTotal money NOT NULL
 , CheckSumValue int NOT NULL -- Checksum value from above columns
 , RecordType tinyint NOT NULL -- 1-New, 2-Modified, 3-Loaded
);
GO

-- Creating clustered index
CREATE UNIQUE CLUSTERED INDEX IX_Sales_SalesOrderID ON dbo.Sales (SalesOrderID);
GO

-- Creating an index on the Checksum column
CREATE INDEX IX_Sales_CheckSumValue ON dbo.Sales (CheckSumValue);
GO

When we load new data, for example yesterday's data, we load it into the staging table with RecordType = 1. Once inserted, we must generate the checksum value for the newly inserted records.

UPDATE dbo.Sales
 SET CheckSumValue = Checksum(SalesOrderID, OrderDate
      , ShipDate, SalesOrderNumber
      , Product, ListPrice, UnitPrice
      , OrderQty, LineTotal)
WHERE RecordType = 1;

After that, the ETL module can access the new records for loading the data warehouse. Once done, RecordType must be set to 3. Now, how do we handle changes for the last three months?

For that, we need to get the last three months of transactions from the source again and calculate the checksum for the extracted data. If SSIS is used for loading the staging, then we can load the data into a Data Flow and generate the new checksum there. If we use a method like bcp or T-SQL, it is better to load the data into another temporary table. Once we have the dataset with the new checksum, changed records can be identified by comparing the new checksum and the old checksum, and the records in the staging table can be updated as below;

UPDATE dbo.Sales
 SET OrderDate = t.OrderDate
  , ShipDate = t.ShipDate
  , SalesOrderNumber = t.SalesOrderNumber
  , Product = t.Product
  , ListPrice = t.ListPrice
  , UnitPrice = t.UnitPrice 
  , OrderQty = t.OrderQty
  , LineTotal = t.LineTotal
  , RecordType = 2
FROM dbo.Sales s
 INNER JOIN TempSales t
  ON s.SalesOrderID = t.SalesOrderID 
   AND s.CheckSumValue != Checksum(t.SalesOrderID, t.OrderDate
      , t.ShipDate, t.SalesOrderNumber
      , t.Product, t.ListPrice, t.UnitPrice
      , t.OrderQty, t.LineTotal);

This updates RecordType to 2, indicating that the record has changed. Once updated, another ETL can access these records and update FactSales in the data warehouse.

This reduces the complexity related to finding changes and the time and resources required for processing. However, since we use Checksum, in certain situations the same value might be generated even though there is a change. If you experience it, it is recommended to use HashBytes instead of Checksum.
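
As a sketch of that alternative: HashBytes returns varbinary, so the staging column would need to be defined as, say, varbinary(32) (here named HashValue, a hypothetical column) instead of int, and a delimiter keeps adjacent values from colliding when concatenated.

-- Generating a SHA2_256 hash instead of a checksum for newly loaded records
UPDATE dbo.Sales
 SET HashValue = HASHBYTES('SHA2_256'
   , CONCAT(SalesOrderID, '|', OrderDate, '|', ShipDate, '|', SalesOrderNumber
   , '|', Product, '|', ListPrice, '|', UnitPrice, '|', OrderQty, '|', LineTotal))
WHERE RecordType = 1;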


Tuesday, April 28, 2015

ETL or ELT? Conversation I had with a participant

Everyone is familiar with the term "ETL" and everyone knows the processes related to it. This concept refers to extracting data from one or more heterogeneous sources, transforming it into a structure that the destination can accept while applying all possible validations and business logic, and loading the destination. Is this always ETL? Can I refer to this as ELT too? I referred to this process as ELT in one of my recent presentations while describing Big Data and Microsoft APS (Analytics Platform System), and of course I was questioned by one of the participants. Here is the conversation;

Participant: Hey Dinesh, it is a great session, learnt a lot from it, I need a clarification on an area you were discussing, have a minute?

Me: Thanks, sure, what is the area?

Participant: I have been implementing BI solutions for the last 4 years and have done many different implementations on the ETL side. During your presentation, you mentioned the ETL term as ELT, not once but twice, so, is ELT a known term or did you just refer to it that way?

Me: Yes, I referred to it, and to your first question, whether it is a known and standard term, I am 100% sure that it is a known term but I do not know whether it is a standard term. But I believe that the process can be referred to as ELT based on the way it has been implemented.

Participant: Have you seen or heard this term before? Or did you just create it :) ?

Me: :), I did not, it is being referred to by many experts; I have seen it in videos and articles too. It is not an uncommon one, but I think it is a bit new to the industry, or maybe it is not, but no one bothered to refer to the process as ELT.

Participant: What does this ELT exactly refer to? Is it all about transforming loaded data, violating the traditional rule?

Me: Yes it is, it refers to applying transformation after loading the destination, but it does not mean that we are violating any rules or standards. Think about loading data into a staging environment. We load data, do some validation, and apply some business logic to the loaded data. If we just isolate "loading of staging", considering the staging as the destination, what we have done is ELT, not ETL.

Participant: Yes, I can understand, I have done the same too. How does this involve Big Data?

Me: I was explaining Microsoft Analytics Platform System. With PDW (Parallel Data Warehouse Edition) or with a standard Hadoop environment, we distribute the process into multiple nodes to get it done quickly. So, if the data warehouse is implemented with PDW, meaning millions of records, we can load data into a distributed table in the PDW database as the first step and then process it, applying complex business logic, rather than doing that during the loading phase. This speeds up the whole process. This is not extract, transform and load; this is extract, load and transform. And we can simply refer to this as ELT.

Participant: Great, good explanation. Now it is clear. Thanks Dinesh 

Me: Thank you too, I always prefer to hear questions, it improves my knowledge and encourages me to do more exploration.


Sunday, April 26, 2015

SQL Server Brain Basher of the Week #009

When we need to create a table using a SELECT statement, the option we use is SELECT .... INTO. No arguments, for most situations, this is the best way of creating a table based on the data we load. However;

Can we create a table using a CREATE TABLE statement and load data using a SELECT statement, executing all of it as a single statement, in any Microsoft SQL Server database edition?

This is not possible with most of the editions we generally use, but it is available in one of them: PDW (Parallel Data Warehouse). This gives greater flexibility and more performance, specifically for ELT operations. It is really smart and can be used in many cases, though there are a few restrictions.

Here is a sample code;
CREATE TABLE SalesTemp
WITH (DISTRIBUTION = HASH(DateKey))
AS SELECT * FROM FactSales;


Thursday, April 23, 2015

Do we need a staging database for warehousing projects?

Deciding on an architecture for a data warehouse project and applying it is a challenge. The designer, or the architect, has to decide many things: what is to be included, what is to be avoided, and what, when and where the processes have to be implemented. There are many decisions to be taken, and staging is another thing to be decided. This post speaks about it.

Do we always need a staging environment? Is it a must that it be implemented as a part of the data warehouse solution? Or can we simply ignore it?


Staging is not something that should be treated as use-it-if-you-like or ignore-it-if-you-don't-like. The necessity of staging comes down to a few factors. Here are some factors that can be used for determining whether you need a staging environment for your solution.

  • Source data availability during non-processing hours
    Usually, we schedule the ETL process during off-peak hours. However, if the acquisition window for data extraction from the source is during peak hours, then there is no option: data has to be extracted and stored in a temporary environment until the process starts. In a situation like this, it is better to use a staging database for holding data temporarily.

  • Different partial-data acquisition windows for data extraction.
    There are some instances where you need a dataset for running an ETL that loads some facts, but the full dataset is not available for one-time extraction. Assume that the source processes some data and loads the processed data into another table. If this process happens every two hours and it flushes the existing data before loading new data, and if this is what we need for loading facts, we need to continuously access the source table (that contains the processed data) before the next process and hold the data until we have the complete set. This requires a staging environment for holding extracted data until the dataset is ready for running the ETL.
  • Source pushes data, we do not pull.
    Some sources do not support establishing connections to them and instead push the required data out. Most of the time, sources dump data as either CSV or XML files, but they might write data directly to a given destination. In order to support this, a staging environment is needed because we never allow sources to write data directly into the data warehouse.
  • Multiple data acquisition windows for data extraction.
    Multiple data acquisition windows are common with single-source extractions as well as multi-source extractions. It is common to extract from multiple sources (or multiple entities in the same source) and process them together as a part of the ETL operation. Since data extraction happens with multiple acquisition windows and everything needs to be processed at a scheduled time, the extracted data has to be held until the ETL starts. Because of this, a staging environment is required for holding the data.
  • Simple transformations.
    Generally, transformation happens as part of the standard ETL process, but some transformations, including validation, can be done at a different level, either at the source level or at an intermediate level before starting the standard ETL process. A good example is combining all name columns such as first name, middle name and last name to form a new column named Full Name. If this type of transformation cannot be done at the source level, the best place is staging. Converting string-typed dates to date types, removing nulls and correcting numeric values are also common here.
  • Auditing purposes.
    What if we need to keep records of what we extract from the sources for auditing purposes? Although the data warehouse contains everything extracted, it cannot be considered the exact set extracted because transformation makes changes to it. Therefore, the best option is keeping the extracted data in the staging environment for a certain period.
  • Identifying data changes in sources.
    It is always better to extract only the changes from the sources in order to reduce the load and process only them. But what if the source does not offer any way to identify changes and forces us to extract everything? A situation like this is complex, but we often use a staging database to handle it to some extent. For example, staging can hold the last three months of fact data. When extracting, the ETL extracts the last three months of data from the source and compares the extracted dataset with the data in staging, rather than with the data warehouse, to identify changes. Once identified, the filtered dataset from staging can be used as the source for continuing the ETL operation.
One more thing to remember: staging does not necessarily need to be a database. It can be a CSV file, an XML file or an SSIS raw file too.

Tuesday, April 21, 2015

What is a bespoke provider?

Have you heard about this provider before? When we need to perform data extraction from another source, we need a provider for connecting to the source. Most of the time we use either ODBC or OLE DB. But some data sources use proprietary storage (or custom storage) which needs custom code or a custom module for connecting to the source. This is called a bespoke provider.

The term was initially used for custom-tailored clothes, and it is pronounced bee-SPOHK.

In many cases, if the source or the application supports extracting data as a report, saving it as either a CSV file or an XML file, we tend to use that method for data extraction rather than going through a bespoke provider. However, if the provider given is smart and flexible, it is better to use it.