Thursday, April 30, 2015

Incremental ETL: How to identify changes for fact data when there is no support from the source

"Data warehouse is a read-only database": this is one of the characteristics of data warehouse and we used to say "History cannot be changed" for supporting this characteristic. Therefore we always try our best not to bring changes into the data warehouse.

This was thoroughly considered and applied with traditional data warehousing because the cost of processing OLAP models was very high. As I see it, with modern implementations this cost has gone down for two reasons: Tabular Model implementations instead of Multi-Dimensional Models, and user-specific models with Self-Service BI. Therefore, if required, changes can be brought into the data warehouse to some extent, considering both the cost and the benefit of doing so.

However, when we try to bring changes into the data warehouse, one major issue we face is limited support from sources for finding changes in previous transactions that have already been loaded. Some sources maintain a timestamp indicating the changed date, or a flag indicating whether the record has been modified. If the source offers nothing like that, changes cannot be easily identified. In this scenario, one way of identifying changes is checking and comparing every record, every column in the fact table, and seeing whether they are different. If the dataset is very large and the data warehouse is large too, this is not very practical. Therefore we should either stop taking changes once data is loaded to the data warehouse, or take them under some agreed rules.

Here is one of the ways I have used. First of all, we need to set an acceptable time period for accepting changes from the source. The agreed period can be 1 week, 1 month or 3 months, based on factors such as the business logic involved with transactions, complexity and volume. Once agreed, the next step is holding loaded data in the staging environment for the same period. If we have to expect changes for the last three months, we need to make sure that the staging database holds the last three months of extracted data. In addition to that, we need to maintain an additional column for holding a checksum. That is what we use for comparing records between the source and staging, instead of comparing each and every column.

For example, assume that Sales data related to FactSales in SalesDataWarehouse is loaded via a staging table called Sales. Considering the above requirement, the staging table must be created as below;

-- Staging table for holding sales
CREATE TABLE dbo.Sales
(
 SalesOrderID int NOT NULL
 , OrderDate datetime NOT NULL
 , ShipDate datetime NOT NULL
 , SalesOrderNumber varchar(25) NOT NULL
 , Product varchar(100) NOT NULL
 , ListPrice money NOT NULL
 , UnitPrice money NOT NULL
 , OrderQty int NOT NULL
 , LineTotal money NOT NULL
 , CheckSumValue int NOT NULL -- Checksum value from above columns
 , RecordType tinyint NOT NULL -- 1-New, 2-Modified, 3-Loaded
);
GO

-- Creating clustered index
CREATE UNIQUE CLUSTERED INDEX IX_Sales_SalesOrderID ON dbo.Sales (SalesOrderID);
GO

-- Creating an index on the Checksum column
CREATE INDEX IX_Sales_CheckSumValue ON dbo.Sales (CheckSumValue);
GO

When we load new data, for example yesterday's data, we load it into the staging table with RecordType = 1. Once inserted, we must generate the checksum value for the newly inserted records.

UPDATE dbo.Sales
 SET CheckSumValue = Checksum(SalesOrderID, OrderDate
      , ShipDate, SalesOrderNumber
      , Product, ListPrice, UnitPrice
      , OrderQty, LineTotal)
WHERE RecordType = 1;

After that, the ETL module can access the new records for loading the data warehouse. Once done, RecordType must be set to 3. Now, how do we handle changes for the last three months?
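Before getting to that, here is a minimal sketch of the flagging step just described; it assumes the ETL picks up all records with RecordType = 1 in one pass.

-- Flag newly loaded records as processed
UPDATE dbo.Sales
 SET RecordType = 3 -- 3-Loaded
WHERE RecordType = 1;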

For that, we need to get the last three months of transactions from the source again and calculate the checksum for the extracted data. If SSIS is used for loading the staging, we can load the data into a Data Flow and generate the new checksum there. If we use a method like bcp or TSQL, it is better to load the extract into another temporary table. Once we have the dataset with the new checksum, records can be checked by comparing the new checksum against the old one, and the changed records in staging can be updated.
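As a sketch, and assuming a TSQL-based load, the temporary table holding the re-extracted three months of data could simply mirror the staging columns; the name TempSales is an assumption.

-- Temporary table for the re-extracted three months of data
CREATE TABLE dbo.TempSales
(
 SalesOrderID int NOT NULL
 , OrderDate datetime NOT NULL
 , ShipDate datetime NOT NULL
 , SalesOrderNumber varchar(25) NOT NULL
 , Product varchar(100) NOT NULL
 , ListPrice money NOT NULL
 , UnitPrice money NOT NULL
 , OrderQty int NOT NULL
 , LineTotal money NOT NULL
);
GO

Once dbo.TempSales is populated, changed records in the staging table can be flagged as below;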

UPDATE s
 SET OrderDate = t.OrderDate
  , ShipDate = t.ShipDate
  , SalesOrderNumber = t.SalesOrderNumber
  , Product = t.Product
  , ListPrice = t.ListPrice
  , UnitPrice = t.UnitPrice
  , OrderQty = t.OrderQty
  , LineTotal = t.LineTotal
  -- keep the stored checksum in sync with the new values
  , CheckSumValue = Checksum(t.SalesOrderID, t.OrderDate
      , t.ShipDate, t.SalesOrderNumber
      , t.Product, t.ListPrice, t.UnitPrice
      , t.OrderQty, t.LineTotal)
  , RecordType = 2
FROM dbo.Sales s
 INNER JOIN dbo.TempSales t
  ON s.SalesOrderID = t.SalesOrderID 
   AND s.CheckSumValue != Checksum(t.SalesOrderID, t.OrderDate
      , t.ShipDate, t.SalesOrderNumber
      , t.Product, t.ListPrice, t.UnitPrice
      , t.OrderQty, t.LineTotal);

This sets RecordType to 2, indicating that the record has changed. Once updated, another ETL can access these records and update FactSales in the data warehouse.

This reduces the complexity of finding changes as well as the time and resources required for processing. However, since we use CHECKSUM, in certain situations the same value might be generated for a record even though it has changed. If you experience that, it is recommended to use HASHBYTES instead of CHECKSUM.
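As a rough sketch of that alternative, the checksum column would need to be varbinary(32) rather than int (the column name HashValue and the SHA2_256 algorithm are assumptions here), and the values are concatenated with a separator so that different column combinations do not produce the same input string.

-- Populating a hash instead of a checksum for new records
UPDATE dbo.Sales
 SET HashValue = HASHBYTES('SHA2_256'
      , CONCAT(SalesOrderID, '|', OrderDate, '|', ShipDate, '|'
      , SalesOrderNumber, '|', Product, '|', ListPrice, '|'
      , UnitPrice, '|', OrderQty, '|', LineTotal))
WHERE RecordType = 1;

The change-detection UPDATE would then compare HashValue against the newly computed HASHBYTES value in the same way the checksum is compared above.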


Tuesday, April 28, 2015

ETL or ELT? Conversation I had with a participant

Everyone is familiar with the term "ETL" and everyone knows about the processes related to it. The concept refers to extracting data from one or more heterogeneous sources, transforming it into a structure the destination can accept while applying all possible validations and business logic, and loading the destination. Is this always ETL? Can I refer to it as ELT too? I referred to this process as ELT in one of my recent presentations while describing Big Data and Microsoft APS (Analytics Platform System), and of course I was questioned by one of the participants. Here is the conversation;

Participant: Hey Dinesh, it was a great session, I learnt a lot from it. I need a clarification on an area you were discussing, have a minute?

Me: Thanks, sure, what is the area?

Participant: I have been implementing BI solutions for the last 4 years and have done many different implementations on the ETL side. During your presentation, you referred to the ETL term as ELT, not once but twice, so, is ELT a known term or did you just coin it?

Me: Yes, I referred to it, and to your first question, whether it is a known and standard term: I am 100% sure that it is a known term but I do not know whether it is a standard term. But I believe that the process can be referred to as ELT based on the way it has been implemented.

Participant: Have you seen or heard this term before? Or did you just create it :) ?

Me: :), I did not. It is referred to by many experts; I have seen it in videos and articles too. It is not an uncommon term, but I think it is a bit new to the industry, or perhaps it is not, but no one bothered to refer to the process as ELT.

Participant: What does ELT exactly refer to? Is it all about transforming loaded data, violating the traditional rule?

Me: Yes, it refers to applying transformations after loading the destination, but it does not mean that we are violating any rules or standards. Think about loading data into a staging environment. We load data, do some validation, and apply some business logic to the loaded data. If we just isolate the "loading of staging", considering the staging as the destination, what we have done is ELT, not ETL.

Participant: Yes, I can understand, I have done the same too. How does this relate to Big Data?

Me: I was explaining Microsoft Analytics Platform System. With PDW (Parallel Data Warehouse Edition) or with a standard Hadoop environment, we distribute the process across multiple nodes to get it done quickly. So, if the data warehouse is implemented with PDW, meaning millions of records, we can load data into a distributed table in the PDW database as the first step and then process it, applying complex business logic, rather than doing that during the loading phase. This speeds up the whole process. This is not extract, transform and load; this is extract, load and transform. And we can simply refer to it as ELT.

Participant: Great, good explanation. Now it is clear. Thanks Dinesh 

Me: Thank you too, I always prefer to hear questions, it improves my knowledge and encourages me to do more exploration.


Sunday, April 26, 2015

SQL Server Brain Basher of the Week #009

When we need to create a table using a SELECT statement, the option we use is SELECT ... INTO. No arguments, for most situations this is the best way of creating a table based on the data we load. However;

Can we create a table using a CREATE TABLE statement and load data using a SELECT statement, executing them all as a single statement, in any Microsoft SQL Server edition?

This is not possible with most of the editions we generally use, but it is available with one of them: PDW (Parallel Data Warehouse). This gives greater flexibility and better performance, specifically for ELT operations. It is really smart and can be used in many cases, though there are a few restrictions.

Here is a sample code;
CREATE TABLE SalesTemp
WITH (DISTRIBUTION = HASH(DateKey))
AS SELECT * FROM FactSales;


Thursday, April 23, 2015

Do we need a staging database for warehousing projects?

Deciding on an architecture for a data warehouse project and applying it is a challenge. The designer, or the architect, has to decide many things: what is to be included, what is to be avoided, and what, when and where processes have to be implemented. Many decisions have to be taken, and staging is another one of them. This post speaks about it.

Do we always need a staging environment? Is it a must that has to be implemented as a part of data warehouse solution? Or can we simply ignore it?


Staging is not something to be treated as use-it-if-you-like or ignore-it-if-you-don't-like. The necessity of staging comes from a few factors. Here are some factors that can be used for determining whether you need a staging environment for your solution.

  • Source data availability during non-processing hours
    Usually, we schedule the ETL process during off-peak hours. However, if the acquisition window for data extraction from the source falls during peak hours, there is no option: data has to be extracted and stored in a temporary environment until the process starts. In a situation like this, it is better to use a staging database for holding data temporarily.

  • Different partial-data acquisition windows for data extraction.
    There are instances where you need a dataset for running an ETL that loads some facts, but the full dataset is not available for a one-time extraction. Assume that the source processes some data and loads the processed data into another table. If this happens every two hours and the process flushes existing data before loading new data, and if this is what we need for loading facts, we need to continuously read the source table (that contains processed data) before the next run and hold the data until we have the complete set. This requires a staging environment for holding extracted data until the dataset is ready for running the ETL.
  • Source pushes data, we do not pull.
    Some sources do not support establishing connections to them and instead push the required data out. Most of the time, such sources dump data as either CSV or XML files, but they might write data directly to a given destination. In order to support this, a staging environment is needed because we never allow sources to write data directly into the data warehouse.
  • Multiple data acquisition windows for data extraction.
    Multiple data acquisition windows are common with single-source extractions as well as multi-source extractions. It is common to extract from multiple sources (or multiple entities in the same source) and process them together as part of the ETL operation. Since data extraction happens over multiple acquisition windows and everything needs to be processed at a scheduled time, extracted data has to be held until the ETL starts. Because of this, a staging environment is required for holding data.
  • Simple transformations.
    Generally, transformation happens as part of the standard ETL process, but some transformations, including validation, can be done at a different level, either at the source level or at an intermediate level before starting the standard ETL process. A good example is combining all name columns such as first name, middle name and last name into a new column named Full Name. If this type of transformation cannot be done at the source level, the best place for it is staging. Converting string-typed dates to a date type, removing NULLs and correcting numeric values are common examples (see the sketch after this list).
  • Auditing purposes.
    What if we need to keep a record of what we extract from the sources for auditing purposes? Although the data warehouse contains everything extracted, it cannot be considered the exact set extracted because transformation makes changes to it. Therefore, the best option is keeping the extracted data in the staging environment for a certain period.
  • Identifying data changes in sources.
    It is always better to extract only the changes from the sources, in order to reduce the load and process only what is needed. But what if the source does not offer any way to identify changes and forces us to extract everything? A situation like this is complex, but we often use a staging database to handle it up to some extent. For example, staging can help us hold at least three months of data for facts. When extracting, the ETL extracts the last three months of data from the source and does the comparison between the extracted dataset and the data in staging, rather than comparing the extracted dataset with the data warehouse, for identifying changes. Once identified, the filtered dataset from staging can be used as the source for continuing the ETL operation.
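Here is a minimal sketch of the kind of simple transformation mentioned above, run against a staging table before the standard ETL starts; the table dbo.StagingCustomer and its columns are assumptions for illustration only.

-- Simple transformations done in staging before the standard ETL starts
UPDATE dbo.StagingCustomer
 SET FullName = LTRIM(RTRIM(
       FirstName + ' ' + ISNULL(MiddleName + ' ', '') + LastName))
  , DateOfBirth = CONVERT(date, DateOfBirthString, 101) -- mm/dd/yyyy string to date
  , CreditLimit = ISNULL(CreditLimit, 0); -- remove NULLs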
One more thing to remember: staging does not necessarily need to be a database. It can be a CSV file, an XML file or an SSIS Raw file too.

Tuesday, April 21, 2015

What is a bespoke provider?

Have you heard about this provider before? When we need to perform data extraction from another source, we need a provider for connecting to the source. Most of the time we use either ODBC or OLE DB. But some data sources use proprietary storage (or custom storage) which needs custom code or a custom module for connecting to the source. This is called a bespoke provider.

The term was initially used for custom-tailored clothes and it is pronounced bee-SPOHK.

In many cases, if the source or the application supports extracting data as a report, saving it as either a CSV or an XML file, we tend to use that method for data extraction rather than going through a bespoke provider. However, if the provider given is smart and flexible, it is better to use it.

Saturday, April 18, 2015

SQL Server Brain Basher of the Week #008

When you talk to your database, you use a language that is understandable to the database component you access and specific to the vendor. There are multiple vendor-specific languages such as T-SQL, PL/SQL and SQL PL for Microsoft SQL Server, Oracle and IBM DB2 databases respectively. Although there are different vendor-specific implementations, the general categorization of the database language, in terms of command types and operation types, is as follows;

DML
DDL
DCL
TCL
DQL

Do you know all of these languages? Can you describe what these acronyms stand for and their purpose? Yes, that is the Brain Basher of the week.

Here is the explanation of each;
  • DML - Data Manipulation Language - This language is used for manipulating data stored in tables. Operations such as inserting data and updating data are done with this language, and the SQL Server specific commands that fall under it are INSERT, UPDATE and DELETE.
  • DDL - Data Definition Language - This helps us define the structures used for holding data. It is used for operations like creating databases and creating tables, and example commands for this language are CREATE DATABASE, CREATE TABLE and CREATE VIEW.
  • DCL - Data Control Language - This language is mainly used for controlling access permissions on objects created in the database and the operations that can be performed on them. Operations such as granting a user permission to update records in a table or allowing a group to execute a stored procedure, with commands such as GRANT SELECT and GRANT ALTER, are examples of this.
  • TCL - Transaction Control Language - Operations and commands related to transactions come under this language. Starting a transaction with BEGIN TRAN and ending a transaction with either ROLLBACK TRAN or COMMIT TRAN are part of this language.
  • DQL - Data Query Language - This represents the SELECT command that is used for retrieving data from the database.
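Here is a small sketch putting one command from each category together; the table name and the user name Jane are assumptions for illustration.

-- DDL: define a structure
CREATE TABLE dbo.DemoTable (Id int NOT NULL PRIMARY KEY, Name varchar(50) NOT NULL);

-- DML: manipulate data
INSERT INTO dbo.DemoTable (Id, Name) VALUES (1, 'Sample');

-- DQL: query data
SELECT Id, Name FROM dbo.DemoTable;

-- DCL: control access (assumes a database user named Jane exists)
GRANT SELECT ON dbo.DemoTable TO Jane;

-- TCL: control a transaction
BEGIN TRAN;
 UPDATE dbo.DemoTable SET Name = 'Updated' WHERE Id = 1;
COMMIT TRAN;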

Thursday, April 16, 2015

Reporting Services Multi-Value parameter and Oracle database

Parameterization is not a difficult implementation with Reporting Services, as most of the code is automatically added either via the data source dialog box or the wizard. It adds the necessary code as well as the report parameters. It handles not only single-select parameters but also multi-select or multi-value parameters with the appropriate filtering code. However, this becomes a little complex when parameters are loaded from different sources. It becomes more complex (for me :)) when the selected values of parameters have to be passed to Oracle, as the main source of the report is Oracle.

I had a couple of parameters loaded from a source, to be precise, from Analysis Services dimensions, and the selected values had to be passed to a function implemented in an Oracle package for loading the main dataset. A similar implementation was required for a few more reports, hence I looked for a generic way to handle it. Searching opened up many ways of doing this; some are for specific scenarios, some are shortcuts, some are combinations of different approaches. I finally figured out a way to handle the situation, and assuming the solution will be useful for others, I am posting it.

Note that the way I have implemented the Oracle objects is based on knowledge gathered from the internet without studying Oracle in a structured and ordered manner, hence this may not be the best way of implementing it.

Okay, let's talk about the scenario again. I have two parameters, let's call them rpProduct and rpCountry, and "Allow multiple values" is checked on both. They are loaded from Analysis Services dimensions. The main dataset is loaded from an Oracle function called LoadSales which is in a package called pkgSales. The WHERE clause of the SELECT statement has to handle the values coming from the two parameters, and it is better to have a generic way, as the same implementation is required for multiple reports.

This is what I did. Initially, I created a type in the Oracle database for holding multiple values. It is a table type, and I used varchar2(200) as the size of an individual item.

create or replace type t_string_table is table of varchar2(200);

Then I created a function that can be used for converting comma-separated string values into a collection. As you see, its return type is the type created with the above code. I used some of the code for this from: http://www.oratechinfo.co.uk/delimited_lists_to_collections.html

create or replace function Convert_stringlist_to_table(string_list in varchar2) return t_string_table is

   l_string varchar2(32767) := string_list || ',';
   l_comma_index pls_integer;
   l_index pls_integer := 1;
   l_tab t_string_table := t_string_table();
begin
  
         loop
           l_comma_index := INSTR(l_string, ',', l_index);
           EXIT WHEN l_comma_index = 0;
           l_tab.EXTEND;
           l_tab(l_tab.COUNT) := SUBSTR(l_string, l_index, l_comma_index - l_index);
           l_index := l_comma_index + 1;
         end loop;
      return l_tab;
end Convert_stringlist_to_table;

Next, I changed the SELECT of the LoadSales function. The function accepts two parameters, one for products and the other for countries. Values are expected to be comma-separated. As you see, the values are converted to collections using the function written above before being used in the WHERE clause.

  function LoadSales
    (v_products in varchar2, v_countries in varchar2)
     return sys_refcursor
     is cur sys_refcursor;
  begin
    
    open cur for
      select {column list}
      from {table list}
      inner join .......
      where country_name in (select column_value  from TABLE(cast (Convert_stringlist_to_table(v_countries) as t_string_table)))
            and product_name in (select column_value  from TABLE(cast (Convert_stringlist_to_table(v_products) as t_string_table)));
    
    return cur;
  end;

The last step is configuring the dataset parameters. Since we need to send the selected values of the report parameters as comma-separated string values, the expression should be written as;

=Join(Parameters!rpProduct.Label, ",")


That is all. And the good thing is, the whole thing can be reused with all string-type parameters related to other reports that require the same implementation. You might have to change the code if you expect commas within parameter values.


Wednesday, April 15, 2015

What pushes SMP Architecture data warehouse to MPP Architecture

We have been developing data warehouses, centralizing enterprise data and addressing business intelligence requirements. Generally, almost all data warehouses built have been based on the traditional architecture called SMP: Symmetric Multi-Processing. Even though we use different design strategies for improving performance and managing volume efficiently, the necessity of scaling up often comes up. No arguments, without much consideration of the related factors, we tend to add more resources, spending more money, to address the requirement. But at a certain level we need to decide, we need to understand, that the existing architecture is not sufficient for continuing; it needs a change, SMP to MPP.

What is SMP architecture? It is a tightly coupled multi-processor architecture in which processors share resources, connecting to a single system bus. With SMP, the system bus limits scaling up beyond a certain point, and when the number of processors and the data load increase, the bus can become overloaded and a bottleneck can occur.

MPP, Massively Parallel Processing, is based on a shared-nothing architecture. An MPP system uses multiple servers called nodes which have dedicated, reserved resources and executes distributed queries with the nodes working independently, offering much better performance than SMP.


How do we know the boundary, or what factors can be used for determining the necessity of MPP? Here are some; these will help you to decide.

Here is the first one;
This is all about data growth. Yes, we expect an acceptable data growth with data warehousing, but if it increases drastically, and if we need to continuously plug in more and more storage, it indicates a necessity for MPP. We are not talking about megabytes or gigabytes but terabytes or more. Can't we handle the situation just by adding storage? Yes, it is possible, but there will definitely be a limit on it. Not only that, the cost goes up too. We do not see this limitation with MPP and, in a way, adding additional storage after the initial implementation might not be as expensive as with SMP.

Here is the second;
If somebody talks about BI today, the aspect of real-time or near-real-time is definitely an area of concern. A traditional implementation of a data warehouse manages this up to some extent but not fully, mainly because of the capacity, the loading and the complexity of the queries. Generally, with the Microsoft platform, we use SSIS for data loading and de-normalized, semi-normalized tables designed either as general tables or star/snowflake-structured tables for holding data. Assume that a user requests real-time data and the same fact table that contains billions of records has to be accessed; then the performance of the query might not be at the expected level. With MPP, since data can be distributed across multiple nodes, performance on data retrieval is definitely faster and real-time queries are handled efficiently.

Third one;
A traditional data warehouse requires structured, in other words relationally formatted, data. However, modern BI is not just based on this; data with unknown structures is not rare and is often required for analysis. How do we handle this? One data warehouse for known, structured data and another for unstructured data? Even if we maintain two data warehouses, how can an analysis be performed combining the two? Can traditional tools attached to the existing architecture be used for combining them efficiently, processing them fast and producing the required result? No, it is not possible, and that means it is high time for MPP. This does not mean that MPP handles all these areas, but it supports them. MPP helps to process unstructured data much more efficiently than SMP, and the Microsoft platform allows combining structured and unstructured data with a user-friendly interface using its solution which is based on MPP.

Here is the fourth:
Mostly, we extract data from on-premises data storage, and traditional ETLing handles this well. However, data generation is not limited to on-premises applications in modern implementations; a lot of important data is generated by cloud applications. In a way, this increases the load as well as the complexity. And sometimes this changes traditional ETL into ELT. MPP architecture has the capability to handle these complexities and improve the overall performance, hence this can be considered as another reason for moving from SMP to MPP.

There can be some more reasons, but I think these are the significant ones. Please comment if you see more on this.

Sunday, April 12, 2015

SQL Server Brain Basher of the Week #007

Not a new one, but I still do not get proper answers when a question is asked on this. See whether you can select the accurate statement(s) before reading the rest. This is all about the "GO" keyword we use with TSQL.


What do you think? Let's go through them one by one. GO is not a TSQL statement. It is a command only recognized by sqlcmd and Management Studio. Based on this fact, the 1st statement is false and the 3rd statement is true. If you execute TSQL code that contains the GO keyword through C#.Net, you will get an error because the engine cannot recognize the keyword.

GO is treated as the batch separator in Management Studio and the SqlCmd tool. This makes the 2nd statement true. All statements either from the top of the script or since the last GO statement up to the next GO statement are considered one batch and will be executed as a single batch. Objects like variables cannot be shared among batches. Here is an example;

DECLARE @MyVariable int = 100;

-- This statement runs without any issue
SELECT @MyVariable;
GO -- Batch separator

-- This statement throws an error:
-- "Must declare the scalar variable "@MyVariable"."
SELECT @MyVariable;

Can we replace GO with any other word? Yes, it is possible. This option is available under the Options menu item that comes under the Tools menu in Management Studio. Considering this, the fourth statement is true too.


Let's form a team for a "Business Intelligence" project. Who do we need?

When forming a team for a Business Intelligence or Data Warehousing project, it is always advisable to fill the required roles with the best people and make sure they are fully allocated until the project is delivered, at least the first version. There are not many differences from general IT projects, but filling the roles with the right person and the right attitude should be thoroughly considered, organized and managed. This does not mean that other IT projects do not require the same authority, but with my 14 years of IT experience, I still believe that a Business Intelligence and Data Warehousing project needs extra consideration.

How do we form the team? Who do we recruit for the team? What are the roles that would play a part in this project? Before selecting people, before forming the team, we should know what types of roles are required for the project and the responsibilities borne by each role. That is what this post speaks about; here are all the possible roles that would be required for a Business Intelligence and Data Warehousing project.

Project team: Pic taken from http://www.usability.gov/how-to-and-tools/methods/project-team.html
  • A Project Manager
    This role requires a person who possesses good communication and management skills. I always prefer a person with a technical background because it minimizes arguments on some of the decisions, specifically on resource allocation and the timeframes set. The project manager coordinates project tasks and schedules them based on available resources, and he needs to make sure that deliveries are done on time and within budget. The project manager has to be a full-time member initially and at the last part of the project, but can play as a part-time member in the middle of the project.
  • A Solution Architect
    This role requires a person who possesses at least 10 years of experience in enterprise business intelligence solutions. The architect's knowledge of database design, data warehouse design, ETL design, model design and presentation layer design is a must, and she/he should understand the overall process of the business intelligence project and be able to design the architecture for the entire solution. She/he does not need to be a full-time member, but the project requires her/him as a full-time member from the beginning to the completion of the design phase. This role becomes a part-time member once the design is completed, as she/he is responsible for the entire solution.
  • A Business Analyst
    The Business Analyst plays the key role of requirement gathering and documentation. This person should have good communication and writing skills. In addition to that, experience in BI-project-related requirement gathering processes and the related documents is needed. The Business Analyst is responsible for the entire requirement, getting confirmation and sign-off, and delivery in terms of the requirements agreed. This is a full-time member role and is required until project completion.
  • A Database Administrator
    The Database Administrator does the physical implementation of the data warehouse and maintains the data warehouse, the ETL process and some of the models implemented. He needs a good knowledge of the platform used and the specific design patterns used for data warehousing and ETLing. Generally, the role of database administrator is not considered a full-time role for a specific project because, in many situations, the administrator is responsible for the entire environment, which contains implementations of multiple projects.
  • An Infrastructure Specialist
    This role requires at least 5 years of experience in planning and implementing server and network infrastructure for database/data warehouse solutions. This person is responsible for selecting the best hardware solution with the appropriate specification for the solution, the right set-up, the right tool-set, performance, high availability and disaster recovery. This is not a full-time member role for the project.
  • ETL/Database Developers
    An engineer who possesses at least 3-4 years of experience in programming and integration. He should be aware of the design patterns used with ETLing and must be familiar with the tool-set and the platform used. In addition to knowledge of ETLing, this role requires programming knowledge, as some integration modules need to be written using managed code. Experience with different database management systems and integrating with them is required for this role, which is responsible for implementing the ETL solution as per the architecture designed. At the initial stage, this is not a full-time member role. Once the design phase is done, the team includes this role as a full-time member.
  • Business Users
    The project requires this role as a full-time member, and the business user is responsible for providing the right requirements. This role requires a thorough knowledge of the domain and all business processes. He is responsible for ensuring that what the business intelligence solution offers for business questions is accurate.
  • Data Modelers
    Data Modelers are responsible for designing data models (more on server-level models) for supporting the analysis done by business users. This role requires technical knowledge of modeling and familiarity with the tool-set used for modeling. This is not a full-time member role.
  • Data Stewards
    Data Stewards can be considered business users, but they are specialized in a subject area. This is not a technical role, and it is responsible for the quality and validity of data in the data warehouse. This is not a full-time member role, and they are also referred to as data governors.
  • Testers
    This role is responsible for the quality of the entire solution, specifically the output. Testers with at least 2-3 years of experience are required for the project, and they work closely with business analysts and business users. This role joins the team in the middle of the project and works until completion.

Saturday, April 11, 2015

Object Name Resolution - Stored Procedure

SQL Server does not resolve names of stored procedures exactly as it does for tables; the resolution is a bit different. Here is the post I made on Object Name Resolution - Table: http://dinesql.blogspot.com/2015/04/object-name-resolution-table.html. Let's see how this happens with stored procedures.

Have a look on the image below;


As per the example shown in the image, Jane executes the GetOrders stored procedure without adding its schema. In order to resolve this name, SQL Server initially adds the sys schema. Since that does not succeed, it adds the user's schema, which is HR. As the stored procedure has been created under HR, Jane successfully executes the query.

Jack executes the same, and SQL Server tries sys.GetOrders. Since that is unsuccessful, SQL Server adds his schema, which is FN. Again SQL Server cannot resolve an object called FN.GetOrders. SQL Server tries once more with a different schema, which is dbo. That is the last try, and as it gives the same unsuccessful result, Jack gets an error.

Just like with table names, make sure you add the schema to stored procedures when executing them. This makes sure that SQL Server does not need to spend extra time resolving them and avoids overhead and errors.
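As a minimal sketch based on the example above (the schema and procedure names come from the illustrated scenario, not from a real database):

-- Schema-qualified execution: no extra resolution steps needed
EXEC HR.GetOrders;

-- Non-qualified execution: SQL Server has to try sys, the user's schema and then dbo
EXEC GetOrders;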

Object Name Resolution - Table

How does SQL Server resolve object names if you have not referred to them as fully qualified objects? If it is not a fully qualified object (a statement like below);

USE AdventureWorks2014;
GO

SELECT * FROM SalesOrderHeader;

it goes through additional steps to make it a fully qualified object in order to resolve the name. This is an additional cost for the execution engine and adds extra time to the execution of the query. Not only that, if it cannot resolve the object, it throws an Invalid Object error even though the object exists. Have a look at the image below.


Let's talk about the first example. Jane executes a query against SalesOrderHeader. She has not mentioned the schema of the table, hence it is treated as a non-fully-qualified object. SQL Server tries adding her default schema, which is Sales. Once the schema is added, it becomes Sales.SalesOrderHeader and SQL Server can find the object. She executes the query successfully, with an additional overhead.

Jack tries to execute the same query. Since Jack's schema is Finance, SQL Server initially tries Finance.SalesOrderHeader. Since that does not succeed, it makes its second try, which adds the dbo schema. Again, dbo.SalesOrderHeader does not exist, hence it fails. Jack gets an error.

Realized how important it is to mention the schema? Make sure the objects you mention in your queries have fully qualified object names, avoiding extra work and errors.
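For completeness, here is the fully qualified version of the earlier query; AdventureWorks2014 keeps the table under the Sales schema:

USE AdventureWorks2014;
GO

SELECT * FROM Sales.SalesOrderHeader;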


Wednesday, April 8, 2015

What are ROLLBACK WORK AND COMMIT WORK?

I am sure that you are familiar with ROLLBACK TRANSACTION and COMMIT TRANSACTION, but have you heard about or used ROLLBACK WORK and COMMIT WORK?

ROLLBACK WORK and COMMIT WORK work exactly the same way as ROLLBACK TRANSACTION and COMMIT TRANSACTION. The WORK keyword is optional, and this form is ISO-compatible.

Can we use these instead of what we have been using? Yes, it is possible and there is no harm at all. The only missing part is that this form does not accept a user-defined transaction name.

Here is a sample code using ROLLBACK WORK and COMMIT WORK.

USE AdventureWorks2014;
GO

BEGIN TRAN

UPDATE Production.Product
 SET Color = 'b'
WHERE Color = 'Black';

-- simple logic to test
IF @@ROWCOUNT > 10
BEGIN

 ROLLBACK WORK;
 PRINT 'transaction rolled back';
END
ELSE
BEGIN
 
 COMMIT WORK;
 PRINT 'transaction committed.';
END

Tuesday, April 7, 2015

Some records are locked, can I skip them and read the rest? (READPAST)

Assume that you have a table with thousands of records and some of the records are being updated by another user. Now you try to read some records from this table. If your request needs to access records that are being updated by the other user, your request will not be satisfied immediately; you have to wait until the other user completes his transaction. This is the default behavior and this is what we refer to as blocking. Here is an example;

USE AdventureWorks2014;
GO

-- Starting a transaction and updating 93 records
BEGIN TRAN
 UPDATE Production.Product
  SET Color = ''
 WHERE Color = 'Black';

-- Open a new window and run this
-- You will not be able to see records
SELECT * FROM Production.Product;
GO

-- Stop the query and see locks placed
-- You should see the records exclusively locked
sp_lock;


What if you need to read only the available records? If all you need is your recordset and you have no issue with skipping locked records, getting the available records without waiting can be done with a hint. The hint that has to be used for this is READPAST.

READPAST instructs the SQL Server engine not to read records locked by other transactions and to continue the reading process. Here is the same code using the READPAST table hint.

-- Use the table hint.
-- This will skip 93 locked records
-- and show the rest
SELECT * FROM Production.Product (READPAST);
GO

Sunday, April 5, 2015

SQL Server Brain Basher of the Week #006

Let's talk about date and time. Have a look at the following code and try to guess the value returned by the last SELECT statement.

USE tempdb;
GO

CREATE TABLE dbo.TestTable
(
 Code char(4) NOT NULL
 , TimeRecorded datetime NOT NULL
)
GO

INSERT INTO dbo.TestTable 
 (Code, TimeRecorded)
VALUES 
 ('E001', '12/31/2014 23:59:59:999');
GO

SELECT TimeRecorded FROM dbo.TestTable;

The date and time we entered relate to 2014, but if you run the last SELECT statement, it will show a date in 2015; it will show the value as 2015-01-01 00:00:00.000.

How can this happen? It happens because of the "milliseconds" added to the value. What we have to remember is the accuracy of the datetime data type. The accuracy of datetime is rounded to increments of .000, .003, or .007 seconds, hence .999 is rounded off to .000, increasing the second by 1. This changes the entire datetime value inserted. That is why we see the 2014 date as 2015.

Considering this fact, if your application needs accuracy at the millisecond level, datetime2 should be used instead of datetime.
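Here is a small sketch illustrating the difference, reusing the value from the example above (and assuming the default mdy date format, as in the original INSERT):

-- datetime rounds .999 up to the next second, datetime2 keeps it
DECLARE @dt  datetime  = '12/31/2014 23:59:59.999';
DECLARE @dt2 datetime2 = '12/31/2014 23:59:59.999';

SELECT @dt AS DateTimeValue, @dt2 AS DateTime2Value;
-- DateTimeValue : 2015-01-01 00:00:00.000
-- DateTime2Value: 2014-12-31 23:59:59.9990000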

One more thing to remember. If you write a query like below to get some records related to 13th of June 2014 from a table;

SELECT Id
FROM dbo.Table
WHERE Date BETWEEN '06/13/2014 00:00:00:000' 
   AND '06/13/2014 23:59:59:999';


You might get records related to the 14th of June too. Again, the reason is the same. It can be overcome by changing the query as below;

SELECT Id
FROM dbo.Table
WHERE Date >=  '06/13/2014 00:00:00:000' 
   AND Date < '06/14/2014 00:00:00:000';

Friday, April 3, 2015

Cumulative updates for SQL Server 2012 and 2014

Few Cumulative updates have been released for SQL Server 2012 and 2014. Here are the links for them:

Cumulative update #6 for SQL Server 2014
Here is the link for downloading: https://support.microsoft.com/en-us/hotfix/kbhotfix?kbnum=3031047&kbln=en-us
Here is the link for the list of bugs fixed: http://support.microsoft.com/en-us/kb/3031047

Cumulative update #5 for SQL Server 2012 SP2
Here is the link for downloading: https://support.microsoft.com/en-us/hotfix/kbhotfix?kbnum=3037255&kbln=en-us
Here is the link for the list of bugs fixed: http://support.microsoft.com/en-us/kb/3037255

Cumulative update #15 for SQL Server 2012 SP1
Here is the link for downloading: https://support.microsoft.com/en-us/hotfix/kbhotfix?kbnum=3038001&kbln=en-us
Here is the link for the list of bugs fixed: http://support.microsoft.com/en-us/kb/3038001

What are hotfixes, Cumulative Updates and Service Packs?

No matter how thoroughly software has been tested, issues can still be encountered. The same applies to Microsoft SQL Server, and that is the reason we see numerous releases for the same version. These releases are published as a hotfix, a cumulative update, or a service pack, but what is the right time for applying them? Should we apply them as soon as they are published?

Let's try to understand each of these SQL Server updates;
  • Hotfixes: These are built and released for addressing urgent customer concerns. Most of the time, this is done based on an issue reported by a customer (or a few customers) and the fix is released immediately. Since it addresses a particular issue and the release is done in an urgent manner, not much testing is done, hence it is not recommended to apply a hotfix to a production environment as soon as it is released unless the same issue is being experienced.
  • Cumulative Updates: A cumulative update is a bundle that holds a periodic roll-up of hotfixes. It goes through testing comparatively higher than the tests done for each individual hotfix, and the bundle is tested as a whole. In a way, it is safer and more advisable to apply than applying hotfixes individually. However, if possible, it is always better to wait for a service pack release, as explained below.
  • Service Packs: A service pack is a periodic release that addresses many issues which may (or may not) have been addressed by previous hotfixes and cumulative updates. It goes through thorough testing, and it is safe to apply to a production environment.

Thursday, April 2, 2015

Should I rollback my explicit transaction or will XACT_ABORT do it for me?

Here is another question I got this week. It is all about rolling back transactions: do we need to write code with ROLLBACK TRAN for rolling back, or can SQL Server automatically roll back the transaction if XACT_ABORT is on?

Many believe that once the transaction is handled using a TRY/CATCH block, rolling back is guaranteed when an error or unexpected issue is thrown. In most cases this is true: when something goes wrong, execution jumps to the CATCH block and executes the ROLLBACK TRAN statement. But what if execution is unable to jump to the CATCH block? Can that happen? Yes, it is possible. Assume that the logic is implemented in a stored procedure and it is executed by an application. While it is being executed, if something happens at the application end and the application disconnects from the database, the code stops running before executing the ROLLBACK TRAN. Not only that, this leaves the transaction open in the database, keeping all the locks placed without releasing them.

This type of issue can be solved using the XACT_ABORT setting. All we have to do is add SET XACT_ABORT ON inside the stored procedure, because XACT_ABORT makes sure all transactions are rolled back and the batch is aborted when an error occurs. For more info, read: https://msdn.microsoft.com/en-us/library/ms188792.aspx

Does this mean that we do not need the TRY/CATCH block and error handling? No, we still need them because that is the way of capturing errors and handling them.

Here is an example. This creates a table and stored procedure for inserting records.

Code 1
USE tempdb;
GO

IF OBJECT_ID('dbo.Account', 'U') IS NOT NULL
 DROP TABLE dbo.Account;

CREATE TABLE dbo.Account
(   
 AccountId INT NOT NULL PRIMARY KEY, 
 Name    NVARCHAR (50) NOT NULL,
 Balance Money NOT NULL   
)
GO

CREATE PROCEDURE dbo.InsertAccount
AS
BEGIN

 BEGIN TRY

  BEGIN TRAN

  INSERT INTO dbo.Account(AccountId, Name , Balance) 
  VALUES(1, 'Account1', 10000)
  -- This throws an error
  INSERT INTO dbo.Account(AccountId, Name , Balance) 
  VALUES(1, 'Duplicate', 10000)
  INSERT INTO dbo.Account(AccountId, Name , Balance) 
  VALUES(2, 'Account2', 20000)
 
  COMMIT TRAN
 END TRY
 BEGIN CATCH
  SELECT ERROR_MESSAGE()
  ROLLBACK TRAN
 END CATCH
END
GO

Now let's open a new connection and execute the stored procedure.

Code 2
USE tempdb;
GO

EXEC dbo.InsertAccount;

SELECT * FROM dbo.Account;

Here is the result.


As expected, it throws an error. If you check for active transactions using the code below, you should NOT see the transaction we started, as it has been handled.

Code 3
SELECT 
       est.transaction_id as TransactionID,
       tas.name as [Transaction Name],
       DB_NAME(tds.database_id) as DatabaseName
FROM sys.dm_tran_active_transactions tas 
 INNER JOIN sys.dm_tran_database_transactions tds 
  ON tas.transaction_id = tds.transaction_id
 INNER JOIN sys.dm_tran_session_transactions est 
  ON est.transaction_id=tas.transaction_id
WHERE est.is_user_transaction = 1 
AND tas.transaction_state = 2 
AND tds.database_transaction_begin_time IS NOT NULL

Now let's make a small modification to the stored procedure and execute the code in a different way. Insert a WAITFOR statement between the first and second INSERT statements as below, and alter the procedure.

Code 4
  INSERT INTO dbo.Account(AccountId, Name , Balance) 
  VALUES(1, 'Account1', 10000)

  WAITFOR DELAY '00:00:20';

  INSERT INTO dbo.Account(AccountId, Name , Balance) 
  VALUES(1, 'Duplicate', 10000)
  INSERT INTO dbo.Account(AccountId, Name , Balance) 
  VALUES(2, 'Account2', 20000)


Now open a new connection and execute Code 2 again. Open another connection immediately and execute Code 3 too. You should see the following record with Code 3.


Go back to Code 2 immediately and stop the execution. You need to make sure you do this within 20 seconds. Once you have stopped the code, go back to Code 3 and execute it again. You should see that the transaction still exists.

The same can happen with an application too. We stopped the execution, but the transaction still exists in the database, locking some resources. Let's make another modification to the stored procedure and test the same scenario. Change the procedure as below (only the top of the procedure is shown; the rest remains the same) and recreate it.

Code 5
ALTER PROCEDURE dbo.InsertAccount
AS
BEGIN

 SET XACT_ABORT ON
 BEGIN TRY

  BEGIN TRAN


Now do the same thing: open a new connection, execute Code 2, and with a different connection execute Code 3; go back to Code 2, stop it, and execute Code 3 again. You should NOT see the transaction now because it has been taken care of by XACT_ABORT.