Monday, July 6, 2015

SQL Server 2016 features: Polybase: SQL Server connector to Hadoop

Polybase is not a new product to Microsoft SQL Server product suite but it is new to many of us or it is not a well known component to developers as it was limited to one edition. Microsoft started adding Big Data related components to Microsoft platform long time back and as a result of it, HDInsight (Hadoop in Azure) and Analytical Platform System (Parallel Data Warehouse) appliance were introduced. Poybase was part of APS and it was limited to APS edition. In a way, it acted as a bridge between Hadoop and relational databases, allowing us to query files (all structured, semi-structured and unstructured) in the Hadoop using familiar language: SQL.

Good news is, now Microsoft has included this functionality to other editions as part of Microsoft SQL Server 2016. It allows you to process data in files (obviously large files) using SQL as they were tables in the database when files are stored in Azure Blob Storage or Hadoop.

As you know, SQL Server 2016 is still CTP and it is CTP 2.1 when this post is written. This post shows how to use Polybase when text files are stored in Azure Hadoop. Note that the code written is based on current version and implementation may be changed with future releases.

How do you start? First of all, you need SQL Server 2016 installed in your machine. Here are the steps for installing it: Installing SQL Server 2016 CTP 2.1 - Step by Step and Issues faced

Once installed, you need to enable hadoop connectivity using sp_configure. If you run sp_configure, you will see that zero is set with it.

Since I am going to access Hadoop configured in Azure, it needs a value like 4. For more information on values can be set, read this:

Once set, restart SQL Server Service and Hadoop related services (shown in the image below). Note that those services are not visible via SQL Server Configuration Manager.

In order to test Polybase, we need a HDInsight cluster configured with Azure. If have already created one, make a note on storage name and primary key of it. If you have not created, have a look on this post for creating one and accessing it via CloudXplorerHow to navigate HDInsight cluster easily: CloudXplorer

Let's create a simple text file for testing and place it in HDInsight. I have created a simple text file (of course, it is too small and not a best example for Hadoop, but will use it for testing) and uploaded to a new folder called testfolder in HDInsight using CloudXplorer.

Here are the next steps:

1. Enable 4631 flag using TRACEON and create a Credential for accessing Azure storage.

-- create a database for testing

-- connect with it
USE TestPolybase;

-- enable 4631
DBCC TRACEON(4631, -1);

-- create the master key

-- create credential for the azure storage
-- Note that even though this creates it
-- I had an error with CREATE EXTERNAL DATA SOURCE
-- hence, did not use this
-- changed core-site.xml instead

--CREATE CREDENTIAL Azure01DineshCredential
--WITH IDENTITY = 'azure01dineshhdistorage' -- storage name
-- , SECRET = 'abc23232njjj'; -- primary key

However, this did not work for me. CREATE CREDENTIAL allowed me to create it but I could not use it with CREATE EXTERNAL DATA SOURCE. It threw following error with it;

Msg 46516, Level 16, State 26, Line 16
The specified credential cannot be found

The reason for this could be, not adding credential to the database but to the server. However I could not add it to the database too;

Therefore, I added the credential manually to core-ste.xml file. File is located generally in:
 C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf 

For more info on this, read this:

2. Then we need to create an External Data Source. This sets the location we need to access for querying files. Note that I have not used CREDENTIAL keyword and LOCATION has been set with storage name and container name.

-- create the external data source
-- note that CREDENTIAL has not been used
 -- wasbs://
 LOCATION = 'wasbs://'
 --, CREDENTIAL = Azure01DineshCredential

For more info on this, read this:

3. Next step is, creating a File Format. This helps to understand the file we have stored when reading the file as a table.

-- create the file format required

For more info on this, read:

4. Now we need to create an External Table that references data stored in the location we specify. Here is the code for creating the table. Since I have stored the Log.txt file in testfolder created in HDInsight, Location is set with '/testfolder'. Data_Source is set with the one I created and File_Format is set with the format created. Reject_type and Reject_value indicate that ignore 10 invalid records.

-- create the table using file format created
-- and for the file uploaded
 id int
 , userid int
 , nofseconds int
 LOCATION = '/testfolder'
 , DATA_SOURCE = Azure01DineshHDIstorage
 , FILE_FORMAT = FileFormat

For more info on this, read:

All done. If you go through Object Explorer, you should see all we have created.

Now we should be able to query the file like below;

Still this is still being implemented, we cannot test many things against Hadoop but we can understand how it is going to be with future releases.

Sunday, July 5, 2015

SQL Server Brain Basher of the Week #019 - TSQL

Here is a challenge related to TSQL.

What would be the result of below query?

(1, 'abc'),
(2, 'lmp'),
(3, 'xyz')) AS d (Id, Value);

  1. Error that says Incorrect syntax near the keyword 'values'..
  2. Error that says Column name or number of supplied values does not match table definition.
  3. Error that says Incorrect syntax near the keyword 'SELECT'.
  4. Tabular resultset with two columns and three rows.
Though it looks like that the syntax of the code is wrong, it works without any issue. The reason for this is the enhanced VALUES clause. The VALUES clause was enhanced with SQL Server 2008 for supporting multiple row inserts with a single insert statement. This enhancement was not limited to it, it allows us to construct a derived table as if we construct a standard table value constructor.

Now, you know the answer. If you execute the above, you will see a resultset with two columns named as Id and Value with three rows.

Friday, July 3, 2015

What is Schema-on-write and Schema-on-Read?

When subjects related to Database Management Systems is discussed, the term Schema-On-Write is not widely used (or not popular), though that is what we have been using as a method for writing and holding data. This method forces the record to be matched with the defined schema before writing it into the storage. For example, in order to hold Customer data, we create a table in a traditional database with appropriate columns, with fixed data types. When a record is inserted, the record has to be aligned with defined columns (defined structure) and the record gets validated (via columns defined, type of them and constraints) before inserting to the table. This slows down the insert operation but consistence of the record is guaranteed. This method is called as Schema-on-write.

Schema-On-Read is a method that matches data to the schema as it is read from storage. Traditional database management system does not employ this but new platform like Hadoop uses methods like this for processing data, specifically on semi-structured and unstructured data. For example, if there is a dataset that holds some log records formatted as semi-structured or unstructured, content of it can be read with a defined schema as per the requirement. The requirement may need only few elements of the content and schema defined for reading addresses only required elements. There can be another requirement on the same content but needs to read differently, new schema is applied when reading. This slows down reading operation as data that is read has to be checked with defined schema.

Thursday, July 2, 2015

Installing SQL Server 2016 CTP 2.1 - Step by Step and Issues faced

Microsoft has made SQL Server 2016 CTP 2 available and it can be downloaded and installed for seeing new features. It is available for download at:

Since this is still CTP, you may get issues while installing and working with it, therefore it is always recommended to install this in a separate box. If you do not have a physical box, install it in a VM.

Here are the steps for installing SQL Server 2016 CTP 2.1. I have added only most relevant steps with issues I faced and workaround applied.

As usual, you need some prerequisites. Make sure following are installed with your machine (or VM) before launching installation.

Microsoft .NET Framework 3.5 Service Pack 1

Oracle JRE 7 Update 51 (64-bit)

Once installed, start with SQL Server installation setup;

1. Once the setup is launched, SQL Server Installation Center appears, select Installation tab and click on New SQL Server stand-alone installation or add features to an existing installation.

2. When you get Product Key page, DO NOT SELECT Specify a free edition. If you select it, installation might get stuck and you will not be able to continue.

Select Enter the product key radio button and click on Next. You do not need a key to continue.

3. Accept license terms and continue with License Terms page.

4. Next is Microsoft Update. It is always better to check for updates.

5. Continue with Install Setup Files.

6. Continue with Install Rules. Make sure that status of everything is passed.

7. Next is Setup Role. Select SQL Server Feature Installation and click Next.

8. Select your required features from Feature Selection. This is what I selected.

9. If all prerequisites are not installed, it will be reported with Feature Rules page. If no issues are reported, click on Next to continue.

10. Since it is a new installation, install it as the default instance. If need, it can be installed as a Named instance too.

11. Next is Server Configuration page that allows you to configure service accounts. Configure them as you want and continue.

12. Make sure Mixed Authentication is enabled with Database Engine Configuration page and if required, change folder path for databases with Data Directories tab. Add users who you need as administrators to the instance.

13. If Analysis Services is selected, Analysis Services Configuration page will appear. Select Server Mode and then add users who need administrative permission on Analysis Services if you have this page.

14. Select the Reporting Services Mode if you have selected Reporting Services as part of the installation.

15. Once everything is done, summary will be displayed, analyze it and continue.

16. Installation Progress page will show you the progress of the installation.

17. Once the installation is completed, final page will be shown with a message.

You can have a look on new things related to SQL Server 2016 with usual window: Management Studio. I have highlighted some new things appeared when opened it, you will find more.

At this moment, no AdventureWorks databases are available for 2016 but older version databases can be used for testing.

Wednesday, July 1, 2015

Why do we need Hadoop and What can we do with it? [Hadoop for SQL Developer]

If someone comes and tells;

We have a large dataset, let's use Hadoop for processing this.

What would be your answer? If you are an IT enthusiast, the most common and expected answer would be: Yes, why don't we? But is it the right way of using Hadoop? We love technology but it does not mean that we have to always go for the latest or most popular one even though it is not the right one for the current problem. You may solve the issue with traditional Database Management System without implementing Hadoop, or you may implement Hadoop because there is no other options and it is the right way of addressing the problem. How do you decide?

Let's try to understand Hadoop first:

Apache Hadoop is a scalable and fault-tolerance Open Source Framework that is highly optimized for distributed processing and storage run on inexpensive hardware (commodity hardware).

Okay, why we need it?

The simplest answer for this is, we can process a massive amount of data using Hadoop for seeing the insight, quickly and efficiently.

What else, is it all about processing data? Generally, we can use Hadoop for following;
  • Hadoop as an ETL platform
    A common requirement on ETLing with Big Data is, processing an unstructured, a large amount of data and make a structured result. This is not an easy operation with traditional DBMSs and ETL tools. Therefore, Hadoop can be used for processing an unstructured dataset and producing a structured dataset.
  • Hadoop as an Exploration Engine
    Analysis requires complex logic to apply on structured, semi-structured and unstructured data. Hadoop offers many tools for analyzing data efficiently, providing high performance on analysis as data stored in Hadoop cluster.
  • Hadoop as a data storage
    Since scalability is part of the Hadoop, a massive amount of data storing with Hadoop is beneficial. It automatically provides fault-tolerance and high availability. As data is replicated, at least with three nodes, data read operations can access the best node, even when one node is not available due to a fault. Not only that, when more space is required, more nodes can be added without any limitations for expanding the cluster.

Tuesday, June 30, 2015

Double click on .sql file opens a blank Management Studio Windows

This happens mostly with older version of SQL Server such as 2005. When a SQL file with .sql extension is double-clicked, it opens Management Studio but file does not get opened. If you experience the same, all you have to do is, modify the registry entry in sqlwb.sql.9.0. It requires to add "%1" as a placeholder to sqlwb.sql.9.0 -> Shell -> Open -> Command -> Default as below.

"C:\Program Files (x86)\Microsoft SQL Server\90\Tools\Binn\VSShell\Common7\IDE\sqlwb.exe" /dde

"C:\Program Files (x86)\Microsoft SQL Server\90\Tools\Binn\VSShell\Common7\IDE\sqlwb.exe" "%1" /dde

Monday, June 29, 2015

.sql files are opened with older version of Management Studio, how can I get it opened with new version of Management Studio?

If you have two or more versions of SQL Server installed in your machine, most of the time, when you open a .sql file by double-clicking on it, it gets opened with older version of SQL Server. This happens because of the registry-entry registered with the registry. If you are using latest operating systems like Windows 8, then it can be simply changed by accessing Control Panel -> Default Programs -> Set Associations. It allows you to set the application for .sql extension, you can either pick one from the list or browse for ssms.exe.

If the 5 Operating System is different and you have no way of setting it, you can try changing the registry entry and see. Not sure whether this is the right way of setting it but it worked for me. If you have to do it, test it in an environment configured for testing before applying it to production environment.'

If you navigate through registry (you can open it via regedit.exe), HKEY_CLASSES_ROOT has a node for .sql.

As you see, default is set to ssms.sql.12.0. If you scroll down through nodes, you will see this node and how it is registered.

Now if you want to get your .sql opened using a different version, find a relevant node (like ssms.sql.12.0, ssms.sql.11.0) and set it with .sql.

SQL Server Brain Basher of the Week #018 - Single insert with multiple rows

Here is a basic concept you should know. It is about inserting records, specifically inserting multiple records with a single insert statement

Can we insert multiple rows using a single insert statement?

Yes, this is possible, however code has to be constructed as per the version of SQL Server. SQL Server 2008 started supporting an enhanced VALUES clause that allows us to submit multiple rows separated by comma. If the version is lower than 2008, UNION ALL has to be used for constructing the code that helps us to build a virtual table on values.

Note that both these operations are processed as an atomic operation, hence a failure on one row fails the whole operation, inserting none of the records into the table.

Here is a sample code.

-- Connecting with the temp table
USE tempdb;

-- Creating a test table
CREATE TABLE dbo.TestTable
 , Value varchar(20) NOT NULL

-- Inserting multiple rows with enhanced VALUES clause
-- This is possible only with SQL Server 2008 or later
INSERT INTO dbo.TestTable
(1, 'abc'),
(2, 'abc'),
(3, 'abc')

-- Inserting multiple rows with UNION ALL
-- This is what we have used with older version
INSERT INTO dbo.TestTable
SELECT 6, 'abc';

SELECT * FROM dbo.TestTable;

Sunday, June 28, 2015

Indexes should be dropped before a bulk insert or not

If we are loading a large volume of data into an indexed table, we always drop the index, load the dataset, and re-create indexes assuming that the overhead of dropping and re-creating indexes is less than the overhead of loading data with the indexes in place. But for certain situations, this assumption may not give any benefits. When loading a small dataset, dropping and re-creating may be counterproductive, and may take more time for re-creating than the time it takes for loading data with indexes in place.

Considering that, how do we determine whether we should drop the indexes before loading or insert all records with indexes in place?

Microsoft has given set of guidelines for this. It is based on minimal amount of new data to be loaded, proportion to the total data in the table. If your new data load is equal or greater than the percentage given, it is recommended to drop the indexes before loading and re-create them after loading.

Indexes Relative amount of new data
Clustered index only 30%
Clustered and one nonclustered index 25%
Clustered and two nonclustered indexes 25%
Single nonclustered index only 100%
Two nonclustered indexes 60%

For example, if you have 100,000 records in the table that has a clustered index only and have to load 3,000 new records, it is better to drop the index before loading and re-create it afterward.

You can read more info on this at:

Saturday, June 27, 2015

Ways of incrementally loading data to a data warehouse

Once I wrote a post on Incremental ETL (Incremental ETL: How to identify changes for fact data when no support from the source) that discusses how to use Checksum for identifying changes before transferring the extracted data from the source, specifically when source does not support identifying changed records. This raised a question: what are the other ways of incrementally loading a data warehouse, hence this post speaks about them.

There are multiple ways of loading incremental data into data warehouse. One has to be picked based on many factors such as performance, whether data warehouse accepts updates, whether data warehouse maintains history and whether staging environment is maintained (Do we need a staging database for warehousing projects?). Here are most common ways used;

CDC enabled source tables
If the source table is enabled with CDC (Changed Data Capture), then changes can be easily identified and warehouse table, either dimension or fact can be updated appropriately. If ETL is written with SSIS, CDC related tasks can be used for identifying changes and splitters can be used for identifying the type of the change such as INSERT, UPDATE, DELETE and the update the warehouse table accordingly.

The Slowly Changing Dimension transformation
This is the most common used one for loading dimension (not very large dimension) tables as it supports identifying changes comparing source and warehouse, doing inserts and updates, and most importantly handling type 2 type slowly changing dimensions. This does not require source maintaining high water mark and can be easily implemented using the wizard given with SSIS.

The MERGE statement
This T-SQL statement allows us to perform insert, update and delete operations using a single statement combining source and destination (warehouse). Since the code has to be executed with SQL Server engine, SQL Server instance that hosts the warehouse should be able to access the source. This accessibility is possible when source is in one of the databases in the same SQL Server instance or source can be linked as a linked server to SQL Server instance. The most common practice is, loading data from the source and performing necessary transformations using SSIS and then transformed data is loaded to a temporary table in the warehouse. A stored procedure in the warehouse then performs merging using MERGE statement.

Using Checksum
This technique is used when there is no way of identifying changes at the source and load is too heavy for Slowly Changing Transformation. This is usually implemented with staging environment and checksum value is used for identifying whether the record is changed or not. Read more on this at:

The Lookup transformation
This SSIS transformation is used for smaller dimensions. This is not specifically given for handling this scenario but it can be used for comparing source data with existing data in the warehouse and find matching records.

Considering all five above, me preferred ones are Checksum and MERGE because they offer flexibility, maintainability and most importantly good performance.

Friday, June 26, 2015

Does Change Data Capture (CDC) capture TRUNCATE operation?

One of the sessions we had with SQL Server Sri Lanka User Group Meet-up yesterday was on Change Data Capture which is widely used for capturing changes on table modifications. At the end of the session, there were few questions on it for the sepaker, and I was asked a question on it too. The question was "Does CDC capture TRUNCATE operation?".

Considering the factors related to CDC, I gave the answer as "It should be" but later I realized that it may not possible because TRUNCATE does not log records immediately in the log file. In this case, what is the response from CDC for TRUNCATE?

Following code creates a database and a table, and enable CDC. The it inserts records and see how CDC has captured them.

-- create a database for testing

-- Connect with the database and create a table
USE TestDatabase;

CREATE TABLE dbo.Product
 , Code char(5) NOT NULL
 , Name varchar(50) NOT NULL
 , Color varchar(20) NOT NULL

-- Enable CDC for the database
EXEC sys.sp_cdc_enable_db;

-- Enable CDC for the table
-- Make sure Agent is running before executing this
EXEC sys.sp_cdc_enable_table 
   @source_schema = N'dbo'
 , @source_name   = N'Product'
 , @role_name     = NULL
 , @supports_net_changes = 1;

-- Inserting records
INSERT INTO dbo.Product
  (1, 'P0004', 'Product 1', 'Black')
, (2, 'P0004', 'Product 2', 'White')
, (3, 'P0003', 'Product 3', 'Red')
, (4, 'P0004', 'Product 4', 'Black');

--Define the LSN range 
DECLARE @from_lsn binary(10), @to_lsn binary(10) 
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_Product') 
SET @to_lsn   = sys.fn_cdc_get_max_lsn() 

-- Get all the changes for the table 
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Product (@from_lsn, @to_lsn, N'all') ;

Now let's delete all records and see how it captures.

-- deleting all records
DELETE dbo.Product;

--Define the LSN range 
DECLARE @from_lsn binary(10), @to_lsn binary(10) 
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_Product') 
SET @to_lsn   = sys.fn_cdc_get_max_lsn() 

-- Get all the changes for the table 
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Product (@from_lsn, @to_lsn, N'all') ;

Let's insert all records again and truncate the table.

-- Inserting records
INSERT INTO dbo.Product
  (1, 'P0004', 'Product 1', 'Black')
, (2, 'P0004', 'Product 2', 'White')
, (3, 'P0003', 'Product 3', 'Red')
, (4, 'P0004', 'Product 4', 'Black');

-- truncating the table

As you see, truncation is not possible with this table. The error message clearly says that truncate operation is not allowed with CDC enabled tables.

For the question I was asked, the answer I gave should be corrected. The answer should be "truncate operation is not possible with CDC enabled tables".

Wednesday, June 24, 2015

SQL Server Integration Services (SSIS) Feature Pack for Azure is released

Microsoft has released SSIS Feature pack for Azure that allows us to connect with Azure, transfer data between azure and on-premises data sources, and most importantly process data using HDInsight. It release includes following components;

Connection Managers

  • Azure Storage Account Connection Manager 
  • Azure Subscription Connection Manager


  • Azure Blob Upload Task 
  • Azure Blob Download Task
  • Azure HDInsight Hive Task
  • Azure HDInsight Pig Task
  • Azure HDInsight Create Cluster Task
  • Azure HDInsight Delete Cluster Task

Data Flow Components

  • Azure Blob Source
  • Azure Blob Destination

Azure Blob Enumerator 

  • for Foreach Loop Container

Download the feature pack for SQL Server 2012 from here
Download the feature pack for SQL Server 2014 from here
Once it is installed, items should be appeared as below images;

Control Flow Items:

Data Flow Items:

Tuesday, June 23, 2015

How to block users updating records I have selected/read

We know that, with the default behavior, records are getting exclusively locked when we perform modifications against data, disallowing other users to access them. Until we complete the transaction, exclusive locks will be held by SQL Server, making sure that no one not only modify but execute SELECT statements too. However default settings does not hold locks obtained for SELECT statements blocking other users reading data you have read. But what if you need it?

If you want to make sure that no one modify records which you have read for your operation, you need to change the default Isolation Level and select data within a Transaction. The Isolation Level determines the behavior of concurrent users who read or write. Default Isolation Level is READ COMMITTED which locks records using shared locks for reading but does not hold them until the transaction is completed. In order to keep the obtained shared locks until the transaction is completed, Isolation Level has to be brought up to REPEATABLE READ Isolation Level.

REPEATABLE READ keeps acquired shared locks for read data till the end of transaction. This makes sure that no one can modify records we have read and avoid a concurrency issue called Inconsistency Analysis.

Here is the way of doing it.

Open a new connection and execute the following;

-- First connection
-- setting the isolation level

-- start a transaction

 -- executing a select query
 FROM Sales.SalesOrderHeader
 WHERE YEAR(OrderDate) = 2011;

Then try to modify a record related to above query with another connection.

-- Second connection

-- Try to retirve data
-- This works
FROM Sales.SalesOrderHeader
WHERE YEAR(OrderDate) = 2011;

-- Try to update a record
-- This will NOT work
UPDATE Sales.SalesOrderHeader
 SET ShipDate = DATEADD(dd, 1, shipdate)
WHERE SalesOrderID = 43660;

As you see, second connection will not be able to modify records until the transaction related to first connection is done.

Sunday, June 21, 2015

SQL Server Brain Basher of the Week #017 - Alternative names for objects

Microsoft SQL Server does not force you with hard rules for naming objects such as tables but some recommendations. Your tables, your procedures, you functions can be named as you want and we generally use Pascal notation for naming objects (for more info on notations: Camel notation, Pascal Notation, Hungarian Notation). Sometime this leads to have very lengthy names and not user-friendly in some context. The Brain Basher of the week is based on it;

Can I have alternative names for my objects?

This can be achieved with Synonym Objects. It allows us to add another name on objects that are exist in either local or remote server. For an example, assume that you have a table called dbo.SalesOrderHeaderForAsiaRegion and you want to have a shorter name for this, a synonym can be created on this as AsiaSales. Another example is, assume that you need to access a table called Customer exist in your database and link with another table called Customer in a different database in another instance, a synonym can be created for second Customer table exist in second instance on first instance.

Here is a sample code for this;

USE AdventureWorks2014;

-- Check existing table
SELECT * FROM Sales.SalesOrderHeader;

-- Create synonym on it
FOR Sales.SalesOrderHeader;

-- Checking synonym created
SELECT * FROM SalesHeader;

For more info on Synonym:

Friday, June 19, 2015

Can I create a permanent table inside the tempdb?

Microsoft SQL Server comes mainly with five system databases: master, model, msdb, tempdb and resources. In addition to that some more databases are created with different implementations such as replications. During a discussion, a question related to tempdb came up: What do not we maintain a permanent table inside the tempdb if we need to maintain some data not related to business-related database?

First thing we need to understand is, how these system tempdb works. When SQL Server services is restarted, tempdb gets recreated using the template which is model. Since it is getting recreated, we lose all our works we have done with tempdb before the restart. Considering this fact, should we create a table inside tempdb, other than temporary tables that are prefixed with either # or ##?

Here is an example, this code creates a table inside the tempdb and then it shuts down the service.

-- Connecting with tempdb
USE tempdb;

-- Creating a table (permanent)
CREATE TABLE dbo.TestTable
(Id int);

-- Query and see, this works
SELECT * FROM dbo.TestTable;

-- Sending shutdown singnal

Once manually restarted, we can query the same table and see, result is shown below.

-- Connecting with tempdb
USE tempdb;

-- Query and see, this works
SELECT * FROM dbo.TestTable;

Do you still want to create permanent tables inside the tempdb?

Thursday, June 18, 2015

Looking for a Hadoop Cluster for testing? Let's configure Hortonworks HDP Sanbox

The latest buzzword in IT, or more particularly in data analytic is Big Data. It does not come alone, it always comes with Hadoop which offers distributed storage and processing. Everyone loves to do some experiments with new technologies, or popular technologies, hence everyone loves to do the same with Big Data and Hadoop. But setting it up is not an easy task and cloud HDP subscriptions offered by various vendors are not so flexible in terms of trial-period given. However, if you really want, you can use Sanboxes offered by some vendors for testing Hadoop implementations.

One of the HDP cluster I used for testing is HDInsight. Since the trail is limited, searched for alternatives and found the Sandbox offered by Hortonworks. This sandbox is configured as a self-contained virtual machine and it can be simply used without connecting to cloud. It does not come with multiple nodes, means that all Name Node, Job Tracker, Data Node, etc. are in same virtual machine. You will not be able to get the exact picture of distribution but you can do everything you need to do with Hadoop with this.

Here are the steps for configuring HDP sandbox.

Visit Click on Hortonworks Sandbox under Get Started menu that is the top menu.

This takes you to a Download and Install page. At this moment, HDP 2.2.4 is the stable and reliable version, but it offers HDP 2.3 - Preview too. HDP 2.2.4 comes in three flavors; VirtualBox, VMWare and HyperV. Download the best suited for you.

Make sure you download the Install Guides too. Configuration and usage is same for all three types, for this post, I will assume that you download VirtualBox virtual machine which is Sandbox_HDP_2.2.4.2_VirtualBox.ova. Once this is downloaded, you need to import it, it is fairly straight forward, all instructions are given with the Installation Guide, follow it for importing into your VirtualBox environment (or VMWare or HyperV).

After importing, all you have to do is, click on Start button. If your machine is ready for running virtual machines, it should start without any issue, however it is very common to see following error message with most of VMs;

Failed to open a session for the virtual machine Hortonworks Sandbox with HDP

VT-x is not available. (VERR_VMX_NO_VMX).

Result Code: E_FAIL (0x80004005)
Component: Console
Interface: IConsole {8ab7c520-2442-4b66-8d74-4ff1e195d2b6}

There can be two reasons for that. Once is, not enabling Visualization in BIOS. Second can be, incompatibility with other virtual environments. If Virtualization is not enabled in your machine, boot the machine with BIOS and enable it. If you still get the same error, and if you are running with Windows 8, make sure you disable HyperV. This thread discusses the same, follow it:

And this video shows how to disable HyperV for addressing the same error:

Once everything is done, you should be able to start it and you should see a scree like below.

As it says, there are two ways of accessing this; you can press ALlt+F5 for logging, user id is root and password is hadoop.

Once login you can continue with your commands for working with Hadoop.

In addition to that, GUI is given too. As the first screen explains, open a browser and go for

Then click on for opening hue (Hadoop User Experience). It allows you to do your Hadoop work easily.

Wednesday, June 17, 2015

Generate comma-separated value as a part of GROUP BY

This is not a new thing but it is a common requirement with many implementations, so making a post on it.

Assume that you need to get some values in the group as a comma-separated value instead of running them in an aggregate function. There are  many ways of handling this, some are complex and some are straight-forward. If it is not a complex GROUP BY and table is relatively small, following is the best I have used.

Let's say, we have an EmployeeTelephoneNumbers table like below;

EmployeeId PhoneNumber
1 0777222333
2 0777343565
2 072345345
2 0112456456
3 071578976
3 01154672121

And need to produce a result like this;

EmployeeId PhoneNumber
1 0777222333
2 0777343565, 072345345, 0112456456
3 071578976, 01154672121

As usual, we have to use GROUP BY and perform a different logic for forming the second column without using an aggregate function. For that, first we need to use is FOR XML clause, which allows us to convert the tabular resultset into an XML. It has different modes such as RAW, AUTO, EXPLICIT and PATH and PATH5 has to be used with this.

Next is STUFF function. This function inserts a string into another string deleting the given length.

For more info on STUFF function:

What we have to do is, use FOR XML PATH for generating an XML that contains all values for grouping and using it as a string. Then STUFF will be used for removing the extra characters.

Here is a sample code.

USE tempdb;

-- creating the table
CREATE TABLE dbo.EmployeeTelephoneNumbers
 EmployeeId int not null
 , PhoneNumber varchar(200) not null

-- Inserting sample records
INSERT INTO dbo.EmployeeTelephoneNumbers
 VALUES (1, '0777222333'), (2, '0777343565'), (2, '072345345'), (2, '0112456456'), (3, '071578976'), (3, '01154672121');

-- Generate the result
 , STUFF((SELECT ', ' + PhoneNumber
   FROM dbo.EmployeeTelephoneNumbers
   WHERE EmployeeId = t.EmployeeId
   FOR XML PATH('')), 1, 2, '') PhoneNumbers
FROM dbo.EmployeeTelephoneNumbers t
GROUP BY EmployeeId;

Tuesday, June 16, 2015

Different query patterns you see with HiveQL when comparing with TSQL

In recent past......... a casual conversation turned to a technical conversation......

Friend: It is a surprise to see an MVP studying hard on Open Source, anyway good see you in the club :).

Me: This does not mean that I will be completely moving to Open Source, but I like to lean some additional things and get involved with a good implementation.

Friend: Okay, you have been studying HIVE for last few weeks, tell me three types of queries you have not seen with SQL Server?

Me: Yes, have seen few, let me remind three, you can interchange blocks in SELECT like FROM clause before SELECT, one CREATE TABLE statement for creating table and loading data, and of course duplicating table with CREATE TABLE LIKE :).

Conversation continued with more interesting topics related to two different platforms, will make more posts on my-usage of open source components such as Hadoop, Pig and HIVE for solutions I work on, but thought to make a post on this three items;

There are no big differences between TSQL (or standard SQL) and HiveQL queries, but some are noticeable if you are an experienced TSQL developer. If you have used MySQL, you will not see much as HiveQL offers similar patterns.

Let me elaborate three things I mentioned with my reply, that will surely encourage you to start studying on HIVE if you have not started using it.

SELECT statement of SQL Server always start with SELECT clause and then FROM;

SELECT Col1, Col2
FROM Table1
WHERE Col3 = 1;

HiveQL allows the same but it can be written like this too;

FROM Table1
SELECT Col1, Col2
WHERE Col3 = 1;

If we need to create a table with SQL Server and load data, there are two ways with pros and cons.

 Col1 int
 , Col2 varchar

FROM OldTable;

-- or

SELECT Col1, Col2
INTO NewTable
FROM OldTable;

But HIVE offers an easy way of doing it.

AS SELECT Col1, Col2 FROM OldTable;

When we need to duplicate a table structure, this is what we do with SQL Server;

SELECT Col1, Col2
INTO NewTable
FROM OldTable
WHERE 1=0;

and HiveQL facilitates this;


If I am not mistaken, some of these are supported with APS. Not only that HDInsight supports HIVE in all the ways, let's try to understand the usage of HIVE and combining it with SQL Server implementations with future posts.

Monday, June 15, 2015

ETL Data Flow Architectures

ETLing is the way of getting data from one or more sources into your reporting environment or to a repository that holds enterprise-wide data for all reporting and analytical requirements. It is a common and well-known term in data warehousing because a data warehouse solution is impossible to exist without an ETL implementation.

ETL stands for extract, transform and loading, describing the data flow. Generally we extract data from sources, do validations on extracted data, and load the destination, most of the time, destination is a data warehouse. If the ETL solution is very small and less complex, data flow is always from sources to destination without any middle components. However if it is not, the architecture related to data flow is different and it includes additional components. The architecture of the data flow can be determined by considering following;
  • The number of data sources to be accessed.
  • The amount of data to be accessed and transferred in a single execution.
  • The type of loading, whether it is complete refresh or incremental loading.
  • Complexity of validations to be applied (transformations).
  • Data generation frequency at sources.
  • The extraction windows.
  • The accessibility of data sources.
Considering above factors, there are three possible ETL data flow architectures;

Single-stage ETL architecture
The single-stage ETL architecture is suitable for small and less complex data flows. With this architecture, transformation is done at the extraction and in-flight. 

Combining First Name, Middle Name and Last Name into Full Name at the extraction and finding duplicates of Customer Entity are some examples for transformations performed with this architecture.

Two-stage ETL architecture
This data flow architecture maintains an additional environment called Staging. It can be a database (in most cases), set of text files or even Excel files. The staging is used as a temporary container for extracted data and it is implemented based on few factors such as different data acquisition windows, identifying modifications done with previously loaded data and auditing purposes. For more info o staging refer my post: Do we need a staging database for warehousing projects?.

Transformation on extracted data can be done in several stages. It can be performed at the extraction from sources and staging, and in-flight between sources and staging, and between staging and destination.

Three-stage ETL architecture
This extends two-stage architecture by introducing another layer called Landing Zone

This helps to reduce the workload (including transformation at extraction) with the sources. We can extract data as available in the source without adding any complexities for extraction queries, minimizing time it takes for completing the extraction. Then possible transformation can be applied to Landing Zone or later stages. Sometime, sources pushes data (via reports) without letting us to connect with sources. In a scenario like that, Landing Zone is the place where all source files are placed.