Thursday, April 21, 2016

SQL Server 2016 - Manipulating data in Azure Storage using PolyBase

When PolyBase was opened up to SQL Server editions other than APS or PDW, I tested it with SQL Server 2016 CTP 2.1. It had some issues, but I was able to get it working with a few workarounds. I thought I would try the same with RC3, since it comes with many enhancements, and it works fine now. If you are interested in PolyBase, have a look at the post I wrote in June 2015: SQL Server 2016 features: Polybase: SQL Server connector to Hadoop.

What is PolyBase? It is a feature available with SQL Server 2016 that lets us use TSQL against data stored in Hadoop or Azure Blob Storage, allowing us to combine structured data with semi-structured data. Data warehousing benefits a lot from this, as it reduces the time spent on ETL and supports real-time analysis. It can also be used with OLTP databases, and for archiving data as well.

Currently this feature is available with SQL Server 2016 RC3. I have not come across an Edition Comparison for 2016 yet, hence I have no idea which Editions will support it in the final release. Note that there are a few prerequisites, such as .NET Framework 4.5 or later and the Oracle Java SE Runtime Environment. Make sure you have installed everything required before installing SQL Server and enabling PolyBase. You can read a post I wrote on installing SQL Server RC3: Installing SQL Server 2016 RC1 - Sharing issues.

Generally, we should install the PolyBase feature as a standalone PolyBase-enabled instance. If you want to make a collection of SQL Server instances work as a PolyBase Scale-Out Group, then select the PolyBase Scale-Out Group option during the installation.

PolyBase supports two Hadoop providers: Hortonworks Data Platform (HDP) and Cloudera CDH, on either Linux or Windows. In addition to that, it can connect with Azure Blob Storage too.

Let's start with a simple one. I have uploaded a file to my Azure Storage. This file holds a small number of CDR records; let's see how we can read these records using TSQL.



In order to access this file, we need to make sure PolyBase is configured properly. Make sure the two new services, SQL Server PolyBase Data Movement and SQL Server PolyBase Engine, are running. In addition to that, make sure that PolyBase is enabled and Hadoop Connectivity is set as below.


To understand the value that needs to be set for Hadoop Connectivity, read this.
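As a quick sketch, setting the connectivity level is done with sp_configure, followed by a restart of the instance and the two PolyBase services. The value 7 below is an assumption based on the connectivity options that cover Azure blob storage (WASB); pick the value that matches your environment from the link above.

-- Check the current 'hadoop connectivity' setting.
EXEC sp_configure @configname = 'hadoop connectivity';
GO

-- Set it to a level that covers Azure blob storage (WASB).
-- Note: the value 7 is an assumption; verify the correct option
-- for your provider using the documentation linked above.
EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
GO
RECONFIGURE;
GO
-- Restart SQL Server and the two PolyBase services afterwards.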

Let's create a database and configure the Data Source. Note that we need to create a Database Scoped Credential first in order to use it with the Data Source.

CREATE DATABASE Telco;
GO

USE Telco;
GO

-- Create a master key to encrypt credential secret.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Pa$$w0rd';

-- Create a database scoped credential (Azure Storage).
-- Enter your Azure Storage account's primary key as the secret.
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
 IDENTITY = 'AzureStorageUser'
 , SECRET = 'primary key';

-- create the external data source
CREATE EXTERNAL DATA SOURCE DinesqlStorage
WITH
(
 TYPE = HADOOP,
 LOCATION = 'wasbs://cdrcontainer@dinesqlstorage.blob.core.windows.net/'
 , CREDENTIAL = AzureStorageCredential
);
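If you want to confirm that the data source was registered, the sys.external_data_sources catalog view can be queried:

-- Optional check: list external data sources in the database.
SELECT name, type_desc, location
FROM sys.external_data_sources;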


The next step is creating a File Format for the file we are going to read. PolyBase supports Delimited Text, Hive RCFile, Hive ORC, and Parquet. Let's create the File Format for our file.

-- Create an external file format (delimited text file).
CREATE EXTERNAL FILE FORMAT CDRFileFormat WITH (
        FORMAT_TYPE = DELIMITEDTEXT, 
        FORMAT_OPTIONS (FIELD_TERMINATOR =';', 
                USE_TYPE_DEFAULT = False)
);
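For clarity, this is the shape of record the File Format expects: ten fields separated by semicolons, matching the External Table defined below. The values here are made up for illustration and are not from my actual file.

-- Example line from a CDR file (hypothetical values):
-- 94771234567;413020000000001;350000000000001;PRE;VOICE;M;94770000000;120;08:30:15;2016-04-01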


The last step is creating an External Table matching the uploaded file, using the File Format created above. Read more on it with CREATE EXTERNAL TABLE. Here is the code I used.

-- create the table using file format created
-- and for the file uploaded
CREATE EXTERNAL TABLE CDRData
(
 MSIDN nvarchar(100)
 , IMSI nvarchar(100)
 , IMEI nvarchar(100)
 , [Plan] nvarchar(10)
 , CALL_TYPE nvarchar(20)
 , CORRESP_TYPE nvarchar(10)
 , CORRESP_ISDN nvarchar(100)
 , Duration int
 , [Time] nvarchar(20)
 , [Date] nvarchar(20)
 )
WITH
(
 LOCATION = '/cdrfiles'
 , DATA_SOURCE = DinesqlStorage
 , FILE_FORMAT = CDRFileFormat
);

Now I can simply query data in my Azure Storage using the table I created. Not only that, I can join this dataset with my other tables and do more analysis. I will add more complex examples in the next set of posts.
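As a minimal example, the external table can be queried just like a local one. The aggregation below assumes Duration is recorded in seconds, which is typical for CDR data but not confirmed from the file itself.

-- Query the external table like any local table.
-- Assumption: Duration is in seconds (typical for CDRs).
SELECT [Plan], CALL_TYPE
 , COUNT(*) AS Calls
 , SUM(Duration) AS TotalDurationSeconds
FROM dbo.CDRData
GROUP BY [Plan], CALL_TYPE
ORDER BY Calls DESC;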

