Dinesh's Blog :::: Being Compiled ::::: SQL Server 2016 features: Polybase: SQL Server connector to Hadoop

** The latest article on this based on RC3 is available as SQL Server 2016 - Manipulating data in Azure Storage using PolyBase.

Polybase is not a new product to Microsoft SQL Server product suite but it is new to many of us or it is not a well known component to developers as it was limited to one edition. Microsoft started adding Big Data related components to Microsoft platform long time back and as a result of it, HDInsight (Hadoop in Azure) and Analytical Platform System (Parallel Data Warehouse) appliance were introduced. Poybase was part of APS and it was limited to APS edition. In a way, it acted as a bridge between Hadoop and relational databases, allowing us to query files (all structured, semi-structured and unstructured) in the Hadoop using familiar language: SQL.

Good news is, now Microsoft has included this functionality to other editions as part of Microsoft SQL Server 2016. It allows you to process data in files (obviously large files) using SQL as they were tables in the database when files are stored in Azure Blob Storage or Hadoop.

As you know, SQL Server 2016 is still CTP and it is CTP 2.1 when this post is written. This post shows how to use Polybase when text files are stored in Azure Hadoop. Note that the code written is based on current version and implementation may be changed with future releases.

How do you start? First of all, you need SQL Server 2016 installed in your machine. Here are the steps for installing it: Installing SQL Server 2016 CTP 2.1 - Step by Step and Issues faced

Once installed, you need to enable hadoop connectivity using sp_configure. If you run sp_configure, you will see that zero is set with it.

Since I am going to access Hadoop configured in Azure, it needs a value like 4. For more information on values can be set, read this: https://msdn.microsoft.com/en-us/library/mt143174.aspx

Once set, restart SQL Server Service and Hadoop related services (shown in the image below). Note that those services are not visible via SQL Server Configuration Manager.

In order to test Polybase, we need a HDInsight cluster configured with Azure. If have already created one, make a note on storage name and primary key of it. If you have not created, have a look on this post for creating one and accessing it via CloudXplorer: How to navigate HDInsight cluster easily: CloudXplorer

Let's create a simple text file for testing and place it in HDInsight. I have created a simple text file (of course, it is too small and not a best example for Hadoop, but will use it for testing) and uploaded to a new folder called testfolder in HDInsight using CloudXplorer.

Here are the next steps:

1. Enable 4631 flag using TRACEON and create a Credential for accessing Azure storage.

-- create a database for testing
CREATE DATABASE TestPolybase;
GO

-- connect with it
USE TestPolybase;
GO

-- enable 4631
DBCC TRACEON(4631, -1);

-- create the master key
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Pa$$w0rd';

-- create credential for the azure storage
-- Note that even though this creates it
-- I had an error with CREATE EXTERNAL DATA SOURCE
-- hence, did not use this
-- changed core-site.xml instead

--CREATE CREDENTIAL Azure01DineshCredential
--WITH IDENTITY = 'azure01dineshhdistorage' -- storage name
-- , SECRET = 'abc23232njjj'; -- primary key

However, this did not work for me. CREATE CREDENTIAL allowed me to create it but I could not use it with CREATE EXTERNAL DATA SOURCE. It threw following error with it;

Msg 46516, Level 16, State 26, Line 16
The specified credential cannot be found

The reason for this could be, not adding credential to the database but to the server. However I could not add it to the database too;

Therefore, I added the credential manually to core-ste.xml file. File is located generally in:

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf

For more info on this, read this: https://msdn.microsoft.com/en-us/library/ms189522.aspx

2. Then we need to create an External Data Source. This sets the location we need to access for querying files. Note that I have not used CREDENTIAL keyword and LOCATION has been set with storage name and container name.

-- create the external data source
-- note that CREDENTIAL has not been used
CREATE EXTERNAL DATA SOURCE Azure01DineshHDIstorage
WITH
(
 TYPE = HADOOP,
 -- wasbs://storagename@containername.blob.core.windows.net/
 LOCATION = 'wasbs://azure01dineshhdinsight@azure01dineshhdistorage.blob.core.windows.net/'
 --, CREDENTIAL = Azure01DineshCredential
);

For more info on this, read this: https://msdn.microsoft.com/en-us/library/dn935022.aspx

3. Next step is, creating a File Format. This helps to understand the file we have stored when reading the file as a table.

-- create the file format required
CREATE EXTERNAL FILE FORMAT FileFormat
WITH
(
 FORMAT_TYPE = DELIMITEDTEXT
 , FORMAT_OPTIONS
  (
   FIELD_TERMINATOR = ','
  )
);

For more info on this, read: https://msdn.microsoft.com/en-us/library/dn935026.aspx

4. Now we need to create an External Table that references data stored in the location we specify. Here is the code for creating the table. Since I have stored the Log.txt file in testfolder created in HDInsight, Location is set with '/testfolder'. Data_Source is set with the one I created and File_Format is set with the format created. Reject_type and Reject_value indicate that ignore 10 invalid records.

-- create the table using file format created
-- and for the file uploaded
CREATE EXTERNAL TABLE LogData
(
 id int
 , userid int
 , nofseconds int
)
WITH
(
 LOCATION = '/testfolder'
 , DATA_SOURCE = Azure01DineshHDIstorage
 , FILE_FORMAT = FileFormat
 , REJECT_TYPE = VALUE
 , REJECT_VALUE = 10
);

For more info on this, read: https://msdn.microsoft.com/en-us/library/dn935021.aspx

All done. If you go through Object Explorer, you should see all we have created.