
Monday, May 23, 2016

Creating HDInsight Hadoop Cluster using SSIS and processing unstructured data using Hive Task - Azure Feature Pack - Part II

In my previous post, Creating HDInsight Hadoop Cluster using SSIS and processing unstructured data using Hive Task - Azure Feature Pack - Part I, I discussed how to prepare the environment for processing unstructured data using SSIS and explained the key requirements for it;
  • Integration Services Feature Pack for Azure
  • Microsoft Hive ODBC Driver
  • Self-signed certificate uploaded to the Azure subscription
Now let's see how we can create an SSIS package for handling the process. The assumption we made in Part I is that you have a file that contains unstructured data. Let's say it is something like below;


Let's talk about a simple process for testing this. The above file is the famous davinci.txt file from Project Gutenberg, commonly used for the well-known word count big data demo. So the assumption is that you have this file and you need to get a word count from it as part of your ETL process. In order to achieve this using SSIS with the help of HDInsight, the following have to be done;
  • Upload the file to Azure Storage
  • Create the Hadoop cluster on demand (you can keep a cluster permanently created if you are continuously using it)
  • Process the file using HiveQL for getting the word counts
  • Finally, read the result into a local database.
Let's start working on it. For uploading a file to Azure Storage, the Azure Blob Upload Task that comes with the Integration Services Feature Pack for Azure can be used. All it needs is a connection to the storage.

Create an SSIS project and add a package with a proper name. Drag the Azure Blob Upload Task and drop it onto the Control Flow. Open its editor and create a new connection. The New Connection dialog box requires the Storage account name and Account key. If you have a storage account created in your Azure subscription, access it and get the name and key1. If you do not have one, create it and then get them.



This is how you need to set the connection with SSIS.


In addition to the connection, you need to set the following items;
  • Blob container - Make sure you have a container created in your storage account. I use CloudXplorer for accessing my Azure storage and I can easily create containers and folders using it. You can do it with PowerShell, Visual Studio or any other third-party tool.
  • Blob directory - Destination. A folder inside the container. This folder is used for storing your file in Azure Storage.
  • Local directory - Source. The local folder where you keep the davinci.txt file.
Once all are set, the task is ready for uploading files.
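If you want to see roughly what this task does behind the scenes, here is a rough Python sketch using the azure-storage-blob package; the account name, key, container and local path are placeholders for your own values, and the blob folder matches the /inputfiles/ location the Hive script will read later.

# Rough scripted equivalent of the Azure Blob Upload Task
# (pip install azure-storage-blob). Account, key, container and paths
# are placeholders - replace them with your own values.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<your-storage-account>.blob.core.windows.net",
    credential="<your-account-key>",
)

# Blob container + Blob directory in the task = container + blob name prefix here.
blob = service.get_blob_client(container="<your-container>", blob="inputfiles/davinci.txt")

# Local directory in the task = the local folder holding davinci.txt.
with open(r"C:\Temp\davinci.txt", "rb") as data:
    blob.upload_blob(data, overwrite=True)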


The next step is adding the Azure HDInsight Create Cluster Task to the Control Flow. Drag and drop it, and open the editor for configuring it. This requires an Azure Subscription Connection, which has to be created with the following items;
  • Azure subscription ID - this can be seen under Settings when accessing the subscription via the Classic Portal (see Part I)
  • Management certificate thumbprint - this is the certificate we created in Part I and uploaded to Azure. Browse for it in the Local Machine location and the My store.

Once the connection is created, you need to set the other properties;
  • Azure Storage Connection - Use the same connection created for the upload task.
  • Location - Use the same location used for the storage account.
  • Cluster name - Set a unique name for this. This is your HDI cluster name.
  • Cluster size - Set the number of nodes you want for your cluster.
  • User name - Set the user name for the administrator of your HDI cluster.
  • Password - Set a complex password for the user.


The second task in the control flow is ready. The next task is for executing a HiveQL query for processing data. I have some posts written on Hive: What is Hive, What is Hive Database, What is Hive Table?, How to create a Hive table and execute queries using HDInsight. Have a look at them if you are new to Hive. The Azure HDInsight Hive Task is the one we have to use for processing data using HiveQL. Drag and drop it, and configure it as below.
  • Azure subscription connection - Use the same connection created for the above task.
  • HDInsight cluster name - Use the same name given with the previous task.
  • Local log folder - Set a local folder for saving log files. This is really important for troubleshooting.
  • Script - You can either set the HiveQL as an in-line script, or you can have your script in a file saved in storage and refer to it. I have added the query as an in-line script that does the following (a plain-Python illustration of the same computation appears after the script);
    • Create an external table called Words with one column called text.
    • Execute a query that aggregates data in Words and inserts the result into a table called WordCount.
DROP TABLE IF EXISTS Words;
CREATE EXTERNAL TABLE Words
(
  text STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/inputfiles/';

DROP TABLE IF EXISTS WordCount;
CREATE TABLE WordCount AS
SELECT word, COUNT(*)
FROM Words LATERAL VIEW explode(split(text, ' ')) lTable AS word
GROUP BY word;
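If the LATERAL VIEW explode pattern is new to you, here is that plain-Python illustration of the same computation on the local file: split every line on spaces and count how many times each word appears. It only illustrates what the query does; it is not part of the package.

# Plain-Python illustration of the HiveQL above: each line of davinci.txt
# is one row of Words; explode(split(text, ' ')) turns a line into words,
# and GROUP BY word with COUNT(*) counts them.
from collections import Counter

word_count = Counter()
with open("davinci.txt", encoding="utf-8", errors="ignore") as f:
    for line in f:
        word_count.update(line.rstrip("\n").split(" "))

# Equivalent of: SELECT word, COUNT(*) FROM ... GROUP BY word
for word, count in word_count.most_common(10):
    print(word, count)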


We have added a task for uploading the file (you can upload many files into the same folder), a task for creating the cluster, and a task for processing the data in the uploaded files. The next step is accessing the WordCount table and getting the result into the local environment. For this, you need a Data Flow task. Inside the Data Flow, have an ODBC data source for accessing the Hive table and a destination of your preference.

Let's configure the ODBC source. Drag and drop it and set the properties as below; a scripted equivalent of the read follows the list.

  • ODBC connection manager - Create a new connection using the Hive ODBC DSN created in Part I.
  • Data access mode - Select Table Name, as the HiveQL script stores the result set in a table called WordCount.
  • Name of the table or view - Select the WordCount table from the drop-down.
  • Columns - Make sure it detects the columns like below. Rename them as you want.
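Here is that scripted equivalent: the rows the ODBC source will read can be pulled with a few lines of pyodbc over the same DSN. The DSN name, user and password below are placeholders for the values from Part I and your cluster.

# Rough scripted equivalent of the ODBC source: read the WordCount table
# over the Hive ODBC DSN (pip install pyodbc). DSN name, user and password
# are placeholders. Hive has no transactions, hence autocommit=True.
import pyodbc

conn = pyodbc.connect("DSN=HDInsightHive;UID=admin;PWD=<your-password>", autocommit=True)
cursor = conn.cursor()
cursor.execute("SELECT * FROM WordCount")

for row in cursor.fetchmany(20):   # the first 20 rows are enough for a quick check
    print(row)

conn.close()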



Note that you cannot configure the source if the Hive table has not been created and populated. Therefore, before adding the Data Flow task, execute the first three control flow tasks that upload the file, create the cluster, and process and save the data into the table. Then configure the Data Flow task as above.

Add a Data Reader destination for testing this. You can add any type of transformation if you need to transform the data further and send it to any type of destination. Enable the Data Viewer to see the data.


Now you need to go back to the Control Flow and add one more task for deleting the cluster you created. For that, drag and drop the Azure HDInsight Delete Cluster Task and configure it just like the other Azure tasks.


That is all. Now if you run the package, it will upload the file, create the cluster, process the file, get the data into the local environment, and delete the cluster as the final task.



This is how you use SSIS for processing unstructured data with the support of HDI. You need to know that creating an HDI cluster on demand takes a long time (longer than I expected; I have already checked with experts and am waiting for a solution). Therefore, you may create the cluster and keep it running in Azure if the cost is not an issue.

Saturday, May 21, 2016

Creating HDInsight Hadoop Cluster using SSIS and processing unstructured data using Hive Task - Azure Feature Pack - Part I

A fully-fledged Business Intelligence system never ignores unstructured data. The reason is, you can never get true insight without considering, consuming and processing all types of data available in an organization. If you design a BI solution and you have both structured and unstructured data, how do you plan to process them?

Generally, processing unstructured data still belongs to the Hadoop ecosystem. It is designed for that, and it is always better to hand over the responsibility to Hadoop. We do BI using the Microsoft SQL Server product suite, and SQL Server Integration Services (SSIS) is the component we use for handling ETL. If there is an unstructured data set that needs to be processed as a part of the ETL process, how can you get support from Hadoop via SSIS for processing that data and getting it back as structured data? The solution is given with the Integration Services Feature Pack for Azure.

The Integration Services Feature Pack for Azure (download from here) provides functionality for connecting with Azure Storage and HDInsight, transferring data between Azure Storage and on-premises data sources, and processing data using HDInsight. Once installed, you see the newly added tasks in both the Control Flow and the Data Flow as below;


Assume that you have an unstructured file that needs to be processed, and once processed, you need to get the result back into your data warehouse. For implementing this using SSIS, you need to do the following;
  1. Install Integration Services Feature Pack for Azure (download from here)
  2. Install Microsoft Hive ODBC Driver (download from here)
  3. Generate a Self-Signed Certificate and upload to Azure Subscription.
Why do we need the Microsoft Hive ODBC Driver? We need it for connecting to Hive using ODBC. There are two drivers; one for 32-bit applications and the other for 64-bit applications. In order to use it with Visual Studio, you need the 32-bit driver. It is always better to have both installed, and when creating the DSN, create the same DSN in both the 32-bit and 64-bit System DSN. This is how you have to configure the DSN for Hive in Azure HDInsight.

Open the ODBC Data Source Administrator (32-bit) application and click the Add button in the System DSN tab.


Select the driver as Microsoft Hive ODBC Driver and configure it as below;



You can click on the Test button for testing the connection. If you have not created the cluster yet (in my case it does not exist yet; we will be creating it using SSIS), you will get an error, but you can still save and keep the DSN. Note that it is always better to create another DSN with the same name using the 64-bit ODBC Data Source Administrator.
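Once the cluster actually exists, you can also verify the DSN from code. Below is a minimal pyodbc sketch; the DSN name, user and password are placeholders for your own values.

# Minimal connectivity check for the Hive DSN (pip install pyodbc).
# DSN name, user and password are placeholders; this only works once
# the HDInsight cluster has been created.
import pyodbc

conn = pyodbc.connect("DSN=HDInsightHive;UID=admin;PWD=<your-password>", autocommit=True)
for row in conn.cursor().execute("SHOW TABLES"):
    print(row)
conn.close()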


The next step is generating the certificate and adding it to the Azure subscription. This is required for connecting to Azure from Visual Studio. The easiest way of doing this is creating the certificate using Internet Information Services (IIS), exporting it using certmgr, and uploading it using the Azure Classic Portal. Let's do it.

Open IIS Manager and go to Server Certificates. Click on Create Self-Signed Certificate for generating a certificate.


Once created, open Manage Computer Certificates and start the Export Wizard.


The wizard starts with the welcome page and then opens the Export Private Key page. Select the No, do not export the private key option and continue.


Select the Base-64 encoded X.509 (.CER) option on the Export File Format page and continue.


The next page asks for the file name for the certificate. Type a name along with the location where you want to save it, and complete the wizard.


The certificate is exported in the required format. Now you need to upload it to the Azure subscription. Open the Azure new portal and then open the Azure Classic Portal. Scroll down the items, find Settings, and click on it.


The Settings page has a Manage Certificate section. Click on it to open it and upload the certificate you created.
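As a side note, if you prefer to script the generation step instead of using IIS, here is a rough sketch using Python's cryptography package; it produces the same kind of Base-64 encoded X.509 (.CER) file that was exported above. The subject and file names are placeholders, and remember that SSIS still expects the certificate with its private key in the Local Machine / My store, which the IIS route handles for you.

# Rough sketch of generating an equivalent self-signed certificate with
# Python's "cryptography" package (pip install cryptography), as an
# alternative to the IIS route. Subject and file names are placeholders.
# Note: SSIS also needs the certificate WITH its private key in the
# Local Machine / My store (browsed by thumbprint), so the generated key
# pair would still have to be imported there; IIS does that automatically.
import datetime
from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa

key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, u"AzureMgmtCert")])

cert = (
    x509.CertificateBuilder()
    .subject_name(name)
    .issuer_name(name)                      # self-signed: issuer == subject
    .public_key(key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(datetime.datetime.utcnow())
    .not_valid_after(datetime.datetime.utcnow() + datetime.timedelta(days=365))
    .sign(key, hashes.SHA256())
)

# Base-64 encoded X.509 (.CER) file - the format uploaded to the Classic Portal.
with open("AzureMgmtCert.cer", "wb") as f:
    f.write(cert.public_bytes(serialization.Encoding.PEM))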


Everything needed by SSIS for working with Azure Storage and HDI is done. Now let's make the SSIS package to upload unstructured data, create an HDI cluster on demand, process the uploaded file using Hive, download the result, and finally remove the cluster because you do not need to pay Microsoft extra.

Since this post is a bit lengthy, let me make the next part a new post. Read it from the link below:
Creating HDInsight Hadoop Cluster using SSIS and processing unstructured data using Hive Task - Azure Feature Pack - Part II