
Monday, October 2, 2017

Introduction to Azure HDInsight

I have written a few posts on HDInsight and thought to make a series of videos on it as well. Here is the first one. In this video, I discuss what Hadoop is, what HDInsight is, the configuration of HDInsight, important settings related to HDInsight, and how to access the cluster using the Azure Portal and PuTTY.

Watch and see!


Tuesday, September 12, 2017

Run Hive Queries using Visual Studio

Once an HDInsight cluster is configured, we generally use either the portal dashboard (powered by Ambari) or a tool like PuTTY for executing queries against the loaded data. Although these are not exactly developer tools, in other words not an IDE, we had to use them because we did not have many options. However, now we can use the IDE we have been using for years for connecting with HDInsight and executing various types of queries such as Hive, Pig and U-SQL. It is Visual Studio.

Let's see how we can use Visual Studio for accessing HDInsight.

Making Visual Studio ready for HDInsight

In order to work with HDInsight using Visual Studio, you need to install a few tools on Visual Studio. Here are the supported versions;
  • Visual Studio 2013 Community/Professional/Premium/Ultimate with Update 4
  • Visual Studio 2015 any edition
  • Visual Studio 2017 any edition
You need to make sure that you have installed the Azure SDK on your Visual Studio. Click here for downloading the Web Platform Installer and make sure the following are installed;


This installs Microsoft Azure Data Lake Tools for Visual Studio as well; make sure it is installed.


Now your Visual Studio is ready for accessing HDInsight. 

Connecting with HDInsight

The good thing is, you can connect with your cluster even without creating a project. However, once the SDK is installed, you can see new templates under Azure Data Lake: HIVE (HDInsight), Pig (HDInsight), Storm (HDInsight) and USQL (ADLA), and the HIVE template can be used for creating a project.


The project creates one .hql file for you, and you can use it for executing your Hive queries. In addition to that, you can open Server Explorer (View menu -> Server Explorer) and expand Azure (or connect to your Azure account and then expand it) for seeing all components related to Azure.


As you see, you can see all databases, internal and external tables, views and columns. Not only that, by right-clicking the cluster, you can open a window for writing a query or viewing jobs. Here is the screen when I use the first option, which is Write a Hive Query.


Did you notice IntelliSense? Yes, it works with almost all metadata, hence it is really easy to write a query.

Executing Queries

If you need to see records in tables without limiting data with predicates or constructing the query with additional functions, you can simply right-click on the table in Server Explorer and select View top 100 Rows.
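This is essentially the same as running a limited SELECT yourself; a roughly equivalent query would be the following (a minimal sketch, and the table name cleanlog is only a placeholder):

SELECT * FROM cleanlog LIMIT 100;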


If you need to construct a query, then use the above method for opening a window and write the query. There are two ways of executing the code: Batch and Interactive. Batch mode does not give you the result immediately, but you will be able to see or download the result once the submitted job is completed. If you use Interactive, then it is similar to an SSMS result.


If you use the Batch mode, you can see the way the job is being executed. Once the job is completed, you can click on Job Output for seeing or downloading the output.


As you see, there is no graphical interface showing the job execution here. Visual Studio shows the job execution graphically only when the job is executed by the Tez engine. Remember, HDInsight generally uses the Tez engine to execute Hive queries, but simpler queries may be executed using the MapReduce engine instead.

See this query that has some computation;
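A query of that kind, one that involves an aggregation rather than a plain SELECT, could look like the sketch below. This is only a hypothetical example (the cleanlog table and its columns are assumptions), not the exact query shown in the screenshot:

SELECT logdate, COUNT(*) AS entries
FROM cleanlog
GROUP BY logdate;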


Can we create a table with this IDE?
Yes, it is possible. You can right-click on your database under Azure in Server Explorer and select the Create Table menu item.
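If you prefer to write the DDL yourself instead of using the menu item, a minimal CREATE TABLE statement such as the one below does the same job (a hypothetical sketch; the table and column names are assumptions):

CREATE TABLE sampletable
(
  id INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;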


Let's talk more about this in later posts.

Sunday, September 10, 2017

How to refer files in HDInsight - Azure Storage using different ways

If you have started working with Big Data, you surely need to check Microsoft's support for it via the Azure platform: the HDInsight service. HDInsight allows you to create a Hadoop environment within a few minutes, and it can be scaled out or in at any time based on your requirements. I have written a few posts on this; you can have a look at them using the following links;


In order to work with data loaded to HDInsight, or Hadoop, data files have to be referred to using a supported syntax. There are multiple ways of referring to files in the storage with HDFS. Here are the ways;

Fully qualified path with wasb(s) protocol

This is the most accurate and correct way of referring to files in the storage. Here is the pattern;

wasb://container_name@storage_account_name.blob.core.windows.net/file_path

Here is an example using PuTTY, connecting with HDInsight and reading a file (processed with Hive) that exists in the storage. My container name is dinesqlashdinsight and my storage account name is dinesqlasstorage. The file path is data/cleanlog/000000_0 (this is in fact a Hive table), so the fully qualified path becomes wasb://dinesqlashdinsight@dinesqlasstorage.blob.core.windows.net/data/cleanlog/000000_0.
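The same fully qualified syntax can be used wherever a path is expected, for example in the LOCATION clause of a Hive external table. Here is a hedged sketch that reuses the container, storage account and folder names above; the table and column names are assumptions:

CREATE EXTERNAL TABLE cleanlog_copy
(
  logentry STRING
)
STORED AS TEXTFILE
LOCATION 'wasb://dinesqlashdinsight@dinesqlasstorage.blob.core.windows.net/data/cleanlog/';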


Connecting with the default container

If your files are in the default container, you can skip the container name and storage name as follows;

wasb:///file_path

Note the three slashes; they are required when you do not mention the container name.
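For example, assuming the container above is the default container of the cluster, the external table sketched in the previous section could be declared with the shorter form:

CREATE EXTERNAL TABLE cleanlog_default
(
  logentry STRING
)
STORED AS TEXTFILE
LOCATION 'wasb:///data/cleanlog/';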


Connecting using Hadoop/Linux/Unix native ways

Generally, when you work with Hadoop using Linux/Unix, you refer to files without the protocol. Azure HDInsight supports the same, and we can refer to files using that syntax.

file_path


Do I need double quotes for my paths?

They are required when you have odd characters, like the equal (=) sign, in your path. See the example below. I try to read a data file that exists in the cluster, and the path has equal signs, hence the path is enclosed in double quotes.



Wednesday, September 6, 2017

How to Connect HDInsight using Putty

I have written many articles on the HDInsight service, which is an Azure service that allows us to provision an Apache Hadoop cluster and use it for various types of big-data-related implementations. If you need to see all my articles written on HDInsight, please use the link: http://dinesql.blogspot.com/search?q=hdinsight.

Although I have mainly used the Azure portal (Ambari dashboard) to demonstrate HDInsight, many use a tool called PuTTY, which is a free implementation of SSH and Telnet for Windows and Unix platforms. It is a good tool to work with, specifically if you need to connect to the cluster remotely and work with it for navigating nodes or doing some data work with sub-projects like Hive against the cluster. Remember, if you have configured your HDInsight cluster as a Linux-based cluster, you need an SSH client to connect with it remotely. Let's see the way of using it.

First of all, you need to download PuTTY in order to use it, as Windows machines do not have it as a built-in tool (Linux and Mac OS computers have an SSH client built in). Here is the link for downloading PuTTY.


There is no installation; you can just use putty.exe. However, you need a host name to connect. If you go to the portal and open your cluster blade, you can get the host name for connecting with an SSH client by clicking the Secure Shell (SSH) button.


Note that there are two: one for Windows and another for Linux, Unix and OS X users. Make sure that you copy the Windows one.


When you open PuTTY, you should see an input box called Host Name for entering the copied host name. Enter the copied value and make sure that the Connection Type is selected as SSH.

All you have to do now is click Open. This opens the standard window with the standard font and font size. If you need a larger font, you can go to Window - Appearance and change the settings as you want.

When you open this for the first time, you will get a security warning as below;


This says that we do not have the key related to the host cached locally, hence PuTTY cannot be sure whether we are connecting to the right one. Since we know that we are connecting to the right host, we can simply click Yes and continue.

Once connected, you will be prompted for a user id and password. Note that you need to use the SSH user and password you configured when the cluster was created. It does not accept the HTTP user.


Enter the SSH user and the password, and you will be connected to the default node.


Everything is ready now. As you see, the first part of the last line is the connected user and hn0 is one of the head nodes. The last part of it shows the cluster name, but we see it only partially as it is lengthy. Now I can navigate, make folders, move files or connect with Hive without any issue.





Monday, March 6, 2017

Creating HDInsight Hadoop Cluster with Integration Services (SSIS) Azure HDInsight Create Cluster Task

I have written two posts on HDInsight + SSIS based on the Azure Feature Pack, and they speak about how to create an HDInsight cluster and process semi-structured data using Hive. You can read them at;

With the latest release of the Azure Feature Pack, some of the settings have been changed, and creating a cluster is easier with the new task than with the previous one. This post discusses the way of configuring the Azure HDInsight Create Cluster Task.

In order to create an HDInsight Hadoop cluster using SSIS, you need to make sure that the following steps have been done;
  1. Azure Feature Pack is installed.
  2. Azure Storage is created and you have the name and key of it.
  3. Azure Active Directory Application is created and linked with the subscription used.
Since the installation of the Azure Feature Pack is straightforward, we do not need to discuss it here. Again, creating a storage account is a very common task with almost all Azure-related implementations, hence it is not going to be discussed either. Let's start with the Active Directory Application creation.

Creating an Active Directory Application and linking it to the subscription
Here are the steps for creating it.
1. Log in to the portal and open the Azure Active Directory blade.
2. Click on App Registration and click Add.


3. In the Create blade, enter the name of the app and select the Application Type as Web app / API. Enter the Sign-on URL as you want; it can be changed later as you need, hence enter something and continue even though the domain is not registered with Azure.


4. Once the application is created, you need to take a copy of the Application ID.


5. You need to create an Application Key and take a copy of it too. Go to Keys under Settings and create one. Note the message that appears: it is better to get a copy of the key before closing the blade.


6. You need the Tenant Id for creating the cluster. Go to Properties of the Active Directory blade and get the Directory Id; this is the Tenant Id.


7. You need two more values to be copied and kept. Go to the Subscriptions blade and take a copy of your Subscription Id. The other one is the resource group used; take a copy of its name too. You need to make sure everything is under one resource group.

8. Next is assigning or linking the created application to your subscription. Open the Subscriptions blade and select the subscription you need to use. Then select Access Control (IAM) and click on the Add button.


9. In the Add Access blade, select the role you need to assign to the application. You need to make sure that you select the right role for this.


10. In the Add Users blade, search for the application name and select it.



At the end of these steps, you have;
  1. Application Id
  2. Application Key
  3. Tenant Id
  4. Subscription Id
  5. Resource Group Name
Let's start with the SSIS package now.

Creating an SSIS Package and configuring the Azure HDInsight Create Cluster Task
Here are the steps for doing it.

1. Create an SSIS project and a package. Drag and drop the Azure HDInsight Create Cluster Task onto the Control Flow and open the properties of the task.


2. In the task properties, select <New Connection...> for AzureResourceManagerConnection. Create a connection using the copied items: Application Id, Application Key, and Tenant Id.


3. Select <New Connection...> for AzureStorageConnection. Enter the storage account name and account key that will be used by the cluster.


4. Next, fill in the following;
  • Subscription Id - enter the copied one
  • ResourceGroup - enter the resource group used for the storage.
  • Location - Select the location used for creating the storage. Everything should be in the same location.
  • ClusterName - enter the name of the cluster you need.
  • ClusterSize - enter the number of Worker Nodes you need in your cluster.
  • BlobContainer - enter the name of the Container that should be created for holding cluster files in the storage.
  • Username and Password - enter the user name and password for the cluster.
  • SshUserName and Password - enter the user name and the password for accessing the cluster remotely.
Everything required is done. If you execute the package or task now, it will create the cluster for you.


Generally it takes about 10-15 minutes. You can open the portal and verify it once it is created.


Remember, once this is created, you will be charged whether you use it or not. Therefore, it is better to delete the cluster once the required operation has been completed. You can use the Azure HDInsight Delete Cluster Task for that. Here are its settings.



Saturday, February 11, 2017

Azure Feature Pack for Integration Services (SSIS) - New Version Available

Microsoft has released the latest version of the Azure Feature Pack for Integration Services (SSIS), and it is available for download. It was released on 28-Dec-2016, and there are a few additional components compared with the previous version.

Why do we need this? Simply to work with Big Data as part of our ETL solution with the support of the Azure Big Data platform, which is HDInsight and its supporting projects.

I wrote a few posts on the previous version of this; you can read more about them at;

The above posts explain how to download and install the feature pack, configure tasks for accessing HDInsight and process some unstructured data using tasks given with SSIS.

You can get the latest version of the Azure Feature Pack from https://msdn.microsoft.com/en-us/library/mt146770.aspx

I noticed one new Control Flow task, Azure SQL DW Upload Task, and two new Data Flow items, Azure Data Lake Store Source and Azure Data Lake Store Destination, that were not available with the previous version.

Previous Version:


Latest Version:


I need to test these new items; they must be really interesting, and I will surely share my findings via blog posts.

Monday, May 23, 2016

Creating HDInsight Hadoop Cluster using SSIS and processing unstructured data using Hive Task - Azure Feature Pack - Part II

In my previous post, Creating HDInsight Hadoop Cluster using SSIS and processing unstructured data using Hive Task - Azure Feature Pack - Part I, I discussed how to prepare the environment for processing unstructured data using SSIS. In that post, I explained the key requirements for this;
  • Integration Services Feature Pack for Azure
  • Microsoft Hive ODBC Driver
  • A self-signed certificate for connecting to Azure
Now let's see how we can create an SSIS package for handling the process. The assumption we made in Part I is that you have a file that contains unstructured data. Let's say it is something like below;


Let's talk about a simple process for testing this. The above file is the famous davinci.txt, which comes from Project Gutenberg and is used for the famous word-count big data demo. So the assumption is, you have this file and you need to get a word count from it as part of your ETL process. In order to achieve this using SSIS with the help of HDInsight, the following has to be done;
  • Upload the file to Azure Storage
  • Create the Hadoop Cluster on-demand (you can have it created if you are continuously using it)
  • Process the file using HiveQL for getting the word counts
  • Finally, read the result into a local database.
Let's start working on it. For uploading a file to Azure Storage, the Azure Blob Upload Task that comes with the Integration Services Feature Pack for Azure can be used. All it needs is a connection to the storage.

Create an SSIS project and a package with a proper name. Drag the Azure Blob Upload Task and drop it onto the Control Flow. Open its editor and create a new connection. The new connection dialog box requires the storage account name and account key. If you have a storage account created in your Azure subscription, then access it and get the name and key1. If you do not have one, create it and then get them.



This is how you need to set the connection with SSIS.


In addition to the connection, you need to set the following items with it;
  • Blob container - make sure you have a Container created in your storage. I use CloudXplorer for accessing my Azure storage and I can easily create containers and folders using it. You can do it with PowerShell, Visual Studio or any other third-party tool.
  • Blob directory - Destination. A folder inside the container. This folder is used for storing your file in Azure Storage.
  • Local directory - Source. The location where you keep the davinci.txt file.
Once all are set, the task is ready for uploading files.


The next step is adding the Azure HDInsight Create Cluster Task onto the Control Flow. Drag and drop it, and open the editor for configuring it. This requires an Azure Subscription Connection, which has to be created with the following items;
  • Azure subscription ID - this can be easily seen with Settings when accessing the subscription via Classic Portal (see Part 1)
  • The management certificate thumbprint - this is what we created in Part I and uploaded to Azure. It should be browsed from the Local Machine location and the My store.

Once the connection is created, you need to set other properties;
  • Azure Storage Connection - Use the same connection created for the upload task.
  • Location - Use the same location used for the storage
  • Cluster name - Set a unique name for this. This is your HDI name
  • Cluster size - set number of nodes you want for your cluster
  • User name - set the user name of the administrator of your HDI.
  • Password - set a complex password for the user.


The second task in the control flow is ready. The next task is for executing a HiveQL query for processing data. I have some posts written on Hive: What is Hive, What is Hive Database, What is Hive Table?, How to create a Hive table and execute queries using HDInsight. Have a look at them if you are new to Hive. The Azure HDInsight Hive Task is the one we have to use for processing data using HiveQL. Drag and drop it, and configure it like below.
  • Azure subscription connection - Use the same connection created for above task.
  • HDInsight cluster name - Use the same name given with previous task.
  • Local log folder - Set a local folder for saving log files. This is really important for troubleshooting.
  • Script - You can either set the HiveQL as an in-line script, or you can have your script in a file saved in storage and refer to it. I have added the query as an in-line script that does the following;
    • Create an external table called Words with one column called text.
    • Execute a query that aggregates data in Words and inserts the result into the WordCount table.
DROP TABLE IF EXISTS Words;
CREATE EXTERNAL TABLE Words
(
  text STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/inputfiles/';

DROP TABLE IF EXISTS WordCount;
CREATE TABLE WordCount AS
SELECT word, COUNT(*)
FROM Words LATERAL VIEW explode(split(text, ' ')) lTable AS word
GROUP BY word;
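Once the task has completed, you can quickly verify the aggregated result with a simple query such as the one below (a minimal sketch; note that the COUNT(*) column gets a default name such as _c1 because the CTAS above does not alias it):

SELECT * FROM WordCount LIMIT 20;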


We have added a task for uploading the file (you can upload many files into the same folder), a task for creating the cluster and a task for processing data in the added files. The next step is accessing the WordCount table and getting the result into the local environment. For this, you need a Data Flow task. Inside the Data Flow, have an ODBC source for accessing the Hive table and a destination as you prefer.

Let's configure the ODBC source. Drag and drop it, and set the properties as below.

  • ODBC connection manager - Create a new connection using the Hive ODBC DSN created in Part I.
  • Data access mode - Select Table Name, as the HiveQL stores the result set into a table called WordCount.
  • Name of the table or view - Select WordCount table from the drop-down.
  • Columns - Make sure it detects columns like below. Rename them as you want.



Note that you cannot configure the source if you have not created and populated the Hive table. Therefore, before adding the Data Flow task, execute the first three control flow tasks that upload the file, create the cluster, and process and save data into the table. Then configure the Data Flow task as above.

Add a Data Reader destination for testing this. You can add any type of transformation if you need to transform the data further and send it to any type of destination. Enable the Data Viewer for seeing the data.


Now you need to go back to the Control Flow and add one more task for deleting the cluster you created. For that, drag and drop the Azure HDInsight Delete Cluster Task and configure it just like the other Azure tasks.


That is all. Now if you run the package, it will upload the file, create the cluster, process the file, get the data into the local environment, and delete the cluster as the final task.



This is how you use SSIS for processing unstructured data with the support of HDI. You need to know that creating an HDI cluster on demand takes a longer time than I expected (I have already checked with experts and am waiting for a solution). Therefore, you may create the cluster and keep it in Azure if the cost is not an issue.

Saturday, May 21, 2016

Creating HDInsight Hadoop Cluster using SSIS and processing unstructured data using Hive Task - Azure Feature Pack - Part I

A fully-fledged Business Intelligence system never ignores unstructured data. The reason is, you can never get true insight without considering, consuming and processing all types of data available in an organization. If you design a BI solution and you have both structured and unstructured data, how do you plan to process them?

Generally, processing unstructured data still belongs to the Hadoop ecosystem. It is designed for that, and it is always better to hand over the responsibility to Hadoop. We do BI using the Microsoft SQL Server product suite, and SQL Server Integration Services (SSIS) is the component we use for handling ETL. If there is an unstructured data set that needs to be processed as part of the ETL process, how can you get support from Hadoop via SSIS for processing the unstructured data and getting it back as structured data? The solution is given with the Integration Services Feature Pack for Azure.

The Integration Services Feature Pack for Azure (download from here) provides us with functionality for connecting with Azure Storage and HDInsight, transferring data between Azure Storage and on-premises data sources, and processing data using HDInsight. Once installed, you see newly added tasks in both the Control Flow and the Data Flow as below;


Assume that you have a file which is unstructured and needs to be processed, and once processed, you need to get the result back into your data warehouse. For implementing this using SSIS, you need to do the following;
  1. Install Integration Services Feature Pack for Azure (download from here)
  2. Install Microsoft Hive ODBC Driver (download from here)
  3. Generate a self-signed certificate and upload it to the Azure subscription.
Why do we need the Microsoft Hive ODBC Driver? We need it for connecting with Hive using ODBC. There are two drivers: one for 32-bit applications and the other for 64-bit applications. In order to use it with Visual Studio, you need the 32-bit driver. It is always better to have both installed, and when creating the DSN, create the same DSN in both the 32-bit and 64-bit System DSN. This is how you have to configure the DSN for Hive in Azure HDInsight.

Open the ODBC Data Source Administrator (32-bit) application and click the Add button in the System DSN tab.


Select the driver as Microsoft Hive ODBC Driver and configure it as below;



You can click on the Test button for testing the connection. If you have not configured the cluster yet (in my case, it is not configured; we will be creating it using SSIS), you will get an error, but you can still save and keep it. Note that it is always better to create another DSN using the 64-bit ODBC Data Source Administrator with the same name.


The next step is generating the certificate and adding it to the Azure subscription. This is required for connecting to Azure using Visual Studio. The easiest way of doing this is creating the certificate using Internet Information Services (IIS), exporting it using certmgr, and uploading it using the Azure Classic Portal. Let's do it.

Open IIS Manager and go to Server Certificates. Click on Create Self-Signed Certificate for generating a certificate.


Once created, open Manage Computer Certificates and start the Export Wizard.


The wizard starts with a welcome page and then opens the Export Private Key page. Select the No, do not export the private key option and continue.


Select the Base-64 encoded X.509 (.CER) option in Export File Format and continue.


The next page asks you for the file name of the certificate. Type the name along with the location where you need to save it and complete the wizard.


The certificate is exported in the required format. Now you need to upload it to the Azure subscription. Open the Azure new portal and then open the Azure Classic Portal. Scroll down the items, find Settings, and click on it.


The Settings page has a Manage Certificates section. Click on it to open it and upload the created certificate.


Everything needed by SSIS for working with Azure Storage and HDI is done. Now let's make the SSIS package to upload the unstructured data, create an HDI cluster on demand, process the uploaded file using Hive, download the result, and then finally remove the cluster because you do not need to pay extra to Microsoft.

Since the post is a bit lengthy, let me make the next part a new post. Read it from the link below:
Creating HDInsight Hadoop Cluster using SSIS and processing unstructured data using Hive Task - Azure Feature Pack - Part II