Dinesh's Blog :::: Being Compiled ::::: HIVE

Showing posts with label HIVE.

Tuesday, September 12, 2017

Run Hive Queries using Visual Studio

Once an HDInsight cluster is configured, we generally use either the portal dashboard (powered by Ambari) or a tool like PuTTY to execute queries against the loaded data. Neither is really a developer tool, in other words an IDE, but we had to use them because there were not many options. Now, however, we can use the IDE we have been using for years to connect to HDInsight and execute various types of queries such as Hive, Pig and U-SQL. It is Visual Studio.

Let's see how we can use Visual Studio for accessing HDInsight.

Making Visual Studio ready for HDInsight

In order to work with HDInsight using Visual Studio, you need to install a few tools on Visual Studio. Here are the supported versions:

Visual Studio 2013 Community/Professional/Premium/Ultimate with Update 4
Visual Studio 2015 any edition
Visual Studio 2017 any edition

Make sure that the Azure SDK is installed for your version of Visual Studio. It can be installed through the Web Platform Installer.

This installs Microsoft Azure Data Lake Tools for Visual Studio as well; make sure it is installed.

Now your Visual Studio is ready for accessing HDInsight.

Connecting with HDInsight

The good thing is that you can connect to your cluster even without creating a project. However, once the SDK is installed, you will see new templates under Azure Data Lake: HIVE (HDInsight), Pig (HDInsight), Storm (HDInsight) and USQL (ADLA). The HIVE template can be used for creating a project.

The project creates one .hql file for you, which you can use for executing your Hive queries. In addition, you can open Server Explorer (View menu -> Server Explorer) and expand Azure (connecting to your Azure account first, if needed) to see all components related to Azure.

As you can see, all databases, internal and external tables, views and columns are listed. Not only that, by right-clicking the cluster you can open a window for writing a query or viewing jobs. Here is the screen when I use the first option, Write a Hive Query.

Did you notice IntelliSense? Yes, it works with almost all metadata, so it is really easy to write a query.

Executing Queries

If you need to see records in tables without limiting data with predicates or constructing the query with additional functions, you can simply right-click on the table in Server Explorer and select View top 100 Rows.
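For comparison, here is a minimal HiveQL sketch of what viewing the first 100 rows of a table amounts to. The exact statement Visual Studio issues is not shown in this post, so treat this as an approximation; mydatabase.mytable is a hypothetical name.

```sql
-- Roughly equivalent to viewing the top 100 rows of a table.
-- mydatabase.mytable is a hypothetical name; replace it with your own table.
SELECT *
FROM mydatabase.mytable
LIMIT 100;
```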

If you need to construct a query, use the method above to open a query window and write the query. There are two ways of executing the code: Batch and Interactive. Batch mode does not give you the result immediately, but you can view or download it once the submitted job has completed. Interactive mode shows the result in a way similar to SSMS.

If you use Batch mode, you can see how the job is being executed. Once the job is completed, you can click on Job Output to view or download the output.

As you can see, there is no graphical view of the job execution here. Visual Studio shows the job execution graphically only when the job is executed by the Tez engine. Remember, HDInsight uses the Tez engine for Hive queries by default, but simpler queries may be executed using the MapReduce engine.

See this query that has some computation;
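As an illustration only, a query with some computation, with the execution engine selected explicitly through the hive.execution.engine setting, could be sketched like this. The table and column names (sales, region, amount) are hypothetical and are not the ones used in this post.

```sql
-- Ask Hive to run the statements that follow on Tez (use mr for MapReduce).
SET hive.execution.engine=tez;

-- sales, region and amount are hypothetical names used only for illustration.
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;
```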

Can we create a table with this IDE?

Yes, it is possible. You can right-click on your database in Azure Server Explorer and select the Create table menu item.

Let's talk more about this in later posts.

Monday, May 23, 2016

Creating HDInsight Hadoop Cluster using SSIS and processing unstructured data using Hive Task - Azure Feature Pack - Part II

In my previous post, Creating HDInsight Hadoop Cluster using SSIS and processing unstructured data using Hive Task - Azure Feature Pack - Part I, I discussed how to prepare the environment for processing unstructured data using SSIS. There, I explained the key requirements for this:

Integration Services Feature Pack for Azure
Microsoft Hive ODBC Driver
Self-Signed certificate for adding to Azure

Now let's see how we can create an SSIS package for handling the process. The assumption we made in Part I is that you have a file that contains unstructured data. Let's say it is something like below;

Let's talk about a simple process for testing this. The above file is the well-known davinci.txt file, which comes from Project Gutenberg and is used in the famous word count big data demo. So the assumption is that you have this file and you need to get word counts from it as part of your ETL process. To achieve this using SSIS with the help of HDInsight, the following have to be done:

Upload the file to Azure Storage
Create the Hadoop Cluster on-demand (or keep one already created if you are using it continuously)
Process the file using HiveQL for getting the word counts
Finally, read the result into a local database.

Let's start working on it. For uploading a file to Azure Storage, the Azure Blob Upload Task that comes with the Integration Services Feature Pack for Azure can be used. All it needs is a connection to the storage.

Create an SSIS project and add a package with a proper name. Drag the Azure Blob Upload Task and drop it onto the Control Flow. Open its editor and create a new connection. The new connection dialog box requires the Storage account name and Account key. If you have a storage account in your Azure subscription, open it and get the name and key1. If you do not have one, create it and then get them.

This is how you need to set the connection with SSIS.

In addition to the connection, you need to set the following items:

Blob container - make sure you have a Container created in your storage. I use CloudXplorer for accessing my Azure storage and I can easily create containers and folders using it. You can do it with PowerShell, Visual Studio or any other third-party tool.
Blob directory - Destination. A folder inside the container. This folder is used for storing your file in Azure Storage.
Local directory - Source. Location of the folder where you keep the davinci.txt file.

Once all are set, the task is ready for uploading files.

The next step is adding the Azure HDInsight Create Cluster Task to the Control Flow. Drag and drop it, and open the editor to configure it. It requires an Azure Subscription Connection, which has to be created with the following items:

Azure subscription ID - this can be easily seen under Settings when accessing the subscription via the Classic Portal (see Part I)
The Management certificate thumbprint - this is what we created in Part I and uploaded to Azure. Browse for it in the Local Machine location and the My store.

Once the connection is created, you need to set other properties;

Azure Storage Connection - Use the same connection created for upload task.
Location - Use the same location used for the storage
Cluster name - Set a unique name for this. This is your HDI name
Cluster size - set number of nodes you want for your cluster
User name - set the user name of administrator of your HDI.
Password - set a complex password for the user.

The second task in the control flow is ready. The next task executes a HiveQL query for processing the data. I have written some posts on Hive: What is Hive, What is Hive Database, What is Hive Table? and How to create a Hive table and execute queries using HDInsight. Have a look at them if you are new to Hive. The Azure HDInsight Hive Task is the one we have to use for processing data using HiveQL. Drag and drop it, and configure it as below.

Azure subscription connection - Use the same connection created for above task.
HDInsight cluster name - Use the same name given with previous task.
Local log folder - Set a local folder for saving log files. This is really important for troubleshooting.
Script - You can either set the HiveQL as an in-line script or keep your script in a file saved in storage and refer to it. I have added the query as an in-line script that does the following:

Create an external table called Words with one column called text.
Execute a query that aggregates data in Words and insert the result to WordCount table.

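-- External table over the uploaded files; each line of a file becomes one value of the text column
DROP TABLE IF EXISTS Words;
CREATE EXTERNAL TABLE Words
(
 text string
) row format delimited
fields terminated by '\n'
stored as textfile
location '/inputfiles/';
-- Split each line on spaces, explode it into one row per word, and count the rows per word
DROP TABLE IF EXISTS WordCount;
CREATE TABLE WordCount AS
SELECT word, COUNT(*) FROM Words LATERAL VIEW explode(split(text, ' ')) lTable as word GROUP BY word;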
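If you prefer named columns and slightly cleaner tokens, a variant of the second statement could look like the sketch below. This is not the script used in this post; the column name occurrences, the alias token, the whitespace pattern and the lower-casing are all my assumptions for illustration.

```sql
-- A variant of the word count that names the count column,
-- splits on any run of whitespace, ignores empty tokens and
-- treats words case-insensitively. Names are illustrative only.
DROP TABLE IF EXISTS WordCount;
CREATE TABLE WordCount AS
SELECT lower(token) AS word, COUNT(*) AS occurrences
FROM Words
LATERAL VIEW explode(split(text, '\\s+')) lTable AS token
WHERE token <> ''
GROUP BY lower(token);
```

With named columns, the ODBC source later in the package would show word and occurrences instead of a generated column name.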

We have added a task for uploading the file (you can upload many files into the same folder), a task for creating the cluster and a task for processing the data in the uploaded files. The next step is accessing the WordCount table and getting the result into the local environment. For this, you need a Data Flow task. Inside the Data Flow, add an ODBC source for accessing the Hive table and a destination of your choice.

Let's configure the ODBC source. Drag and drop it, and set its properties as below.

ODBC connection manager - Create a new connection using the Hive ODBC DSN created in Part I.
Data access mode - Select Table Name, as the HiveQL stores the result set in a table called WordCount.
Name of the table or view - Select WordCount table from the drop-down.
Columns - Make sure it detects columns like below. Rename them as you want.

Note that you cannot configure the source if the Hive table has not been created and populated. Therefore, before adding the Data Flow task, execute the first three control flow tasks, which upload the file, create the cluster, process the data and save the result into the table. Then configure the Data Flow task as above.

Add a Data Reader destination for testing this. You can add any type of transformations if you need to transform data further and send to any type of destination. Enable Data Viewer for seeing data.

Now you need to go back to the Control Flow and add one more task for deleting the cluster you created. For that, drag and drop the Azure HDInsight Delete Cluster Task and configure it just as you configured the other Azure tasks.

That is all. Now if you run the package, it will upload the file, create the cluster, process the file, get the data into the local environment, and delete the cluster as the final task.

This is how you use SSIS for processing unstructured data with the support of HDI. You need to know that creating HDI on-demand takes a long time (longer than I expected; I have already checked with experts and am waiting for a solution). Therefore you may create the cluster and keep it in Azure if the cost is not an issue.

Saturday, May 21, 2016

Creating HDInsight Hadoop Cluster using SSIS and processing unstructured data using Hive Task - Azure Feature Pack - Part I

A fully-fledged Business Intelligence system never ignores unstructured data. The reason is that you can never get true insight without considering, consuming and processing all types of data available in an organization. If you design a BI solution and you have both structured and unstructured data, how do you plan to process them?

Generally, processing unstructured data still belongs to the Hadoop ecosystem. It is designed for that, and it is always better to hand over the responsibility to Hadoop. We do BI using the Microsoft SQL Server product suite, and SQL Server Integration Services (SSIS) is the component we use for handling ETL. If there is an unstructured data set that needs to be processed as part of the ETL process, how can you get support from Hadoop via SSIS for processing the unstructured data and getting it back as structured data? The solution comes with the Integration Services Feature Pack for Azure.

The Integration Services Feature Pack for Azure (download from here) provides functionality for connecting to Azure Storage and HDInsight, for transferring data between Azure Storage and on-premises data sources, and for processing data using HDInsight. Once installed, you see newly added tasks in both Control Flow and Data Flow as below;

Assume that you have an unstructured file that needs to be processed, and once it is processed you need to get the result back into your data warehouse. To implement this using SSIS, you need to do the following:

Install Integration Services Feature Pack for Azure (download from here)
Install Microsoft Hive ODBC Driver (download from here)
Generate a Self-Signed Certificate and upload to Azure Subscription.

Why do we need the Microsoft Hive ODBC Driver? We need it for connecting to Hive using ODBC. There are two drivers: one for 32-bit applications and the other for 64-bit applications. In order to use it with Visual Studio, you need the 32-bit driver. It is always better to have both installed and, when creating the DSN, to create the same DSN as both a 32-bit and a 64-bit System DSN. This is how you have to configure the DSN for Hive in Azure HDInsight.

Open the ODBC Data Source Administrator (32-bit) application and click the Add button in the System DSN tab.

Select the driver as Microsoft Hive ODBC Driver and configure it as below;

You can click on the Test button to test the connection. If you have not created the cluster yet (in my case it is not created, as we will be creating it using SSIS), you will get an error, but you can still save the DSN and keep it. Note that it is always better to create another DSN with the same name using the 64-bit ODBC Data Source Administrator.

The next step is generating the certificate and adding it to the Azure subscription. This is required for connecting to Azure using Visual Studio. The easiest way of doing this is to create the certificate using Internet Information Services (IIS), export it using certmgr and upload it using the Azure Classic Portal. Let's do it.

Open IIS Manager and go for Server Certificates. Click on Create Self-Signed Certificate for generating a certificate.

Once created, open Manage Computer Certificates and start Export Wizard.

The wizard starts with the welcome page and then opens the Export Private Key page. Select the No, do not export private key option and continue.

Select Base-64 Encoded X.509 (.CER) option in Export File Format and continue.

The next page asks you for the file name of the certificate. Type the file name along with the location where you want to save it, and complete the wizard.

The certificate is exported in the required format. Now you need to upload it to the Azure subscription. Open the new Azure Portal and then open the Azure Classic Portal. Scroll down the items, find Settings, and click on it.

The Settings page has a Manage Certificate section. Click on it to open it and upload the certificate created.

Everything SSIS needs for working with Azure Storage and HDI is done. Now let's make the SSIS package to upload unstructured data, create an HDI cluster on-demand, process the uploaded file using Hive, download the result, and finally remove the cluster, because you do not need to pay extra to Microsoft.

Since the post is a bit lengthy, let me make the next part a new post. Read it from the link below:
Creating HDInsight Hadoop Cluster using SSIS and processing unstructured data using Hive Task - Azure Feature Pack - Part II

Tuesday, July 21, 2015

How to create a Hive table and execute queries using HDInsight

I wrote a post on Hive (What is Hive, What is Hive Database, What is Hive Table?) discussing key elements of Hive. If you want to try it, there are multiple ways of doing so: with a Hadoop cluster configured in your environment, with a sandbox provided by vendors, or with a cloud computing platform and infrastructure like Microsoft Azure or Amazon. This post is about how to use HDInsight, Microsoft's Hadoop cluster in the cloud, for performing Hive-related operations.

First of all, you need to make sure that a Storage and HDInsight Cluster are created with your Azure account. This post explains how to do it: How to navigate HDInsight cluster easily: CloudXplorer.

Let's try to create a simple External table using a file that holds data like below;

Let's place this file in an HDFS location. In my case, I have created a folder called /MyFiles/CustomerSource and placed the file in that folder using CloudXplorer.

There are many ways of creating and querying an External table. For implementations, PowerShell, .NET or even SSIS can be used, but for this let's use the standard interface given with HDInsight: Query Console.

You need a user name and password to open Query Console. If you did not give a specific user name when creating the cluster, your user name is admin. The password is the one you entered when creating the cluster. Once submitted, you should see the page below; click on the Hive Editor link.

Here is the code for creating an External table called customer. Note that complex data types such as map, array and struct have been used for handling data in the text file. No specific database is mentioned, hence the table will be created in the default database. The last clause of the statement points to the location where we have the data files.

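-- NOTE: the type parameters of struct, map and array were lost from this listing.
-- fname and the element types below are assumptions; lname and the "mobile" key appear in the query that follows.
create external table customer
(
 id int
 , name struct<fname:string, lname:string>
 , telephone map<string, string>
 , ranks array<int>
)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile
location '/MyFiles/CustomerSource/';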

Place the code in the editor and click on Submit. Once submitted, a job will be created, and you should see the Status of the job changing from Initialize to Running and then Completed. Once completed, you can click on the View Details link in the job to see details related to the execution.

Done. The External table has been created. Now we should be able to query the file using the table created; of course, this is schema-on-read. Let's execute a query like below;

select name.lname, telephone["mobile"], ranks[0]
from customer;

Click on View Details to see the result. If needed, the result can be downloaded too.
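Beyond picking individual elements, Hive has built-in functions for working with complex types. The sketch below assumes the customer table from this post, with ranks as an array and telephone as a map; the column aliases are hypothetical.

```sql
-- size() returns the number of elements in the array,
-- map_keys() returns the keys of the map as an array.
select id
     , size(ranks) as rank_count
     , map_keys(telephone) as phone_types
from customer;

-- explode() with LATERAL VIEW produces one output row per element of ranks.
select id, r as rank_value
from customer
lateral view explode(ranks) rankTable as r;
```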

Let's see more practical examples with PowerShell and SSIS in future posts.

Friday, July 17, 2015

What is Hive, What is Hive Database, What is Hive Table?

When I go through the Apache Software Foundation, the most attractive and most relevant projects I see for me are Hadoop and Hive. Hadoop is an open source platform that offers highly optimized distributed processing and distributed storage and that can be configured with inexpensive infrastructure. Scalability is one of the key advantages of Hadoop: it can be started with a few servers and scaled out to thousands without any issue.

For more info on Hadoop:

Why do we need Hadoop and What can we do with it? [Hadoop for SQL Developer]

Hadoop cluster and how it stores a file when deploying

What is Schema-on-write and Schema-on-Read?

What is Hive now? Hive is a supporting project that was originally developed by Facebook as an abstraction layer on top of the MapReduce model. In order to understand Hive, MapReduce has to be understood.

MapReduce, one of the main components of Hadoop, is a solution for scaling data processing. This means that it helps to do parallel data processing using multiple machines in Hadoop (or HDFS). It is considered a framework, as it is used for writing programs for distributed data processing. MapReduce programs require two separate and distinct methods: Map and (optionally) Reduce. Functions for these can be written using languages like Java, Perl and Python.

When it comes to a complex MapReduce implementation, many who do data analysis, including database engineers, find it difficult because the languages to be used are not very familiar to them. Not only that, there are other constraints such as the time it takes to implement, complexity, and low reusability. This made the Facebook team implement Hive.

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a SQL-like language called HiveQL, allowing us to query data files in HDFS and process data using familiar SQL-like techniques. It converts our SQL commands to MapReduce jobs, hence we do not need to worry about the MapReduce implementation.
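One way to see this conversion for yourself is the EXPLAIN statement, which prints the execution plan Hive generates for a query instead of running it. A minimal sketch, where weblogs and its status column are hypothetical names used only for illustration:

```sql
-- Prints the stages Hive plans for this query rather than executing it.
-- weblogs and status are hypothetical names.
EXPLAIN
SELECT status, COUNT(*) AS hits
FROM weblogs
GROUP BY status;
```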

What is a Hive Database? Even though it appears to be a relational database like the ones we are familiar with, it is not. It is just a name that can be used for grouping a set of tables, or it can be considered a namespace (like the ones we use to group our classes and methods in .NET). When a Hive database is created, it creates a folder with the given name suffixed with .db. If the location is not specified, it will be created in the /hive/warehouse folder; otherwise the folder will be created in the given location in HDFS. For example, the following code will create a folder called sales.db inside the /hive/warehouse folder.

CREATE DATABASE sales;
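For the case where a location is given, a minimal sketch could look like this. The database name sales_archive and the path are hypothetical.

```sql
-- Creates the database with its folder at the given HDFS location
-- instead of the default warehouse folder. Names are illustrative only.
CREATE DATABASE IF NOT EXISTS sales_archive
COMMENT 'Archived sales data'
LOCATION '/data/sales_archive';

-- Shows the database name, comment and location.
DESCRIBE DATABASE sales_archive;
```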

What is a Hive Table? It uses a similar concept. When we create a table in a relational database management system, it creates a container for us with the constraints we added, such as columns, data types and rules, and allows us to add records that match those constraints. A Hive table does not create a container like that. It creates a schema on top of a data file we have placed in HDFS, or data files we are going to place, as it uses schema-on-read, not schema-on-write (read this: What is Schema-on-write and Schema-on-Read?).

Hive supports two types of tables: Hive Managed Tables and External Tables. A Hive Managed Table creates a sub folder inside the database folder, with a schema. Later we can place files into the folder; this is how the record-insert process works, though Hive does not offer interactive queries like INSERT, UPDATE, DELETE. Managed tables are maintained by Hive, so dropping the table will drop the files placed in it too.

An External Table helps us to create a schema for reading data in files. The Location clause, pointing to the folder that holds the data files, is required when an external non-partitioned table is created.

Here is an example for a managed table

USE sales;
CREATE TABLE customer
(
    customerid int
    , name string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Here is an example for an external table

CREATE EXTERNAL TABLE customer
(
    customerid int
    , name string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/xyz/abc/lmn';

Once the table is created, SQL-Like queries can be used for accessing data in files.
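To check which kind of table you have, and to see the difference between the two when dropping, something like the following sketch can be used. It assumes the customer table from the examples above; the staging path is hypothetical.

```sql
-- Shows table metadata, including the table type
-- (MANAGED_TABLE or EXTERNAL_TABLE) and the data location.
DESCRIBE FORMATTED customer;

-- Moves a file that is already in HDFS into the table's folder.
-- The path is a hypothetical example.
LOAD DATA INPATH '/staging/customers.txt' INTO TABLE customer;

-- For a managed table this removes both the schema and the files;
-- for an external table only the schema is removed and the files stay.
DROP TABLE customer;
```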

Let's see more on these with samples in future posts.

Tuesday, June 16, 2015

Different query patterns you see with HiveQL when comparing with TSQL

In the recent past... a casual conversation turned into a technical conversation...

Friend: It is a surprise to see an MVP studying hard on Open Source, anyway good to see you in the club :).

Me: This does not mean that I will be completely moving to Open Source, but I like to learn some additional things and get involved with a good implementation.

Friend: Okay, you have been studying HIVE for the last few weeks, tell me three types of queries you have not seen in SQL Server?

Me: Yes, I have seen a few. Let me recall three: you can interchange blocks in SELECT, like the FROM clause before SELECT; one CREATE TABLE statement for creating a table and loading data; and of course duplicating a table with CREATE TABLE LIKE :).

The conversation continued with more interesting topics related to the two platforms. I will make more posts on my usage of open source components such as Hadoop, Pig and HIVE for solutions I work on, but I thought to make a post on these three items.

There are no big differences between TSQL (or standard SQL) and HiveQL queries, but some are noticeable if you are an experienced TSQL developer. If you have used MySQL, you will not see much difference, as HiveQL offers similar patterns.

Let me elaborate on the three things I mentioned in my reply; they will surely encourage you to start studying HIVE if you have not started using it.

A SELECT statement in SQL Server always starts with the SELECT clause, followed by FROM;

SELECT Col1, Col2
FROM Table1
WHERE Col3 = 1;

HiveQL allows the same but it can be written like this too;

FROM Table1
SELECT Col1, Col2
WHERE Col3 = 1;
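One place where the FROM-first form becomes more than a matter of style is Hive's multi-table insert, where a single FROM clause feeds several inserts. A minimal sketch, assuming Target1 and Target2 are hypothetical existing tables with two columns matching Col1 and Col2:

```sql
-- The source is named once and each INSERT selects from it.
-- Target1 and Target2 are hypothetical tables.
FROM Table1
INSERT OVERWRITE TABLE Target1
  SELECT Col1, Col2 WHERE Col3 = 1
INSERT OVERWRITE TABLE Target2
  SELECT Col1, Col2 WHERE Col3 = 2;
```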

If we need to create a table in SQL Server and load data, there are two ways, each with pros and cons.

CREATE TABLE NewTable
(
 Col1 int
 , Col2 varchar
);

INSERT INTO NewTable
SELECT Col1, Col2
FROM OldTable;

-- or

SELECT Col1, Col2
INTO NewTable
FROM OldTable;

But HIVE offers an easy way of doing it.

CREATE TABLE NewTable
STORED AS RCFile
AS SELECT Col1, Col2 FROM OldTable;

When we need to duplicate a table structure, this is what we do with SQL Server;

SELECT Col1, Col2
INTO NewTable
FROM OldTable
WHERE 1=0;

and HiveQL facilitates this;

CREATE TABLE NewTable LIKE OldTable;
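Note that CREATE TABLE LIKE copies only the table definition, not the data. If the rows are needed as well, they can be copied in a separate step, as in this small sketch using the same NewTable and OldTable names:

```sql
-- Copies only the table definition; NewTable starts empty.
CREATE TABLE IF NOT EXISTS NewTable LIKE OldTable;

-- Optionally copy the rows as a separate step.
INSERT INTO TABLE NewTable
SELECT * FROM OldTable;
```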

If I am not mistaken, some of these are supported in APS. Not only that, HDInsight supports HIVE in all these ways. Let's try to understand the usage of HIVE, and combining it with SQL Server implementations, in future posts.