Tuesday, July 21, 2015

How to create a Hive table and execute queries using HDInsight

I wrote a post on Hive (What is Hive, What is Hive Database, What is Hive Table?) discussing key elements of Hive. If you want to try with it, there are multiple ways of doing it, you can do it with a Hadoop cluster configured in your environment, using a sandbox provided by vendors, or using a cloud computing platform and infrastructure like Microsoft Azure or Amazon. This post speaks about how to use HDInsight that is Microsoft Hadoop Cloud Cluster, for performing Hive related operations.

First of all, you need to make sure that a Storage and HDInsight Cluster are created with your Azure account. This post explains how to do it: How to navigate HDInsight cluster easily: CloudXplorer.

Let's try to create a simple External table using a file that holds data like below;



Let's place this file in one of the HDFS location, in my case, I have created a folder called \MyFiles\CustomerSource and placed the file in that folder using CloudXplorer.


In order to create an External table and query, there are many ways of doing it. For implementations, Powershell, .NET or even SSIS can be used but for this, let's use the standard interface given with HDInsight: Query Console.


You need a user name and password for opening Query Console. If you have not given a specific user name when creating the cluster, your user name is admin. Password you have to use is the one you entered when creating the cluster. Once submitted, you should see the below page and should click on Hive Editor link.




Here is the code for creating an External table called customer. Note that complex data types such as map, array, struct have been used for handling data in the text file and no specific database is mentioned, hence table will be created on the default database. Last clause of the statement points to the location where we have data files.

create external table customer
(
 id int
 , name struct
 , telephone map
 , ranks array
)
row format delimited
fields terminated by '|'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile
location '/MyFiles/CustomerSource/';

Place the code in the Editor and click on Submit. Once submitted, job will be created, and you should see that Status of the job is getting changed from Initialize, Running, and Completed. Once completed, you can click on View Details link in the job and see details related to the execution.


Done. External table has been created. Now we should be able to query the file using the table created, of course this Schema-on-read. Let's execute a query like below;

select name.lname, telephone["mobile"], ranks[0]
from customer;


Click on View Details to see the result. If need, result can be downloaded too.


Let's see more practical example with Powershell and SSIS with future posts.

No comments: