Dinesh's Blog :::: Being Compiled ::::: Configuring HDInsight Hadoop Cluster and running a sample MapReduce Job

Saturday, September 26, 2015

Configuring HDInsight Hadoop Cluster and running a sample MapReduce Job

Apache Hadoop is an open source solution for distributed data processing and storage. It consists of a cluster of servers that hold data in a distributed file system named HDFS (Hadoop Distributed File System). It is widely used for handling Big Data, and many software vendors offer it with their platforms. Microsoft offers it as HDInsight.

HDInsight is a cloud-based distribution of Hadoop that is fully compliant with the Hadoop open source project. It is offered as a service in Microsoft Azure, and provisioning and decommissioning cluster nodes as required is not as difficult as it is with clusters managed on-premises.

This post explains how to create an HDInsight cluster with the new Azure portal and run the sample MapReduce job for testing, using the remote machine. If you need more details on Hadoop, HDFS, and the related project called Hive, read the following posts:

Why do we need Hadoop and What can we do with it? [Hadoop for SQL Developer]

Hadoop cluster and how it stores a file when deploying

What is Hive, What is Hive Database, What is Hive Table?

Let's start with the new Azure portal. Go to https://portal.azure.com and log in. Once logged in, open Storage Accounts (not Storage Accounts (classic)) and create a new storage account. An HDInsight cluster requires at least one storage account for HDFS. If required, more than one storage account can be attached to the cluster. The best practice is to create the storage account first and use it when the cluster is created. This lets us maintain the storage separately: we can create the cluster when it is required and delete it when it is not, without deleting the storage, so the data files placed in it are retained.
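Because the files live in the storage account rather than on the cluster nodes, they are addressed by storage account and container. Here is a minimal sketch in Python of how such an address is put together, assuming the wasb:// scheme that HDInsight uses for Azure Blob storage. The function name, account name, and container name are hypothetical and used only for illustration.

```python
def wasb_uri(account, container, path=""):
    """Build the wasb:// address of a file in a blob container.

    'account' and 'container' are placeholders for your own
    storage account and the container attached to the cluster.
    """
    # Drop a leading slash so the result has a single '/' after the host.
    return "wasb://{0}@{1}.blob.core.windows.net/{2}".format(
        container, account, path.lstrip("/")
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(wasb_uri("mystorage", "mycluster", "/example/data/input.txt"))
```

The address depends only on the storage account and container, not on any cluster, which is why the data stays reachable after the cluster is deleted.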

The next step is creating the cluster. Open HDInsight Clusters and create a new one using the storage account created. Note that the user name for accessing the cluster defaults to admin (it can be changed if you want), and that enabling Remote Desktop allows accessing the cluster using Remote Desktop Connection.

You can use the Node Pricing Tiers section to set up the nodes required for the cluster, deciding the tier for the Worker nodes and the Head node.

Generally, it takes about 10-15 minutes to set up the cluster. Once it is done, navigate to the cluster in the portal and connect to it using Remote Desktop Connection.

When the Remote Desktop session is open, HDFS commands can be used for navigating and manipulating folders and files, running MapReduce jobs, and executing commands for related projects such as Pig and Hive. Double-click the shortcut called Hadoop Command Line to open the command prompt.
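To make the idea of a MapReduce job concrete, here is a minimal in-memory sketch in plain Python. It assumes a word-count style job, which is a common Hadoop sample; whether it matches the sample job run on the cluster is an assumption. The sketch does not use Hadoop at all, and all function names are hypothetical. It only mimics the three stages of a job: map, shuffle, and reduce.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does
    between the map and reduce stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data on Hadoop", "Hadoop stores big files"]
    print(word_count(sample))
```

On a real cluster the map and reduce functions run in parallel on the worker nodes against blocks of files stored in HDFS, and the framework handles the shuffle. Here everything runs in a single process, only to show the flow of data.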