
Friday, July 17, 2015

What is Hive, What is Hive Database, What is Hive Table?

When I go through the Apache Software Foundation projects, the ones most attractive and relevant to me are Hadoop and Hive. Hadoop is an open-source platform that offers highly optimized distributed processing and distributed storage, and it can be configured on inexpensive infrastructure. Scalability is one of Hadoop's key advantages; a cluster can be started with a few servers and scaled out to thousands without any issue.

So what is Hive? Hive is a supporting project that was originally developed by Facebook as an abstraction layer on top of the MapReduce model. In order to understand Hive, MapReduce has to be understood first.

MapReduce, one of the main components of Hadoop, is a solution for scaling data processing. It helps to do parallel data processing using multiple machines over data stored in Hadoop (HDFS). It is considered a framework as it is used for writing programs for distributed data processing. A MapReduce program requires two separate and distinct functions: Map and, optionally, Reduce. These functions can be written using languages like Java, Perl and Python.

When it comes to a complex MapReduce implementation, many who do data analysis, including database engineers, find it difficult because the languages to be used are not familiar to them. In addition, there are other constraints such as the time it takes to implement, the complexity involved, and low re-usability. This is what made the Facebook team implement Hive.

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a SQL-like language called HiveQL, allowing us to query data files in HDFS and process data using familiar SQL-like techniques. Hive converts our SQL commands into MapReduce jobs, hence we do not need to worry about the MapReduce implementation.
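
As a quick illustration of that conversion, HiveQL's EXPLAIN command shows the plan, including the MapReduce stages, that a query would be compiled into. This is only a sketch and assumes a table named customer already exists:

-- EXPLAIN prints the execution plan Hive compiles the query into,
-- including the map and reduce stages, without running the query
EXPLAIN
SELECT name, COUNT(*) AS namecount
FROM customer
GROUP BY name;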

What is a Hive database? Even though it appears like a relational database we are familiar with, it is not. It is just a name that can be used for grouping a set of tables, or it can be considered a namespace (like we use namespaces to group classes and methods in .Net). When a Hive database is created, Hive creates a folder with the given name, suffixed with .db. If a location is not specified, the folder is created in the /hive/warehouse folder; otherwise it is created in the given location in HDFS. For example, the following code creates a folder called sales.db inside the /hive/warehouse folder.

CREATE DATABASE sales;
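
If the database folder needs to go somewhere else in HDFS, a LOCATION clause can be added. This is only a sketch; the database name and path below are illustrative.

-- create the database folder in a custom HDFS location
CREATE DATABASE sales_archive
LOCATION '/data/sales_archive';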


What is a Hive table? It uses a similar concept. When we create a table in a relational database management system, it creates a container for us with the constraints we added, like columns, data types and rules, and allows us to add records matching those constraints. A Hive table does not create a container like that; it creates a schema on top of a data file we have placed in HDFS, or data files we are supposed to place, as Hive uses schema-on-read, not schema-on-write (read this: What is Schema-on-write and Schema-on-Read?).

Hive supports two types of tables: Hive managed tables and external tables. Creating a managed table creates a sub folder inside the database folder along with the schema. Later we can place files into that folder; this is how the record-insert process works, though Hive does not offer interactive commands like INSERT, UPDATE and DELETE. Managed tables are maintained by Hive: dropping the table drops the files placed in its folder too.

An external table helps us to create a schema for reading data in files. A LOCATION clause is required when an external non-partitioned table is created, pointing to the folder that holds the data files.

Here is an example of a managed table:

USE sales;
CREATE TABLE customer
(
    customerid int
    , name string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE; 
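
Once the managed table exists, a data file can be placed into its folder. Here is a minimal sketch, assuming a pipe-delimited file called customer.txt has already been uploaded to HDFS (the path is illustrative):

-- move the file into the table's folder under the warehouse
LOAD DATA INPATH '/tmp/customer.txt' INTO TABLE customer;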

Here is an example of an external table:

CREATE EXTERNAL TABLE customer
(
    customerid int
    , name string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/xyz/abc/lmn'; 


Once the table is created, SQL-like queries can be used for accessing the data in the files.
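
For example, a query like the following can be run against the customer table; this is only a sketch and assumes data files have already been placed in the table's location.

USE sales;

SELECT customerid, name
FROM customer
WHERE customerid > 100;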

Let's see more on these with samples in future posts.

Saturday, July 11, 2015

Hadoop cluster and how it stores a file when deploying

Hadoop is no longer a new word; everyone knows it and everyone knows why we need it. During my last two presentations, I explained the Hadoop cluster and how it stores a file when a file is placed. Here is the image I used for explaining the Hadoop cluster.


A Hadoop cluster consists of a couple of components, with master nodes and slave nodes as the main ones. Master nodes are responsible for managing and coordinating services and tasks (e.g. using the Name Node), and slave nodes are responsible for storing and processing data, providing resources like CPU and memory.

Generally, Hadoop is configured on rack-based servers. On top of each rack, a network switch is configured for intra-rack communication, and another network switch handles communication between the rack switches and the client that runs Hadoop client-related software.

Hadoop uses HDFS for holding files. HDFS is responsible for breaking large files into smaller chunks (128 MB, configurable), placing them in different slave nodes and replicating them to provide high availability. Here is a video that shows how a file is distributed in the Hadoop Distributed File System;


Monday, July 6, 2015

SQL Server 2016 features: Polybase: SQL Server connector to Hadoop

** The latest article on this based on RC3 is available as SQL Server 2016 - Manipulating data in Azure Storage using PolyBase.

PolyBase is not a new product in the Microsoft SQL Server product suite, but it is new to many of us, or rather it is not a well-known component among developers as it was limited to one edition. Microsoft started adding Big Data related components to the Microsoft platform a long time back, and as a result, HDInsight (Hadoop in Azure) and the Analytics Platform System (Parallel Data Warehouse) appliance were introduced. PolyBase was part of APS and was limited to that edition. In a way, it acted as a bridge between Hadoop and relational databases, allowing us to query files (structured, semi-structured and unstructured) in Hadoop using a familiar language: SQL.

The good news is that Microsoft has now included this functionality in other editions as part of Microsoft SQL Server 2016. It allows you to process data in files (obviously large files) using SQL, as if they were tables in the database, when the files are stored in Azure Blob Storage or Hadoop.

As you know, SQL Server 2016 is still in CTP; it is CTP 2.1 at the time this post is written. This post shows how to use PolyBase when text files are stored in Hadoop in Azure (HDInsight). Note that the code written is based on the current version and the implementation may change with future releases.

How do you start? First of all, you need SQL Server 2016 installed on your machine. Here are the steps for installing it: Installing SQL Server 2016 CTP 2.1 - Step by Step and Issues faced.

Once installed, you need to enable Hadoop connectivity using sp_configure. If you run sp_configure, you will see that it is currently set to zero.



Since I am going to access Hadoop configured in Azure, it needs a value like 4. For more information on the values that can be set, read this: https://msdn.microsoft.com/en-us/library/mt143174.aspx
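
Here is a minimal sketch of checking and setting the option; 4 is the value used in this post for Azure storage.

-- show the current value of the option
EXEC sp_configure 'hadoop connectivity';

-- set it to 4 and apply the change
EXEC sp_configure 'hadoop connectivity', 4;
RECONFIGURE;
GO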

Once set, restart the SQL Server service and the Hadoop-related services (shown in the image below). Note that those services are not visible via SQL Server Configuration Manager.




In order to test PolyBase, we need an HDInsight cluster configured in Azure. If you have already created one, make a note of its storage name and primary key. If you have not created one, have a look at this post for creating one and accessing it via CloudXplorer: How to navigate HDInsight cluster easily: CloudXplorer

Let's create a simple text file for testing and place it in HDInsight. I have created a simple text file (of course, it is too small and not the best example for Hadoop, but we will use it for testing) and uploaded it to a new folder called testfolder in HDInsight using CloudXplorer.



Here are the next steps:

1. Enable trace flag 4631 using TRACEON and create a credential for accessing the Azure storage.

-- create a database for testing
CREATE DATABASE TestPolybase;
GO

-- connect with it
USE TestPolybase;
GO

-- enable 4631
DBCC TRACEON(4631, -1);

-- create the master key
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Pa$$w0rd';

-- create credential for the azure storage
-- Note that even though this creates it
-- I had an error with CREATE EXTERNAL DATA SOURCE
-- hence, did not use this
-- changed core-site.xml instead

--CREATE CREDENTIAL Azure01DineshCredential
--WITH IDENTITY = 'azure01dineshhdistorage' -- storage name
-- , SECRET = 'abc23232njjj'; -- primary key


However, this did not work for me. CREATE CREDENTIAL allowed me to create it, but I could not use it with CREATE EXTERNAL DATA SOURCE. It threw the following error;

Msg 46516, Level 16, State 26, Line 16
The specified credential cannot be found

The reason for this could be that the credential was added to the server, not to the database. However, I could not add it to the database either;



Therefore, I added the credential manually to the core-site.xml file. The file is generally located in:
 C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf 



For more info on this, read this: https://msdn.microsoft.com/en-us/library/ms189522.aspx


2. Then we need to create an external data source. This sets the location we need to access for querying files. Note that I have not used the CREDENTIAL option, and LOCATION has been set with the container name and storage name.

-- create the external data source
-- note that CREDENTIAL has not been used
CREATE EXTERNAL DATA SOURCE Azure01DineshHDIstorage
WITH
(
 TYPE = HADOOP,
 -- wasbs://containername@storagename.blob.core.windows.net/
 LOCATION = 'wasbs://azure01dineshhdinsight@azure01dineshhdistorage.blob.core.windows.net/'
 --, CREDENTIAL = Azure01DineshCredential
);


For more info on this, read this: https://msdn.microsoft.com/en-us/library/dn935022.aspx


3. The next step is creating a file format. This describes the format of the file we have stored, so that it can be read as a table.

-- create the file format required
CREATE EXTERNAL FILE FORMAT FileFormat
WITH
(
 FORMAT_TYPE = DELIMITEDTEXT
 , FORMAT_OPTIONS
  (
   FIELD_TERMINATOR = ','
  )
);


For more info on this, read: https://msdn.microsoft.com/en-us/library/dn935026.aspx


4. Now we need to create an external table that references the data stored in the location we specify. Here is the code for creating the table. Since I have stored the Log.txt file in the testfolder created in HDInsight, LOCATION is set to '/testfolder'. DATA_SOURCE is set to the data source I created and FILE_FORMAT to the format created. REJECT_TYPE and REJECT_VALUE indicate that up to 10 invalid records will be ignored.

-- create the table using file format created
-- and for the file uploaded
CREATE EXTERNAL TABLE LogData
(
 id int
 , userid int
 , nofseconds int
)
WITH
(
 LOCATION = '/testfolder'
 , DATA_SOURCE = Azure01DineshHDIstorage
 , FILE_FORMAT = FileFormat
 , REJECT_TYPE = VALUE
 , REJECT_VALUE = 10
);


For more info on this, read: https://msdn.microsoft.com/en-us/library/dn935021.aspx

All done. If you go through Object Explorer, you should see everything we have created.



Now we should be able to query the file like below;
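
A simple query along these lines should work against the LogData table; the aggregation shown is only an illustration.

-- query the external table as if it were a regular table;
-- PolyBase reads the file(s) in /testfolder behind the scenes
SELECT userid, SUM(nofseconds) AS totalseconds
FROM LogData
GROUP BY userid;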



Since this is still being implemented, we cannot test many things against Hadoop, but we can understand how it is going to be with future releases.

Friday, July 3, 2015

What is Schema-on-write and Schema-on-Read?

When subjects related to database management systems are discussed, the term schema-on-write is not widely used (or not popular), though that is the method we have been using for writing and holding data. This method forces the record to be matched with the defined schema before writing it into storage. For example, in order to hold customer data, we create a table in a traditional database with appropriate columns and fixed data types. When a record is inserted, the record has to be aligned with the defined columns (the defined structure), and the record gets validated (against the columns defined, their types and constraints) before being inserted into the table. This slows down the insert operation, but the consistency of the record is guaranteed. This method is called schema-on-write.
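
Here is a minimal sketch of schema-on-write with a traditional SQL table; the table and constraints below are only illustrative.

-- the schema (columns, types, constraints) is enforced at write time
CREATE TABLE Customer
(
    CustomerId int NOT NULL PRIMARY KEY
    , Name varchar(100) NOT NULL
    , CreditLimit decimal(10,2) CHECK (CreditLimit >= 0)
);

-- this insert succeeds because it matches the schema
INSERT INTO Customer (CustomerId, Name, CreditLimit) VALUES (1, 'Jane', 1000.00);

-- this insert fails validation: the negative value violates the CHECK constraint
INSERT INTO Customer (CustomerId, Name, CreditLimit) VALUES (2, 'John', -50.00);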



Schema-on-read is a method that matches data to the schema as it is read from storage. Traditional database management systems do not employ this, but new platforms like Hadoop use methods like this for processing data, specifically semi-structured and unstructured data. For example, if there is a dataset that holds some log records formatted as semi-structured or unstructured data, its content can be read with a schema defined as per the requirement. The requirement may need only a few elements of the content, and the schema defined for reading addresses only the required elements. There can be another requirement on the same content that needs to read it differently; a new schema is applied when reading. This slows down the read operation as the data being read has to be checked against the defined schema.
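
As an illustration of schema-on-read, two different schemas can be defined over the same files; this is only a sketch in HiveQL, and the folder path and columns are illustrative.

-- first requirement: read only the time and the message from the log files
CREATE EXTERNAL TABLE log_messages
(
    logtime string
    , message string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/logs/web';

-- second requirement: the same files read with a different schema
CREATE EXTERNAL TABLE log_users
(
    logtime string
    , userid int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/logs/web';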


Wednesday, July 1, 2015

Why do we need Hadoop and What can we do with it? [Hadoop for SQL Developer]

If someone comes and tells;

We have a large dataset, let's use Hadoop for processing this.

What would be your answer? If you are an IT enthusiast, the most common and expected answer would be: yes, why don't we? But is that the right way of using Hadoop? We love technology, but it does not mean that we always have to go for the latest or most popular one even when it is not the right one for the current problem. You may be able to solve the issue with a traditional database management system without implementing Hadoop, or you may implement Hadoop because there is no other option and it is the right way of addressing the problem. How do you decide?

Let's try to understand Hadoop first:

Apache Hadoop is a scalable and fault-tolerant open source framework that is highly optimized for distributed processing and storage, running on inexpensive (commodity) hardware.

Okay, why do we need it?

The simplest answer is that we can process a massive amount of data using Hadoop to see the insight in it, quickly and efficiently.

What else? Is it all about processing data? Generally, we can use Hadoop for the following;
  • Hadoop as an ETL platform
    A common requirement with Big Data ETL is processing a large amount of unstructured data and producing a structured result. This is not an easy operation with traditional DBMSs and ETL tools, so Hadoop can be used for processing an unstructured dataset and producing a structured dataset (see the sketch after this list).
  • Hadoop as an Exploration Engine
    Analysis requires complex logic applied to structured, semi-structured and unstructured data. Hadoop offers many tools for analyzing data efficiently, providing high performance for analysis as the data is stored in the Hadoop cluster.
  • Hadoop as a data storage
    Since scalability is part of Hadoop, storing a massive amount of data with Hadoop is beneficial. It automatically provides fault-tolerance and high availability. As data is replicated (three copies by default), read operations can access the best node even when one node is unavailable due to a fault. Not only that, when more space is required, more nodes can be added without any limitation, expanding the cluster.
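
Here is a minimal sketch of the ETL-style use mentioned above: raw delimited log files are exposed through a Hive external table and a cleaned, structured result is produced from them with HiveQL. The table names, columns and path are illustrative.

-- raw, loosely structured log files already placed in HDFS (path is illustrative)
CREATE EXTERNAL TABLE raw_weblogs
(
    logline string
)
STORED AS TEXTFILE
LOCATION '/staging/weblogs';

-- transform the raw lines into a structured, aggregated managed table
CREATE TABLE daily_visits
STORED AS TEXTFILE
AS
SELECT substr(logline, 1, 10) AS visitdate
    , count(*) AS visits
FROM raw_weblogs
GROUP BY substr(logline, 1, 10);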