When I browse the Apache Software Foundation projects, the ones I find most attractive and most relevant to me are Hadoop and Hive. Hadoop is an open-source platform that offers highly optimized distributed processing and distributed storage, and it can be configured on inexpensive infrastructure. Scalability is one of Hadoop's key advantages: a cluster can start with a handful of servers and scale out to thousands without any issue.
What is Hive, then? Hive is a supporting project, originally developed by Facebook, that acts as an abstraction layer on top of the MapReduce model. In order to understand Hive, MapReduce has to be understood first.
MapReduce, one of the main components of Hadoop, is a solution for scaling data processing. It performs parallel data processing across multiple machines, over data stored in HDFS. It is considered a framework because it is used for writing programs for distributed data processing. A MapReduce program requires two separate and distinct functions: Map and, optionally, Reduce. These functions can be written in languages such as Java, Perl, and Python.
When it comes to a complex MapReduce implementation, many who do data analysis, including database engineers, find it difficult because the languages involved are not familiar to them. On top of that, there are other constraints: implementation takes time, the code is complex, and there is little re-usability. This led the Facebook team to implement Hive.
Hive is a data warehouse infrastructure built on top of Hadoop. It provides an SQL-like language called HiveQL that allows us to query data files in HDFS using familiar SQL-like techniques. Hive converts our SQL commands into MapReduce jobs, so we do not need to worry about the MapReduce implementation ourselves.
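For example, a simple aggregate like the one below (the table and column names here are just for illustration) is translated by Hive into one or more MapReduce jobs behind the scenes:

```sql
-- Hive compiles this GROUP BY into a MapReduce job:
-- the map phase emits (country, 1) pairs from the data files,
-- and the reduce phase sums the counts for each country.
SELECT country, COUNT(*) AS customer_count
FROM customer
GROUP BY country;
```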
What is a Hive database? Even though it appears like the relational databases we are familiar with, it is not one. It is just a name that can be used for grouping a set of tables, or it can be considered a namespace (like the namespaces we use to group classes and methods in .NET). When a Hive database is created, Hive creates a folder with the given name suffixed with .db. If no location is specified, the folder is created under /user/hive/warehouse; otherwise it is created in the given HDFS location. For example, the following code creates a folder called sales.db inside the /user/hive/warehouse folder.
CREATE DATABASE sales;
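If we want the database folder created somewhere other than the default warehouse path, a LOCATION clause can be added (the HDFS path below is just an illustration):

```sql
-- creates the sales.db folder under the given HDFS path
-- instead of the default warehouse folder
CREATE DATABASE sales
LOCATION '/data/warehouse/sales.db';
```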
What is a Hive table? It uses a similar concept. When we create a table in a relational database management system, it creates a container for us with the constraints we add, such as columns, data types, and rules, and then allows us to insert records matching those constraints. A Hive table does not create such a container; instead, it creates a schema on top of a data file we have already placed in HDFS, or data files we are going to place there, because Hive uses schema-on-read, not schema-on-write (read this: What is Schema-on-write and Schema-on-Read?).
Hive supports two types of tables: the Hive managed table and the external table. Creating a managed table creates a subfolder inside the database folder along with the schema. Later we can place files into that folder; this is how the record-insert process works, even though Hive does not offer interactive row-level commands like INSERT, UPDATE, and DELETE. Managed tables are maintained by Hive, so dropping the table drops the files placed in its folder too.
An external table helps us create a schema for reading data in files kept elsewhere. The LOCATION clause is required when a non-partitioned external table is created, pointing to the folder that holds the data files.
Here is an example of a managed table:
USE sales;
CREATE TABLE customer
(
customerid int
, name string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
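Once the managed table exists, a delimited data file can be placed into its folder, for example with the LOAD DATA statement (the source file path below is hypothetical):

```sql
-- moves the file from the given HDFS path into the
-- customer table's folder under the sales.db database
LOAD DATA INPATH '/tmp/customer.txt' INTO TABLE customer;
```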
Here is an example of an external table:
CREATE EXTERNAL TABLE customer
(
customerid int
, name string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/xyz/abc/lmn';
Once the table is created, SQL-like queries can be used for accessing the data in the files.
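For instance, assuming data files matching the schema above are in place, the rows can be queried with familiar syntax:

```sql
-- reads the pipe-delimited files behind the table
-- and filters them just like an ordinary SQL table
SELECT customerid, name
FROM customer
WHERE customerid > 100;
```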
Let's see more on these with samples in future posts.