
Saturday, April 2, 2022

Azure Synapse Analytics - Part II - Hyperspace

 *** This is the session I did for the Sri Lanka Data Community – Feb 2022 episode. It does not contain the entire event.



In modern data solutions, where we see both data warehousing and big data, analytics is no longer limited to columnar-style queries; highly selective queries are now part of the workload as well. Here is a way of optimizing such queries with Hyperspace, an indexing subsystem for Apache Spark. Read more at:
https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-performance-hyperspace?pivots=programming-language-csharp/?WT.mc_id=DP-MVP-33296

This session talks about:
  • What Hyperspace is and how it helps us add indexes
  • How to see the indexes and what changes they make to the data lake
  • How to check whether indexes are used by queries
  • How to compare an indexed query with a non-indexed query using the given APIs
If you have any questions, please add them to the comments section.

Monday, January 15, 2018

Introduction to Azure Data Lake Analytics - and basics of U-SQL

I made another video on Azure Data Lake, specifically on Azure Data Lake Analytics. This is a 40-minute video and it discusses the following items along with demonstrations;

  • What is Azure Data Lake Analytics
  • Data Lake Architecture and how it works
  • Comparison between Azure Data Lake Analytics, HDInsight and Hadoop for processing Big Data.
  • What is U-SQL and basics of it.
  • Demo on How to create an Azure Data Lake Analytics account
  • Demo on How to execute a simple U-SQL using Azure Portal
  • Demo on how to extract multiple files, transform them using C# methods and a referenced assembly, and produce multiple results with slightly complex transformations using Visual Studio.
Here is the video.
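For reference, here is a minimal sketch of what a basic U-SQL script looks like, along the lines of the demos mentioned above; the file path, columns and output location are just examples, not taken from the actual demo.

// Extract rows from a TSV file in the Data Lake Store (hypothetical path and schema)
@searchlog =
    EXTRACT UserId int,
            Region string,
            Query  string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// Transform with a simple aggregation
@result =
    SELECT Region,
           COUNT(*) AS Total
    FROM @searchlog
    GROUP BY Region;

// Write the result back to the store as CSV
OUTPUT @result
TO "/output/SearchByRegion.csv"
USING Outputters.Csv();

Note how the script mixes SQL-like clauses with C# types and expressions; that combination is what the Visual Studio demo builds on.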



Thursday, January 4, 2018

Introduction to Azure Data Lake Store - Does it replace the Data Warehouse?

I have made a video on Azure Data Lake and Azure Data Lake Store and published it on YouTube. The video explains what Azure Data Lake is, its characteristics, its components, how it works, its usage, a comparison between Data Lake Store and Blob Storage, modern data warehousing, and how we can use Azure Data Lake with modern data warehousing.



Have a look at it and let me know your thoughts or any questions you have.

Wednesday, October 18, 2017

tempdb becomes full when updating Clustered Columnstore index

I had a table with 3.5 billion records and wanted to update one column in all of them. This table was not partitioned, even though the design says it should be. With a spec of 2 processors (12 cores) and 64 GB RAM, the update could never complete because it filled tempdb, almost 800 GB, consuming all the space on the drive.

I am still not sure about the exact reason for this, but here are the possible reasons I listed out; they may help you too.
  1. The table is not partitioned. I should have applied partitioning before loading billions of records.
  2. The clustered columnstore index needs to be rebuilt. Maybe it takes a long time because the index is fragmented (I had no way of checking as everything was urgent).
  3. Maybe it is nothing to be surprised about: an update means a delete and an insert, so records are added to the delta store and then moved to rowgroups, hence it takes time and needs extra support from tempdb.
  4. Not enough memory for completing the UPDATE process, hence tempdb is used.
However, I managed to execute the UPDATE as a batch process. Although it took a significant amount of time, it did not use tempdb (it might have used it slightly, but I did not notice).

Here is the code I finally used;

-- Update in batches of one million rows until no NULL values remain,
-- keeping each transaction small so tempdb is not overwhelmed.
WHILE EXISTS (SELECT * FROM dbo.Table01 WHERE Column1 IS NULL)
BEGIN
	UPDATE TOP (1000000) dbo.Table01
		SET Column1 = RIGHT(Column2, 4)
	WHERE Column1 IS NULL;
END;

Monday, October 2, 2017

Introduction to Azure HDInsight

I have written a few posts on HDInsight and thought to make a series of videos on it as well. Here is the first one. In this video, I discuss what Hadoop is, what HDInsight is, the configuration of HDInsight, important settings related to HDInsight, and how to access the cluster using the Azure Portal and PuTTY.

Watch and see!


Tuesday, September 12, 2017

Run Hive Queries using Visual Studio

Once an HDInsight cluster is configured, we generally use either the portal dashboard (powered by Ambari) or a tool like PuTTY for executing queries against the loaded data. Although these are not exactly developer tools, in other words not an IDE, we had to use them because we did not have many options. However, now we can use the IDE we have been using for years for connecting to HDInsight and executing various types of queries such as Hive, Pig and U-SQL: Visual Studio.

Let's see how we can use Visual Studio for accessing HDInsight.

Making Visual Studio ready for HDInsight

In order to work with HDInsight using Visual Studio, you need to install a few tools on Visual Studio. Here are the supported versions;
  • Visual Studio 2013 Community/Professional/Premium/Ultimate with Update 4
  • Visual Studio 2015 any edition
  • Visual Studio 2017 any edition
You need to make sure that you have installed the Azure SDK for your Visual Studio. Click here to download the Web Platform Installer and make sure the following are installed;


This installs Microsoft Azure Data Lake Tools for Visual Studio as well; make sure it is installed.


Now your Visual Studio is ready for accessing HDInsight. 

Connecting with HDInsight

The good thing is, you can connect to your cluster even without creating a project. However, once the SDK is installed, you can see new templates called Azure Data Lake - HIVE (HDInsight), Pig (HDInsight), Storm (HDInsight) and USQL (ADLA), and the HIVE template can be used for creating a project.


The project creates one hql file for you and you can use it for executing your Hive queries. In addition to that, you can open Server Explorer (View menu -> Server Explorer) and expand Azure (or connect to your Azure account and then expand it) to see all components related to Azure.


As you see, it lists all databases, internal and external tables, views and columns. Not only that, by right-clicking the cluster, you can open a window for writing a query or viewing jobs. Here is the screen when I use the first option, Write a Hive Query.


Did you notice IntelliSense? Yes, it supports almost all metadata, hence it is really easy to write a query.

Executing Queries

If you need to see records in a table without limiting data with predicates or constructing a query with additional functions, you can simply right-click the table in Server Explorer and select View top 100 Rows.
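Under the hood, this option simply runs a limited SELECT for you; roughly speaking, it is equivalent to something like the following (the table name is just an example):

-- Roughly what "View top 100 Rows" produces: a simple limited SELECT
SELECT * FROM cleanlog LIMIT 100;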


If you need to construct a query, then use the above method for opening a window and write the query. There are two ways of executing the code: Batch and Interactive. Batch mode does not give you the result immediately, but you will be able to see or download it once the submitted job is completed. If you use Interactive mode, then it is similar to a result in SSMS.


If you use Batch mode, you can see the way the job is being executed. Once the job is completed, you can click on Job Output to see or download the output.


As you see, there is no graphical interface to see the job execution. Visual Studio will show the job execution using a graphical interface only when the job is executed by the Tez engine. Remember, HDInsight does not always use the Tez engine to execute Hive queries; simpler queries may be executed using the MapReduce engine.

See this query that has some computation;
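The original screenshot is not included here, so here is a rough sketch of such a query (the table and column names are just examples); an aggregation like this is handed to the execution engine instead of being answered as a simple fetch, which is when the graphical execution view appears.

-- Hypothetical example of a query with some computation (an aggregation)
SELECT loglevel, COUNT(*) AS TotalEntries
FROM cleanlog
GROUP BY loglevel;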


Can we create a table with this IDE?
Yes, it is possible. You can right-click on your database in Azure Server Explorer and select the Create Table menu item.


Let's talk more about this in later posts.

Sunday, September 10, 2017

How to refer to files in HDInsight - Azure Storage using different ways

If you have started working with Big Data, you surely need to check Microsoft's support for it via the Azure platform: the HDInsight service. HDInsight allows you to create a Hadoop environment within a few minutes, and it can be scaled out or in at any time based on your requirements. I have written a few posts on this; you can have a look at them using the following links;


In order to work with data loaded to HDInsight, or Hadoop, data files have to be referred to using a supported syntax. There are multiple ways of referring to files in the storage with HDFS. Here they are;

Fully qualified path with wasb(s) protocol

This is the most accurate and correct way of referring to files in the storage. Here is the pattern;

wasb://container_name@storage_account_name.blob.core.windows.net/file_path

Here is an example using PuTTY, connecting to HDInsight and reading a file (processed with Hive) that exists there. My container name is dinesqlashdinsight and my storage account name is dinesqlasstorage. The file path is data/cleanlog/000000_0 (this is in fact a Hive table).
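The screenshot is not included here; as an additional, hypothetical sketch of the same idea from the Hive side, the fully qualified path can be used, for example, as the location of an external table (the schema below is made up):

-- Hypothetical external table whose data lives at the fully qualified wasb path
CREATE EXTERNAL TABLE cleanlog_ext (logentry STRING)
STORED AS TEXTFILE
LOCATION 'wasb://dinesqlashdinsight@dinesqlasstorage.blob.core.windows.net/data/cleanlog';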


Connecting with the default container

If your files are in the default container, you can skip the container name and storage name as follows;

wasb:///file_path

Note the three slashes. They are required when you do not mention the container name.


Connecting using Hadoop/Linux/Unix native ways

Generally, when you work with Hadoop on Linux/Unix, you refer to files without the protocol. Azure HDInsight supports the same, and we can refer to files using that syntax.

file_path


Do I need double quotes for my paths?

They are required when you have odd characters like the equals (=) sign in your path. See the example below. I try to read a data file that exists in the cluster; the path has equals signs, hence the path is enclosed in double quotes.



Monday, March 6, 2017

Creating HDInsight Hadoop Cluster with Integration Services (SSIS) - Azure HDInsight Create Cluster Task

I have written two posts on HDInsight + SSIS based on the Azure Feature Pack, and they speak about how to create an HDInsight cluster and process semi-structured data using Hive. You can read them at;

With the latest release of the Azure Feature Pack, some of the settings have been changed, and creation is easier with the new task than with the previous one. This post discusses the way of configuring the Azure HDInsight Create Cluster Task.

In order to create an HDInsight Hadoop cluster using SSIS, you need to make sure that the following steps have been done;
  1. Azure Feature Pack is installed.
  2. Azure Storage is created and you have the name and key of it.
  3. Azure Active Directory Application is created and linked with the subscription used.
Since the installation of the Azure Feature Pack is straightforward, we do not need to discuss it here. Again, creating a storage account is a very common task with almost all Azure-related implementations, hence it is not going to be discussed either. Let's start with the Active Directory Application creation.

Creating an Active Directory Application and link to the subscription
Here are the steps for creating it.
1. Login to the portal and open the Azure Active Directory blade.
2. Click on App Registration and click Add.


3. In the Create blade, enter the name of the app and select the Application Type as Web app / API. Enter the Sign-on URL as you want; it can be changed later as you need, hence enter something and continue even though the domain is not registered with Azure.


4. Once the application is created, you need to take a copy of the Application ID.


5. You need to create an Application Key and take a copy of it too. Go to Keys under Settings and create one. Note the message that appears; it is better to get a copy of the key before closing the blade.


6. You need the Tenant Id for creating the cluster. Go to Properties of the Active Directory blade and get the Directory Id; that is the Tenant Id.


7. You need two more values to be copied and kept. Go to the Subscriptions blade and take a copy of your Subscription Id. The other one is the resource group used; take a copy of its name too. You need to make sure everything is under one resource group.

8. Next is assigning, or linking, the created application to your subscription. Open the Subscriptions blade and select the subscription you need to use. Then select Access Control (IAM) and click on the Add button.


9. In the Add Access blade, select the role you need to assign to the application. You need to make sure that you select the right role for this.


10. With the Add Users Blade, search for the Application Name and select it.



At the end of these steps, you have;
  1. Application Id
  2. Application Key
  3. Tenant Id
  4. Subscription Id
  5. Resource Group Name
Let's start with the SSIS package now. Here are the steps for configuring it.

Creating an SSIS package and configuring the Azure HDInsight Create Cluster Task
Here are the steps for doing it.

1. Create an SSIS project and a package. Drag and drop the Azure HDInsight Create Cluster Task and open the properties of the task.


2. In the properties of the task, select <New Connection...> for AzureResourceManagerConnection. Create a connection using the copied items: Application Id, Application Key, and Tenant Id.


3. Select <New Connection...> for AzureStorageConnection. Enter the storage account name and account key that will be used by the cluster.


4. Next, fill in the following;
  • Subscription Id - enter the copied one
  • ResourceGroup - enter the resource group used for the storage.
  • Location - Select the location used for creating the storage. Everything should be in the same location.
  • ClusterName - enter the name of the cluster you need.
  • ClusterSize - enter the number of Worker Nodes you need in your cluster.
  • BlobContainer - enter the name of the Container that should be created for holding cluster files in the storage.
  • Username and Password - enter the user name and password for the cluster.
  • SshUserName and Password - enter the user name and the password for accessing the cluster remotely.
Everything required is done. If you execute the package or task now, it will create the cluster for you.


Generally it takes about 10-15 minutes. You can open the portal and verify once it is created.


Remember, once this is created, whether you use it or not, you will be charged. Therefore, it is better to delete the cluster once the required operation is completed. You can use the Azure HDInsight Delete Cluster Task for that. Here are its settings.



Saturday, February 11, 2017

Azure Feature Pack for Integration Services (SSIS) - New Version Available

Microsoft has made the latest version of the Azure Feature Pack for Integration Services (SSIS) available for download. This was released on 28-Dec-2016, and there are a few additional components when compared with the previous version.

Why do we need this? Simply to work with Big Data as part of our ETL solution, with the support of the Azure Big Data platform, which is HDInsight and its supporting projects.

I wrote a few posts on the previous version of this; you can read more about them at;

The above posts explain how to download and install the feature pack, configure tasks for accessing HDInsight and process some unstructured data using tasks given with SSIS.

You can get the latest version of the Azure Feature Pack from https://msdn.microsoft.com/en-us/library/mt146770.aspx

I noticed one new Control Flow task, Azure SQL DW Upload Task, and two new Data Flow items, Azure Data Lake Store Source and Azure Data Lake Store Destination, that were not available in the previous version.

Previous Version:


Latest Version:


I need to test these new items; they must be really interesting, and I will surely share them via blog posts.

Sunday, January 3, 2016

Big Data: Is it 3 Vs, 4 Vs or 5 Vs?

Big Data is not a new term for anyone now, and its definition with the widely known 3 Vs, or 3 main attributes, has already been discussed in detail in many forums. Although it is described with the famous three attributes, Volume, Velocity, and Variety, a few more attributes are discussed by experts in some forums, highlighting their significance and probably marking them as more important than the initial ones. Since this is still being asked about during sessions, I thought to put them down in a simple way, explaining all 5 Vs.



Volume:
This was considered the key for Big Data. The "Big" part was added because of the vast amount of data generated by many sources every second, which makes the traditional approaches and technologies used for processing data powerless and weak. The data generated daily started at megabytes to gigabytes and, within a short period, it is now more about exabytes to zettabytes. This vast amount of data comes from 1.49 billion monthly active users on Facebook, 175 million tweets per day, and much more. However, modern big data technology can process data that was either ignored or discarded before.

Velocity:
This is all about the speed of data generation. It is not about hundreds of transactions generated in a second but thousands, perhaps millions, of transactions generated as a continuous stream, capturing them for analysis without even storing them in a database, and then using them for making related decisions. Again, big data technologies have given us many ways of handling velocity that were previously either impossible or hard to achieve.

Variety:
We are familiar with structured, neat data: easy to capture, easy to hold, and of course not that hard to process for analysis. But it cannot be forgotten that 80%-90% of data is unstructured and has always been ignored because of difficulties related to capturing, holding and processing it. Now, with Variety, different forms of data and datasets, such as sentiment data, clickstream data, photos, videos and documents, can be easily handled using modern big data projects.

Veracity:
Here is something we do not discuss much with big data. This is all about the uncertainty of data, the trustworthiness of data. Because of the volume, velocity and variety, a quality data set cannot be expected; accuracy is not guaranteed and, in most cases, not even considered. However, big data technologies let us work with such data and help us handle it too.

Value:
Perhaps the most important V. Since we love modern technologies, we often get trapped by buzzwords and initiate solutions based on them without knowing whether the solution adds something valuable to the business or not. Make sure your big data solution gives something significant to the business and allows you to see the insight; that is the Value of big data.

Thursday, June 18, 2015

Looking for a Hadoop Cluster for testing? Let's configure the Hortonworks HDP Sandbox

The latest buzzword in IT, or more particularly in data analytics, is Big Data. It does not come alone; it always comes with Hadoop, which offers distributed storage and processing. Everyone loves to experiment with new or popular technologies, hence everyone loves to do the same with Big Data and Hadoop. But setting it up is not an easy task, and cloud HDP subscriptions offered by various vendors are not very flexible in terms of the trial period given. However, if you really want, you can use the sandboxes offered by some vendors for testing Hadoop implementations.

One of the HDP clusters I used for testing is HDInsight. Since the trial is limited, I searched for alternatives and found the sandbox offered by Hortonworks. This sandbox is configured as a self-contained virtual machine and can simply be used without connecting to the cloud. It does not come with multiple nodes, meaning that the Name Node, Job Tracker, Data Node, etc. are all in the same virtual machine. You will not be able to get the exact picture of distribution, but you can do everything you need to do with Hadoop with this.

Here are the steps for configuring HDP sandbox.

Visit http://hortonworks.com/ and click on Hortonworks Sandbox under the Get Started menu at the top.


This takes you to a Download and Install page. At this moment, HDP 2.2.4 is the stable and reliable version, but HDP 2.3 - Preview is offered too. HDP 2.2.4 comes in three flavors: VirtualBox, VMware and Hyper-V. Download the one best suited for you.


Make sure you download the install guides too. Configuration and usage are the same for all three types; for this post, I will assume that you download the VirtualBox virtual machine, which is Sandbox_HDP_2.2.4.2_VirtualBox.ova. Once this is downloaded, you need to import it. It is fairly straightforward; all instructions are given in the installation guide, so follow it for importing into your VirtualBox environment (or VMware or Hyper-V).


After importing, all you have to do is click on the Start button. If your machine is ready for running virtual machines, it should start without any issue; however, it is very common to see the following error message with most VMs;

Failed to open a session for the virtual machine Hortonworks Sandbox with HDP 2.2.4.2.

VT-x is not available. (VERR_VMX_NO_VMX).

Result Code: E_FAIL (0x80004005)
Component: Console
Interface: IConsole {8ab7c520-2442-4b66-8d74-4ff1e195d2b6}



There can be two reasons for that. One is not enabling Virtualization in the BIOS. The second can be incompatibility with other virtual environments. If Virtualization is not enabled on your machine, boot into the BIOS and enable it. If you still get the same error, and if you are running Windows 8, make sure you disable Hyper-V. This thread discusses the same; follow it: http://h30434.www3.hp.com/t5/Desktop-Hardware/How-to-Enable-Intel-Virtualization-Technology-vt-x-on-HP/td-p/3198063.

And this video shows how to disable Hyper-V for addressing the same error: https://www.youtube.com/watch?v=Y56boAsdptw.

Once everything is done, you should be able to start it, and you should see a screen like the one below.



As it says, there are two ways of accessing this; you can press Alt+F5 to log in; the user id is root and the password is hadoop.


Once logged in, you can continue with your commands for working with Hadoop.


In addition to that, a GUI is given too. As the first screen explains, open a browser and go to http://127.0.0.1:8888/.


Then click on http://127.0.0.1:8000 to open Hue (Hadoop User Experience). It allows you to do your Hadoop work easily.




Tuesday, June 16, 2015

Different query patterns you see with HiveQL when compared with TSQL

In the recent past... a casual conversation turned into a technical conversation...

Friend: It is a surprise to see an MVP studying hard on open source; anyway, good to see you in the club :).

Me: This does not mean that I will be completely moving to open source, but I like to learn some additional things and get involved with a good implementation.

Friend: Okay, you have been studying HIVE for the last few weeks; tell me three types of queries you have not seen with SQL Server?

Me: Yes, I have seen a few; let me mention three: you can interchange blocks in a SELECT, like placing the FROM clause before SELECT, one CREATE TABLE statement can create a table and load data, and of course you can duplicate a table with CREATE TABLE LIKE :).

The conversation continued with more interesting topics related to the two different platforms. I will make more posts on my usage of open source components such as Hadoop, Pig and HIVE for solutions I work on, but I thought to make a post on these three items;

There are no big differences between TSQL (or standard SQL) and HiveQL queries, but some are noticeable if you are an experienced TSQL developer. If you have used MySQL, you will not see much difference, as HiveQL offers similar patterns.

Let me elaborate on the three things I mentioned in my reply; they will surely encourage you to start studying HIVE if you have not started using it.

1. 
The SELECT statement in SQL Server always starts with the SELECT clause and then FROM;

SELECT Col1, Col2
FROM Table1
WHERE Col3 = 1;

HiveQL allows the same but it can be written like this too;

FROM Table1
SELECT Col1, Col2
WHERE Col3 = 1;

2.
If we need to create a table with SQL Server and load data, there are two ways with pros and cons.

CREATE TABLE NewTable
(
	Col1 int
	, Col2 varchar(10)
);

INSERT INTO NewTable
SELECT Col1, Col2
FROM OldTable;

-- or

SELECT Col1, Col2
INTO NewTable
FROM OldTable;

But HIVE offers an easy way of doing it.

CREATE TABLE NewTable
STORED AS RCFile
AS SELECT Col1, Col2 FROM OldTable;

3.
When we need to duplicate a table structure, this is what we do with SQL Server;

SELECT Col1, Col2
INTO NewTable
FROM OldTable
WHERE 1=0;

and HiveQL facilitates this;

CREATE TABLE NewTable LIKE OldTable;

If I am not mistaken, some of these are supported with APS. Not only that, HDInsight supports HIVE in all these ways; let's try to understand the usage of HIVE and combine it with SQL Server implementations in future posts.

Wednesday, April 15, 2015

What pushes SMP Architecture data warehouse to MPP Architecture

We have been developing data warehouses, centralizing enterprise data and addressing business intelligence requirements. Generally, almost all data warehouses built were based on the traditional architecture called SMP: Symmetric Multi-Processing. Even though we use different design strategies for improving performance and managing volume efficiently, the necessity of scaling up often comes up. Without much consideration of the related factors, we tend to add more resources, spending more money to address the requirement, but at a certain level we need to decide, we need to understand, that the existing architecture is not sufficient for continuation and it needs a change: SMP to MPP.

What is SMP architecture? It is an architecture of tightly coupled multi-processors that share resources and connect to a single system bus. With SMP, the system bus limits scaling up beyond a certain point and, when the number of processors and the data load increase, the bus can become overloaded and a bottleneck can occur.

MPP, Massively Parallel Processing, is based on a shared-nothing architecture. An MPP system uses multiple servers called nodes, which have dedicated, reserved resources, and executes distributed queries with the nodes working independently, offering much better performance than SMP.


How do we know the boundary, or what factors can be used for determining the necessity of MPP? Here are some; they will help you decide.

Here is the first one;
This is all about data growth. Yes, we expect an acceptable data growth with data warehousing, but if it increases drastically, and if we need to continuously plug in more and more storage, it indicates a necessity for MPP. We are not talking about megabytes or gigabytes but terabytes or more. Can't we handle the situation by just adding storage? Yes, it is possible, but there will definitely be a limit on it. Not only that, the cost goes up too. We do not see this limitation with MPP and, in a way, adding additional storage after the initial implementation might not be as expensive as with SMP.

Here is the second;
If somebody talks about BI today, the aspect of real-time or near-real-time is definitely an area of concern. A traditional data warehouse implementation manages this to some extent but not fully, mainly because of the capacity, the loading and the complexity of the query. Generally, with the Microsoft platform, we use SSIS for data loading and de-normalized, semi-normalized tables designed either as general tables or star/snowflake-structured tables for holding data. Assume that a user requests real-time data and the same fact table that contains billions of records has to be accessed; then the performance of the query might not be at the expected level. With MPP, since data can be distributed across multiple nodes, performance of data retrieval is definitely fast and real-time queries are handled efficiently.

Third one;
A traditional data warehouse requires structured, in other words known, relationally formatted data. However, modern BI is not just based on this; data with unknown structures is not rare and is often required for most analysis. How do we handle this? One data warehouse for known, structured data and another for unstructured data? Even if we maintain two data warehouses, how can an analysis be performed combining the two? Can traditional tools attached to the existing architecture be used to combine them efficiently, process them fast and produce the required result? No, it is not possible, and it means it is high time for MPP. This does not mean that MPP handles all of these areas, but it supports them. MPP helps to process unstructured data much more efficiently than SMP, and the Microsoft platform allows combining structured and unstructured data with a user-friendly interface using its solution, which is based on MPP.

Here is the fourth:
Mostly, we extract data from on-premises data storage, and traditional ETL handles this well. However, with modern implementations data generation is not limited to on-premises applications; a lot of important data is generated by cloud applications. In a way, this increases the load as well as the complexity, and sometimes it changes traditional ETL into ELT. The MPP architecture has the capability to handle these complexities and improve overall performance, hence this can be considered another reason for moving from SMP to MPP.

There can be some more reasons, but I think these are the significant ones. Please comment if you see more on this.