Sunday, July 17, 2016

SQL Server Brain Basher of the Week #047 - Database diagram

This is something I asked during an interview. While we were talking about system databases, I asked the interviewee how to get information about the tables that exist in system databases. The answer was "Simple, we can make a database diagram and use it for understanding structures and relationships". Can we really do it?

Here is the question for the week: Can we create database diagrams on system databases?

The answer is simple, but it is not the answer given by the interviewee. It is not possible. SQL Server does not allow us to generate diagrams or scripts for system tables. Even if you try to use Generate Scripts, it will not list the system tables.

See how it appears when trying to script the msdb database. As you can see, there is no option to get a script generated even for system tables.
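If you only need to understand the structure of system tables, one alternative is querying the catalog views instead. Here is a minimal sketch run against msdb (dbo.sysjobs is used only as an example table):

USE msdb;
GO

-- List the system tables shipped by Microsoft
SELECT name, type_desc
FROM sys.objects
WHERE is_ms_shipped = 1 AND type = 'U'
ORDER BY name;

-- Inspect the columns of a particular system table, e.g. dbo.sysjobs
SELECT c.name AS column_name, t.name AS data_type, c.max_length
FROM sys.columns AS c
JOIN sys.types AS t ON c.user_type_id = t.user_type_id
WHERE c.object_id = OBJECT_ID('dbo.sysjobs');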


Sunday, July 10, 2016

SQL Server Brain Basher of the Week #046 - Default Database Size

You may or may not have noticed the initial size of a database at creation, but there is a default size for both data and log files. Most of the time, developers just create the database by supplying the name without considering other properties unless some property needs attention and a change. The same goes for the size. Let's talk about the initial size today;

What is the default size of SQL Server database?

Before discussing the answer, let's see how a database is created with the CREATE DATABASE command. When SQL Server has to create a database, it actually takes a copy of the Model database, as it is considered the standard template for databases. Therefore the default size of a new database is based on the size set for the Model database.

Up to SQL Server 2014, the default size of the data file was 3MB and the log file was 1MB. However, this has been changed in SQL Server 2016. Now it is 8MB for both files. Not only that, the File Growth setting has been changed as well.
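You can verify the defaults on your own instance by checking the Model database files; a minimal sketch:

-- Check the current size (in MB) and growth settings of the Model database files
SELECT name, type_desc,
       size * 8 / 1024 AS size_mb,
       CASE WHEN is_percent_growth = 1
            THEN CAST(growth AS varchar(10)) + '%'
            ELSE CAST(growth * 8 / 1024 AS varchar(10)) + ' MB'
       END AS growth_setting
FROM sys.master_files
WHERE database_id = DB_ID('model');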


Friday, July 8, 2016

Conformed dimensions and Stovepipe data marts

A long time back, I wrote a small note on Stovepipe data marts; I thought to make a blog post on it again as it is still unknown to many. The simplest way of remembering what a Stovepipe Data Mart is: a data mart that cannot be extended to or integrated with the Data Warehouse. This happens if your data mart does not contain Conformed Dimensions.


How do we know whether we have designed conformed dimensions?

Most of the entities identified during the requirement gathering process are common to the entire organization; for example, Product, Employee and Customer are some of these entities. Although these entities have many attributes, only a few attributes will be taken into account during the design of data marts. Not only that, some attributes may not be correctly identified and may be designed wrongly, even though the wrong design satisfies the current requirement to some extent. This does not lead you to conformed dimensions.

In order to make sure that the dimension designed is a conformed dimension, make sure that all related and known attributes are correctly added. There can be a situation where you need only a few attributes for the current data mart, but if you have identified many attributes, make sure everything is added. When you add them, make sure that the correct business logic related to them is implemented. This makes the dimension a conformed dimension. For example, if you have designed a dimension called Product for the Sales data mart and you have added all possible attributes and implemented all possible business logic, then it is a conformed dimension and you will be able to reuse it when you design the Marketing or Finance data marts, as sketched below.
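As a minimal sketch (the attribute list is hypothetical, only to illustrate the idea), a conformed Product dimension carries all known attributes up front, not just the ones the Sales data mart needs today:

-- A conformed Product dimension: all known attributes and standard business logic,
-- so the same table can serve the Sales, Marketing and Finance data marts
CREATE TABLE dbo.DimProduct
(
 ProductKey int IDENTITY(1,1) PRIMARY KEY        -- surrogate key
 , ProductAlternateKey nvarchar(25) NOT NULL     -- business key from the source system
 , ProductName nvarchar(100) NOT NULL
 , Category nvarchar(50) NOT NULL
 , SubCategory nvarchar(50) NOT NULL
 , Color nvarchar(20) NULL
 , ListPrice money NULL
 , StandardCost money NULL
 , StartDate date NOT NULL                       -- for Slowly Changing Dimension handling
 , EndDate date NULL
 , IsCurrent bit NOT NULL DEFAULT (1)
);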

Coming back to Stovepipe data marts: simply, if you do not have conformed dimensions, it can be considered a Stovepipe data mart. You will not be able to integrate the data mart with the data warehouse without doing modifications to it, which will be costly, or even impossible considering factors like resource availability, system usage and budget constraints. Therefore make sure that you are not heading towards a Stovepipe data mart.

Thursday, July 7, 2016

Azure SQL Database - Cannot connect to database.windows.net.

Are you experiencing this error when trying to connect to your Azure SQL Database?

TITLE: Connect to Server
------------------------------

Cannot connect to database.database.windows.net.

------------------------------
ADDITIONAL INFORMATION:

A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) (Microsoft SQL Server, Error: 53)

For help, click: http://go.microsoft.com/fwlink?ProdName=Microsoft%20SQL%20Server&EvtSrc=MSSQLServer&EvtID=53&LinkId=20476

------------------------------

The network path was not found


There can be various reasons for this, but beyond the general technical reasons, you need to check one thing: whether you have previously connected to another database that no longer exists.

For example, you might have taken a snapshot of your database and connected to it using Management Studio. For connecting, you need to go through Options and set the database name there. Once set, SSMS keeps it, and the next time you try to log in, even though you do not notice it, it might try to connect to the snapshot you connected to before. If it exists, it will connect; if not, it throws the above error without giving you much detail.

So, check that before going into technical analysis.



Wednesday, July 6, 2016

SQL Server Backup to URL - Now easier than before

I made a post on SQL Server Backup to URL that describes how to take a backup to a URL. Earlier, this required the credential to be created before taking the backup, but now, creating credentials using a Policy on the container or a Shared Access Signature key (see more on this with SQL Server On-Premise Database files maintained in Azure) can be done with the given GUI. This saves time, and not only that, it clears the confusion we had with the previous GUI.


This is what we had before;


Now the interface with latest Management Studio;


If you have already created a credential, it can be picked from the first drop-down. If not, click on the New container button and sign in to your Azure account. Once signed in, select the Storage account and the Container.


You can generate the Shared Access Signature by clicking the Create Credential button, and complete it.


Now you can select the Azure storage container and take the backup.
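If you prefer TSQL over the GUI, the same backup can be taken with a script. Here is a minimal sketch, assuming a storage account named mystorageaccount, a container named backups, and a Shared Access Signature token you have already generated:

-- When using a Shared Access Signature, the credential name must be the container URL
CREATE CREDENTIAL [https://mystorageaccount.blob.core.windows.net/backups]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<SAS token, without the leading question mark>';
GO

-- Take the backup straight to the container
BACKUP DATABASE AdventureWorks2016CTP3
TO URL = 'https://mystorageaccount.blob.core.windows.net/backups/AdventureWorks2016CTP3.bak'
WITH COMPRESSION, STATS = 5;
GO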

Tuesday, July 5, 2016

SQL Server Management Studio July 2016 is available

The good thing is that we get continuous releases of Management Studio. Microsoft has released the newest version of SQL Server Management Studio, which is the July 2016 release. The latest version is 13.0.15500; the June 2016 release was 13.0.15000.

You can download the latest from here.

Some of the major changes are as follows (quoted from the same page);
  • Support for Azure SQL Data Warehouse in SSMS.
  • Significant updates to the SQL Server PowerShell module. This includes a new SQL PowerShell module and new CMDLETs for Always Encrypted, SQL Agent, and SQL Error Logs.
  • Support for PowerShell script generation in the Always Encrypted wizard.
  • Significantly improved connection times to Azure SQL databases.
  • New ‘Backup to URL’ dialog to support the creation of Azure storage credentials for SQL Server 2016 database backups. This provides a more streamlined experience for storing database backups in an Azure storage account.
  • New Restore dialog to streamline restoring a SQL Server 2016 database backup from the Microsoft Azure storage service.



Monday, July 4, 2016

SQL Server Query Editor with Query Window

Some prefer to write TSQL statements without any help from an editor, while some prefer to use an editor because it simply allows us to select tables and columns easily and set up joins, grouping and filtering easily. I personally prefer to write statements by myself as it helps me to remember the way of writing them.

Those who like to get help from an editor for writing SELECT statements usually go to the Views node in Object Explorer and use the New View menu item for opening the editor. It allows us to construct the SELECT statement without writing it, and the statement can be used without creating the view. However, there is a way of opening the editor from a New Query window without going through Create View.

Do you know that there is a menu item called Design Query in Editor... in the Query menu? This menu item opens the editor that can be used for constructing the query and adding it to your New Query window.



Although this is not a new menu item, many do not use it; in fact, many do not know that it is available. It is useful, and you can get your statement written without spending much time.

Sunday, July 3, 2016

Scaling out the Business Intelligence solution

In order to increase the scalability and distribute workload among multiple servers, we can use a scale-out architecture for our DW/BI solution by adding more servers to each BI component. A typical BI solution that follows Traditional Business Intelligence or Corporate Business Intelligence uses following components that come in Microsoft Business Intelligence platform;
  • The data warehouse with SQL Server database engine
  • The ETL solution with SQL Server Integration Services
  • The models with SQL Server Analysis Services
  • The reporting solution with SQL Server Reporting Services
Not all components can be scaled out by default. Some support it natively, some need additional custom implementations. Let's check each component and see how it can be scaled out;

The data warehouse with SQL Server database engine


If the data warehouse is extremely large, it is always better to scale it out rather than scaling it up. There are several ways of scaling out. One common way is Distributed Partitioned Views. It allows us to distribute data among multiple databases that are either in the same instance or different instances, and have a middle layer such as a view to redirect requests to the right database.
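A minimal sketch of a distributed partitioned view (the server, database and table names are hypothetical): each member table holds one slice of the data, enforced by a CHECK constraint, and the view unions them across linked servers:

-- Each member table holds one year of orders, enforced by a CHECK constraint on OrderYear,
-- e.g. CHECK (OrderYear = 2015) on SERVER1 and CHECK (OrderYear = 2016) on SERVER2
CREATE VIEW dbo.AllOrders
AS
SELECT OrderId, OrderYear, CustomerId, Amount
FROM SERVER1.SalesDB.dbo.Orders_2015
UNION ALL
SELECT OrderId, OrderYear, CustomerId, Amount
FROM SERVER2.SalesDB.dbo.Orders_2016;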

A few more ways are discussed here, showing the advantages and disadvantages of each.

In addition to that, the Analytics Platform System (APS), which was formerly named Parallel Data Warehouse (PDW), is available for large data warehouses. It is based on a Massively Parallel Processing (MPP) architecture instead of Symmetric Multiprocessing (SMP), and it comes as an appliance. This architecture is also called a Shared-Nothing architecture, and it supports distributing data and queries across multiple compute and storage nodes.

The ETL solutions with SQL Server Integration Services


There is no built-in facility to scale out Integration Services, but it can be done by distributing packages among servers manually, allowing them to be executed in parallel. However, this requires extensive custom logic within packages, and maintenance cost might increase.

The models with SQL Server Analysis Services


Generally, this is achieved using multiple Analysis Services query servers connected to a read-only copy of the multi-dimensional database. This requires a load-balancing middle tier component that handles client requests and directs them to the right query server.

The same architecture can be slightly changed and implemented in different ways as per requirements. Read the following articles for more info.


The reporting solution with SQL Server Reporting Services


The typical way of scaling out Reporting Services is configuring multiple report server instances connected to the same report server database. This separates the report execution and rendering workload from database workloads.

Saturday, July 2, 2016

Connecting to the Integration Services - "Access is denied."

If the required permissions are not configured for Integration Services for your account, you will experience the following error when trying to connect to Integration Services;

Connecting to the Integration Services service on the computer "" failed with the following error: "Access is denied."


This clearly indicates that you need permission on it. How can we add the required permissions for an account?

Here are the steps for granting the required permissions;
1. Search for Component Services and open it.
2. Expand Component Services -> Computers -> My Computer -> DCOM Config.


3. Find Microsoft SQL Server Integration Services, right-click on it and open Properties. Go to the Security tab and add the accounts you need with relevant permissions for Launch and Activation Permissions, Access Permissions, and Configuration Permissions by clicking the Edit button.



Click OK to save all. You need to restart the Integration Services service to get the added permissions applied. Once restarted, you will be able to connect to it without any issue.

Friday, July 1, 2016

SQL Server 2016 Sample Database - Wide World Importers- Restoring and adding more records

We have been using the Adventure Works database with almost all versions of SQL Server and, no question, it can be used with SQL Server 2016 as well. Everyone likes Adventure Works, but when complex testing over functionalities is required, Adventure Works is considered a simple database and many tend to use other databases instead. One popular alternative was the Contoso database, but for most cases, its size was not enough.

For SQL Server 2016, Microsoft has introduced a new database called Wide World Importers. This database covers many new features of SQL Server, and it allows you to load data up to the current date. Not only that, you decide how many customer orders you need per day to increase the volume.

Here is the way of adding Wide World Importers databases.

WideWorldImporters Database
Go to this link and download the WideWorldImporters-Full.bak file. Restore it either using TSQL or the GUI.

USE master
RESTORE DATABASE WideWorldImporters
FROM  DISK = N'D:\Software\SQL Server 2016\WideWorldImporters-Full.bak' 
WITH  FILE = 1,  
MOVE N'WWI_Primary' TO N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\WideWorldImporters.mdf',  
MOVE N'WWI_UserData' TO N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\WideWorldImporters_UserData.ndf',  
MOVE N'WWI_Log' TO N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\WideWorldImporters.ldf',  
MOVE N'WWI_InMemory_Data_1' TO N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\WideWorldImporters_InMemory_Data_1',  
NOUNLOAD,  STATS = 5
GO

Once restored, you should see the database in your instance. If you query the Sales.Orders table, you should see 73,595 records. These orders are populated up to 2016-05-31.
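A quick check after the restore, as a minimal sketch:

USE WideWorldImporters;
GO

-- Verify the record count and the latest order date
SELECT COUNT(*) AS OrderCount, MAX(OrderDate) AS LatestOrderDate
FROM Sales.Orders;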


If you want to add more data up to the current date, you can call the DataLoadSimulation.PopulateDataToCurrentDate stored procedure with relevant parameter values as below.

DECLARE @return_value int

EXEC @return_value = [DataLoadSimulation].[PopulateDataToCurrentDate]
  @AverageNumberOfCustomerOrdersPerDay = 80,
  @SaturdayPercentageOfNormalWorkDay = 30,
  @SundayPercentageOfNormalWorkDay = 5,
  @IsSilentMode = 0,
  @AreDatesPrinted = 0

SELECT 'Return Value' = @return_value

You can simply set a larger number for @AverageNumberOfCustomerOrdersPerDay if you need more records.

WideWorldImportersDW database
If you need the relational data warehouse related to Wide World Importers, then you need to download WideWorldImportersDW-Full.bak from the same page. Restore it just like the previous database using the following script;

USE master
GO

RESTORE DATABASE WideWorldImportersDW
FROM  DISK = N'D:\Software\SQL Server 2016\WideWorldImportersDW-Full.bak' 
WITH  FILE = 1,  
MOVE N'WWI_Primary' TO N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\WideWorldImportersDW.mdf',  
MOVE N'WWI_UserData' TO N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\WideWorldImportersDW_UserData.ndf',  
MOVE N'WWI_Log' TO N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\WideWorldImportersDW.ldf', 
MOVE N'WWIDW_InMemory_Data_1' TO N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\WideWorldImportersDW_InMemory_Data_1',  
NOUNLOAD,  STATS = 5
GO

If you have repopulated the WideWorldImporters database using PopulateDataToCurrentDate, then you need to transfer the new data to the data warehouse. You do not need to load it manually; you can use a given package for populating the data warehouse.

Download Daily.ETL.ispac from the same page and add it to Integration Services catalog by running the file. Once it is added, you should see the package;


Right-click on the package and execute it to update the data warehouse. Make sure the connection strings set are correct as per your environment. If not, change them and execute it.


Once executed, your data warehouse is updated based on new data added to WideWorldImporters database.


Sunday, May 29, 2016

SQL Server Brain Basher of the Week #045 - Stored Procedure or Function

A question related to Stored Procedures is very common during interviews, and many get stuck with it when it has to be discussed in detail. What everyone thinks is that a Stored Procedure is pre-compiled code just like an EXE (of course it is not) and because of that it can execute the logic fast. Yes, it is true that you get some performance benefits, but it is not the right way to think about it. What if a scenario like the one below is given and a question is asked...

Assume that you need to implement a business logic that is supposed to do some calculation based on given inputs. The result is always a single integer value and the code of the implementation is very complex. The logic requires a few tables to be read and some complex calculations to be done. Should I use a Stored Procedure for this or a Function?

Generally, if I am the interviewer, I do not expect just an answer saying "Stored Procedure" or "Function". I prefer a lengthy answer explaining why we should go for either a Stored Procedure or a Function.

Let's see some key points for making the decision. Considering the return value and the inputs, it looks like it is better to use a function than a stored procedure, because the implementation is similar to a scalar function. You need to get more details on the logic that needs to be implemented. If it needs loading data into temporary containers, remember, a function does not support temporary tables, only table variables. And we need to know how it is going to be used. Is it going to be used with SELECT statements? If so, you know that a stored procedure cannot be used within a SELECT statement, specifically in the column list or the WHERE clause, whereas a function can (see the sketch below). As per the requirement, it looks like it does not require INSERT, UPDATE or DELETE. Do not forget, these statements are not allowed inside functions.
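As a minimal sketch (the tables and the calculation are hypothetical, only to show the usage pattern), a scalar function can be called inline from a query, which a stored procedure cannot do:

-- Hypothetical calculation wrapped in a scalar function
CREATE FUNCTION dbo.GetCustomerScore (@CustomerId int)
RETURNS int
AS
BEGIN
 DECLARE @Score int;

 -- read the required tables and apply the business logic (simplified here)
 SELECT @Score = COUNT(*) * 10
 FROM dbo.CustomerOrders
 WHERE CustomerId = @CustomerId;

 RETURN @Score;
END;
GO

-- The function can be used in the column list and the WHERE clause
SELECT CustomerId, dbo.GetCustomerScore(CustomerId) AS Score
FROM dbo.Customers
WHERE dbo.GetCustomerScore(CustomerId) > 50;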

Not only that, though you do not need to consider it with the given scenario, remember, a function does not support transactions or TRY/CATCH blocks.

This is what we expect from the interviewee. If the interviewee can explain the answer with details, it forces the interviewer to think "he is smart". So be smart!

Saturday, May 28, 2016

Browsing internet inside SQL Server Management Studio

How often do you google or bing on technical issues while developing modules with SQL Server? For me, it is quite often, as I always do searches on the code I write. Now the question is, do you go out of your Management Studio for browsing the internet?

Do you know that you can browse the internet inside SSMS? Yes, it is possible, and it is not something newly added with SQL Server 2016. It was available even with SQL Server 2012 :).

This is really simple. There is an icon in the toolbar for opening a browsing interface, and then you can work with it as you want.


In addition to that, we have been given two shortcuts, Community Projects and Samples and Resource Center, that help you to get more info on SQL Server whenever wanted.


Friday, May 27, 2016

SET ROWCOUNT overrides TOP?

While I was going through an implemented set of code, I noticed that ROWCOUNT is heavily used (of course, it is an old implementation) and, not only that, TOP is used in the newly added code. There is an interesting behavior when these two are combined, which was one of the issues I was troubleshooting. Let me explain with an example.

First of all, remember, there is a behavior change with ROWCOUNT in a future SQL Server release: ROWCOUNT will have no effect on INSERT, DELETE and UPDATE statements. Therefore avoid ROWCOUNT with these action statements; use TOP instead.

Have a look at the query below. It sets ROWCOUNT to 100, instructing SQL Server to stop processing the query after 100 rows are returned. However, TOP is used with the SELECT, limiting the result to 10 records. Because of that, we get only 10 records. This proves that TOP overrides ROWCOUNT.

USE AdventureWorks2016CTP3;
GO

SET ROWCOUNT 100;
SELECT TOP (10) *
FROM Sales.SalesOrderHeader;


Now let's change the query a bit. Let's instruct SQL Server to process only 10 records using ROWCOUNT and have TOP 100 with the SELECT statement.

USE AdventureWorks2016CTP3;
GO

SET ROWCOUNT 10;
SELECT TOP (100) *
FROM Sales.SalesOrderHeader;

What we expect is 100 records, as we know that TOP overrides ROWCOUNT, but it returns only 10.


What is the reason? This is how it works when ROWCOUNT and TOP are combined: ROWCOUNT overrides the TOP keyword in the SELECT only if the value set for ROWCOUNT is smaller than the value set with TOP. This is something you need to remember. Best is, try to use only TOP without using ROWCOUNT.
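If ROWCOUNT has already been set earlier in the session or by inherited code, it can be switched off so that only TOP takes effect:

-- Setting ROWCOUNT to 0 turns the option off for the session
SET ROWCOUNT 0;

SELECT TOP (100) *
FROM Sales.SalesOrderHeader;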

Wednesday, May 25, 2016

SQL Server SESSION_CONTEXT function - SQL Server 2016 Enhancements

If you are a web developer and have worked with Classic ASP or ASP.Net, then you know how important Session Variables are for keeping various values throughout the session. Although a similar capability existed with SQL Server, it did not function exactly like Session Variables as it had many limitations. Now, with SQL Server 2016, a new function called SESSION_CONTEXT has been introduced; it accepts the key of a key-value pair set with sp_set_session_context and returns the value of the submitted key.

In previous versions, before 2016, this was implemented via the CONTEXT_INFO function. It holds only a single binary value, making it difficult to maintain multiple values. It is limited to 128 bytes per connection, which is not enough at all, and it cannot be secured because the user can change it at will. With these limitations, it was not used in many implementations.

Values for sessions are set with sp_set_session_context, which accepts two mandatory parameters and one optional one. The first parameter is the key (or name) of the session variable, whose type is sysname. The second parameter is the value; it is accepted as sql_variant and the value can be as large as 256KB. The third parameter, which is optional, indicates whether the variable is read-only or not; the default is 0, which means not read-only.
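Before going into the detailed example below, here is a minimal sketch of setting a value and reading it back within the same session:

-- Set a session variable (read-only flag not set, so it can be changed later)
EXEC sp_set_session_context @key = N'Language', @value = N'English';

-- Read it back anywhere within the same session
SELECT SESSION_CONTEXT(N'Language') AS SessionLanguage;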

Let's test this. The following code performs these steps;

  1. Create a Login called Smith for testing
  2. Create a Database called Config. Assume that you maintain additional configuration details in this database for your applications.
  3. Add Smith as a User to the database. He will be a member of db_datareader role.
  4. Create a Table called EmployeeSecurityLevel and inserts two records.
  5. Create a Procedure called GetProducts for testing. This procedure checks the set Security Level and displays it. In addition to that, Smith is added to this database as well.
  6. Create a Logon Trigger to capture the user and set the Security Level as a session variable. It assigns the relevant Security Level taken from the Config database to a variable called CompanySecurityLevel.
-- ******** 1 ********
-- Create login for Smith
USE master;
GO
CREATE LOGIN Smith WITH PASSWORD = 'Pa$$w0rd';
GO

-- ******** 2 ********
-- Create a Config database
CREATE DATABASE Config;
GO

USE Config;
GO

-- ******** 3 ********
-- Add Smith to Config
CREATE USER Smith FOR LOGIN Smith;
EXEC sp_addrolemember 'db_datareader', 'Smith';

-- ******** 4 ********
-- Create a table for holding security levels
CREATE TABLE dbo.EmployeeSecurityLevel
(
 Id int Identity(1,1) PRIMARY KEY
 , UserId nvarchar(200) NOT NULL INDEX IX_EmployeeSecurityLevel_UserId NONCLUSTERED
 , SecurityLevel tinyint NOT NULL
);

-- Add two users
INSERT INTO dbo.EmployeeSecurityLevel
 (UserId, SecurityLevel)
VALUES
 (N'Dinesh', 1)
 , (N'Smith', 3);

 

-- ******** 5 ********
USE AdventureWorks2016CTP3;
GO

-- Add SP for getting products
DROP PROC IF EXISTS dbo.GetProducts;
GO
CREATE PROC dbo.GetProducts
AS
BEGIN

 DECLARE @SecurityLevel tinyint
 SELECT @SecurityLevel = Convert(tinyint, SESSION_CONTEXT(N'CompanySecurityLevel'));

 PRINT 'Displaying products as per security level.....'
 PRINT 'Security level: ' + Convert(varchar(3), @SecurityLevel);
END;

-- Add Smith and set relevant permission
CREATE USER Smith FOR LOGIN Smith;
EXEC sp_addrolemember 'db_datareader', 'Smith';
GRANT EXECUTE ON dbo.GetProducts TO Smith;

-- ******** 6 ********
USE master;
GO

-- Creating Logon trigger to set the session variable
CREATE TRIGGER SetSessionVariables
ON ALL SERVER 
AFTER LOGON
AS
BEGIN
 DECLARE @SecurityLevel tinyint

 -- Query Config database and get the value
 SELECT @SecurityLevel = SecurityLevel
 FROM Config.dbo.EmployeeSecurityLevel
 WHERE UserId = CURRENT_USER;

 -- Set the value, note that it is set as read-only
 EXEC sp_set_session_context N'CompanySecurityLevel', @SecurityLevel, 1; 
END;

Once everything is set, when Smith logs in, his Security Level is captured via the trigger and set with session_context. This value can be read in any database within his session. The code below shows Smith executing the procedure; the result is based on his Security Level.

-- Logged in as Smith
USE AdventureWorks2016CTP3;
GO
-- Smith executing the SP
EXEC dbo.GetProducts;


Monday, May 23, 2016

Creating HDInsight Hadoop Cluster using SSIS and processing unstructured data using Hive Task - Azure Feature Pack - Part II

In my previous post, Creating HDInsight Hadoop Cluster using SSIS and processing unstructured data using Hive Task - Azure Feature Pack - Part I, I discussed how to prepare the environment for processing unstructured data using SSIS. There, I explained the key requirements for this;
  • Integration Services Feature Pack for Azure
  • Microsoft Hive ODBC Driver
  • Self-Signed certificate for adding Azure
Now let's see how we can create an SSIS package for handling the process. The assumption we made with Part I is that you have a file that contains unstructured data. Let's say it is something like below;


Let's talk about a simple process for testing this. The above file is the famous davinci.txt, which comes from Project Gutenberg and is used to demonstrate the famous word count big data demo. So the assumption is, you have this file and you need to get a word count from it as part of your ETL process. In order to achieve this using SSIS with the help of HDInsight, the following has to be done;
  • Upload the file to Azure Storage
  • Create the Hadoop Cluster on-demand (you can have it created if you are continuously using it)
  • Process the file using HiveQL for getting the word counts
  • Finally, read the result into a local database.
Let's start working on it. For uploading a file to Azure Storage, the Azure Blob Upload Task that comes with the Integration Services Feature Pack for Azure can be used. All it needs is a connection for the storage.

Create an SSIS project and have a package with a proper name. Drag the Azure Blob Upload Task and drop it onto the Control Flow. Open its editor and create a new connection. The new connection dialog box requires the Storage account name and Account key. If you have a storage account created in your Azure subscription, then access it and get the name and key1. If you do not have a storage account, create it and then get them.



This is how you need to set the connection with SSIS.


In addition to the connection, you need to set following items with it;
  • Blob container - make sure you have a Container created in your storage. I use CloudXplorer for accessing my Azure storage and I can easily create containers and folders using it. You can do it with PowerShell, Visual Studio or any other third-party tool.
  • Blob directory - Destination. A folder inside the container. This folder is used for storing your file in Azure Storage.
  • Local directory - Source. The location of the folder where you keep the davinci.txt file.
Once all are set, the task is ready for uploading files.


The next step is adding the Azure HDInsight Create Cluster task onto the Control Flow. Drag and drop it and open the editor for configuring it. This requires an Azure Subscription Connection which has to be created with the following items;
  • Azure subscription ID - this can be easily seen with Settings when accessing the subscription via Classic Portal (see Part 1)
  • The Management certificate thumbprint - this is what we created with Part I and uploaded to Azure. This should be browsed in Local Machine location and My store.

Once the connection is created, you need to set other properties;
  • Azure Storage Connection - Use the same connection created for upload task.
  • Location - Use the same location used for the storage
  • Cluster name - Set a unique name for this. This is your HDI name
  • Cluster size - set number of nodes you want for your cluster
  • User name - set the user name of administrator of your HDI.
  • Password - set a complex password for the user.


The second task in the control flow is ready. The next task is for executing the HiveQL query for processing data. I have some posts written on Hive: What is Hive, What is Hive Database, What is Hive Table?, How to create a Hive table and execute queries using HDInsight. Have a look at them if you are new to Hive. The Azure HDInsight Hive Task is the one we have to use for processing data using HiveQL. Drag and drop it, and configure it like below.
  • Azure subscription connection - Use the same connection created for above task.
  • HDInsight cluster name - Use the same name given with previous task.
  • Local log folder - Set a local folder for saving log files. This is really important for troubleshooting.
  • Script - You can either set the HiveQL as an in-line script or you can have your script in a file saved in storage, and refer to it. I have added the query as an in-line script that does the following;
    • Create an external table called Words with one column called text.
    • Execute a query that aggregates data in Words and insert the result to WordCount table.
DROP TABLE IF EXISTS Words;
CREATE EXTERNAL TABLE Words
(
 text string
) row format delimited 
fields terminated by '\n' 
stored as textfile
location '/inputfiles/';
DROP TABLE IF EXISTS WordCount;
CREATE TABLE WordCount AS
SELECT word, COUNT(*) FROM Words LATERAL VIEW explode(split(text, ' ')) lTable as word GROUP BY word;  


We have added a task for uploading the file (you can upload many files into the same folder), a task for creating the cluster and a task for processing data in the added files. The next step is accessing the WordCount table and getting the result into the local environment. For this, you need a Data Flow task. Inside the Data Flow, have an ODBC data source for accessing the Hive table and a destination as you prefer.

Let's configure ODBC source. Drag and drop and set properties as below.

  • ODBC connection manager - Create a new connection using Hive ODBC connection created with Part I.
  • Data access mode - Select Table Name as HiveQL stores the resultset into a table called WordCount.
  • Name of the table or view - Select WordCount table from the drop-down.
  • Columns - Make sure it detects columns like below. Rename them as you want.



Note that you cannot configure the source if you have not created and populated the Hive table. Therefore, before adding the Data Flow task, execute the first three control flow tasks that upload the file, create the cluster, and process data and save it into the table. Then configure the Data Flow task as above.

Add a Data Reader destination for testing this. You can add any type of transformation if you need to transform data further and send it to any type of destination. Enable the Data Viewer for seeing data.


Now you need to go back to Control Flow and add one more task for deleting the cluster you added. For that, drag and drop Azure HDInsight Delete Cluster Task and configure just like the way you configure other Azure tasks.


That is all. Now if you run the package, it will upload the file, create the cluster, process the file, get data into the local environment, and delete the cluster as the final task.



This is how you use SSIS for processing unstructured data with the support of HDInsight. You need to know that creating HDInsight on-demand takes a long time (longer than I expected; I have already checked with experts and am waiting for a solution). Therefore you may create the cluster and keep it in Azure if the cost is not an issue.

Sunday, May 22, 2016

SQL Server Brain Basher of the Week #044 - Reporting Services

Let's talk about something related to Reporting Services. You are asked to install and configure Reporting Services on one of the servers given. It is a new server; no instance of SQL Server has been installed. You are asked not to configure anything manually, and to install it with default configurations. You start the installation and select Reporting Services - Native in the Feature Selection page and continue the wizard. You want the wizard to configure your Reporting Services, but you notice that the option Install and configure is disabled.



What is the reason for this?

If you ask the wizard to install files and configure Reporting Services for you, it needs to create two databases for you. Now the issue is, it cannot find a SQL Server database engine in the same instance, as you are installing a new instance and you have not selected Database Engine Services in the Feature Selection page. Therefore the wizard cannot create the databases required for configuring Reporting Services. If you have already installed SQL Server with database engine services and you are updating the same instance by adding Reporting Services, or if you have selected Database Engine Services in the Feature Selection page, then you will see this option enabled and you can continue with the wizard.

Saturday, May 21, 2016

Creating HDInsight Hadoop Cluster using SSIS and processing unstructured data using Hive Task - Azure Feature Pack - Part I

A fully-fledged Business Intelligence system never ignores unstructured data. The reason is, you can never get the true insight without considering, consuming and processing all types of data available in an organization. If you design a BI solution and you have both structured and unstructured data, how do you plan to process them?

Generally, processing unstructured data still belongs to the Hadoop ecosystem. It is designed for that, and it is always better to hand over the responsibility to Hadoop. We do BI using the Microsoft SQL Server product suite, and SQL Server Integration Services (SSIS) is the component we use for handling ETL. If there is an unstructured data set that needs to be processed as a part of the ETL process, how can you get the support of Hadoop via SSIS for processing unstructured data and getting it back as structured data? The solution is given with the Integration Services Feature Pack for Azure.

The Integration Services Feature Pack for Azure (download from here) provides us with functionality for connecting with Azure Storage and HDInsight, for transferring data between Azure Storage and on-premise data sources, and for processing data using HDInsight. Once installed, you see the newly added tasks in both Control Flow and Data Flow as below;


Assume that you have a file which is unstructured and needs to be processed. And once processed, you need to get the result back to your data warehouse. For implementing this using SSIS, you need to do the following;
  1. Install Integration Services Feature Pack for Azure (download from here)
  2. Install Microsoft Hive ODBC Driver (download from here)
  3. Generate a Self-Signed Certificate and upload to Azure Subscription.
Why do we need the Microsoft Hive ODBC Driver? We need this for connecting with Hive using ODBC. There are two drivers; one for 32-bit applications and the other for 64-bit applications. In order to use it with Visual Studio, you need the 32-bit driver. It is always better to have both installed, and when creating the DSN, create the same DSN in both the 32-bit and 64-bit System DSN. This is how you have to configure the DSN for Hive in Azure HDInsight.

Open the ODBC Data Source Administrator (32-bit) application and click the Add button in the System DSN tab.


Select the driver as Microsoft Hive ODBC Driver and configure it as below;



You can click on the Test button for testing the connection. If you have not configured the cluster yet (in my case, it is not configured; we will be creating it using SSIS), you will get an error. But you can still save it and keep it. Note that it is always better to create another DSN using the 64-bit ODBC Data Source Administrator with the same name.


The next step is generating the certificate and adding it to the Azure subscription. This is required for connecting to Azure using Visual Studio. The easiest way of doing this is creating the certificate using Internet Information Services (IIS), exporting it using certmgr and uploading it using the Azure Classic Portal. Let's do it.

Open IIS Manager and go for Server Certificates. Click on Create Self-Signed Certificate for generating a certificate.


Once created, open Manage Computer Certificates and start Export Wizard.


The wizard starts with a welcome page and then opens the Export Private Key page. Select the No, do not export the private key option and continue.


Select Base-64 Encoded X.509 (.CER) option in Export File Format and continue.


The next page asks you for the file name for the certificate. Type the name along with the location you need to save it to and complete the wizard.


The certificate is exported in the required format. Now you need to upload it to the Azure subscription. Open the Azure New Portal and then open the Azure Classic Portal. Scroll down the items, find Settings, and click on it.


The Settings page has a Manage Certificate section. Click on it and upload the certificate created.


Everything needed by SSIS for working with Azure Storage and HDInsight is done. Now let's make the SSIS package to upload an unstructured data file, create an HDInsight cluster on-demand, process the uploaded file using Hive, download the result, and then finally remove the cluster because you do not need to pay extra to Microsoft.

Since the post is a bit lengthy, let me make the next part a new post. Read it from the link below:
Creating HDInsight Hadoop Cluster using SSIS and processing unstructured data using Hive Task - Azure Feature Pack - Part II