Monday, August 14, 2017

Azure SQL Data Warehouse - Part II - Analyzing Created Databases

I started writing on this long time back but had no way of completing it. My first post on this was Azure SQL Data Warehouse - Part I - Creating and Connecting and here is the second part of it.

The first post discussed about Azure SQL Data Warehouse and how to create a server and database using the portal. As the next step, let's talk about the architecture of it bit and then see how Azure data warehouse maintains and processes data.

Control Node

Azure SQL Data Warehouse is a distributed database. It means that data is distributed in multiple locations. However, once the data warehouse is created, we will be connecting with one component called Control Node. It is not exactly a SQL Server database but when connecting to it, it looks and feels like connecting to a SQL Server Database. Control node handles all communication and computation. When we make a request to the data warehouse, Control node accepts it, determines the way it should be distributed based on divide and conquer approach, get it processed and finally send the result to us.

Compute Node

Control node get the data processed in parallel using Compute Nodes. They are SQL Server databases and store all our records. Based on the number of DWU configured, data warehouse is set with one or more Compute Nodes.

Distribution

Data related to the data warehouse is stored in Azure Blob Storage and distributed in multiple locations. It is independent from Compute Nodes, hence they can be operated/adjusted independently. These locations are called as Distributions. The number of distributions for an Azure SQL data warehouse is a fixed number that is sixty (60). These distributions are assigned dynamically to Compute Nodes and when a query is executed, each distribution processes data related to them. This is how the parallel execution happens.

If you need more compute power, you can increase the number of Compute Nodes by increasing DWUs. When the number of Compute Nodes are getting changed, the number of distributions per Compute Node is getting changed as well.

Architecture

This image shows the architecture when you create an Azure SQL Data Warehouse with 100 DWU.


This image shows the architecture when you create an Azure SQL Data Warehouse with 400 DWU.


Let's create two databases and clarify this configuration.

I have discussed all steps related to server creation and database creation in my first post that is  Azure SQL Data Warehouse - Part I - Creating and Connecting , hence I am not going to discuss the same here. Note the image below. That is how I have created two databases for checking the configurations.


As you see, the first data warehouse is created using 100 DWUs and second one with 400 DWUs. Let's see how nodes have been created for these two databases. We can simply use sys.dm_pdw_nodes DMV for getting this information.

SELECT * FROM sys.dm_pdw_nodes;

SELECT type, Count(*) NofNodes
FROM sys.dm_pdw_nodes
GROUP BY type;

Result with the first data warehouse that is created using 100 DWUs.


Result with the second data warehouse that is created using 400 DWUs.


Note the second data warehouse. Since we used more DWUs, it has been created with four Compute Nodes that gives better performance than the first one. Since this is a sample database and it has tables, we can check one of the tables and see how data is distributed with distributions.

The following code shows the distributions created for one data warehouse. As mentioned above, it is always 60.


Here is the code for seeing how rows of a table are distributed in distributions with the second data warehouse created. Note how each distributions are assigned to Compute Nodes.


Records are distributed based on the design of the table. Azure SQL Data Warehouse uses two types of distributions: Round Robin and Hash distributions. Let's talk about it with the next post.

Friday, August 4, 2017

bcp error - Invalid column type from bcp client for colid 1 - Solution

You might be experiencing the following error with bcp when loading text data to SQL Server table;


It is possible to see this error when there is a data type mismatch between a column related to the destination table and a field coming from the file. However, there can be a situation where you have used a view for the bcp and underline table has been changed. If so, you may see the same error.

In my case, my destination table has an identity column, hence I created a view for getting data via bcp. I started seeing this error after making some modifications to the table. Once the view is dropped and recreated, it worked without any issue.

Thursday, August 3, 2017

Script for creating DimTime Dimension Table and populating data

The posts I made on date dimension table (Script for populating Date Dimension with Financial Year and Loading Data Script for Snowflake type Date Dimension) had a good response and few asked about the same for Time related table. This is one of the old scripts I used for populating Time Dimension, it maintains up to the second but if you need more, the script can be easily change.

Here is the code for creating the table and populating data.

CREATE TABLE dbo.Time
(
 TimeKey int identity(1,1) 
 , Constraint pk_Time primary key (TimeKey)
 , [Hour] smallint not null
 , [24Hour] smallint not null
 , [Minute] smallint not null
 , [Second] smallint not null
 , [Time] char(10) not null
 , [24Time] char(8) not null
 , HourBucket char(9)
);
GO

DECLARE @Hour smallint = 0
DECLARE @Minute smallint = 0
DECLARE @Second smallint = 0

WHILE (@Hour < 24)
BEGIN

 WHILE (@Minute < 60)
 BEGIN

  WHILE (@Second < 60)
  BEGIN

   INSERT INTO dbo.[Time]
    ([Hour], [24Hour], [Minute], [Second], [Time], [24Time], HourBucket)
   VALUES
    (CASE WHEN @Hour > 12 THEN @Hour - 12 ELSE @Hour END
    , @Hour, @Minute, @Second
    , REPLICATE('0', 2 - LEN(CASE WHEN @Hour > 12 THEN @Hour - 1 ELSE @Hour END)) + CONVERT(varchar(2), CASE WHEN @Hour > 12 THEN @Hour - 1 ELSE @Hour END) 
     + ':' + REPLICATE('0', 2 - LEN(@Minute)) + CONVERT(varchar(2), @Minute) 
     + ':' + REPLICATE('0', 2 - LEN(@Second)) + CONVERT(varchar(2), @Second) 
     + CASE WHEN @Hour > 12 THEN 'PM' ELSE 'AM' END
    , REPLICATE('0', 2 - LEN(@Hour)) + CONVERT(varchar(2), @Hour) 
     + ':' + REPLICATE('0', 2 - LEN(@Minute)) + CONVERT(varchar(2), @Minute) 
     + ':' + REPLICATE('0', 2 - LEN(@Second)) + CONVERT(varchar(2), @Second) 
    , '[' + REPLICATE('0', 2 - LEN(@Hour)) + CONVERT(varchar(2), @Hour) 
     + ' - ' 
     + REPLICATE('0', 2 - LEN(@Hour+1)) + CONVERT(varchar(2), @Hour+1) + ']')
   SET @Second += 1;
  END
  SET @Second = 0;
  SET @Minute += 1;
 END
 SET @Minute = 0;
 SET @Hour += 1;
END
GO

SELECT * FROM dbo.[Time];


Wednesday, August 2, 2017

How to reset IDENTITY in Memory-Optimized Tables

Even though SEQUENCE object is available with much more flexibility, we still use IDENTITY property for adding sequence values to tables, specifically when we need to introduce a surrogate key. However, if you need the same with Memory-Optimized tables, you need to know certain things.

Can I add the IDENTITY property to Memory-Optimized tables?
Yes, it is possible but it should be always IDENTITY(1,1). You cannot use a different values for seed and increment, they should be always set as 1.

CREATE TABLE dbo.MemoryOptimizedTable
(
 Id int IDENTITY(1,1) NOT NULL primary key nonclustered,
 CurrentDate datetime NULL default (getdate())
)WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_AND_DATA )
GO

INSERT INTO dbo.MemoryOptimizedTable (CurrentDate) VALUES (DEFAULT);
GO 5

Can I reset the IDENTITY seed using DBCC CHECKIDENT?
No, this is not supported. If you need to reset the IDENTITY seed, only way is inserting a new value explicitly by turning SET IDENTITY_INSERT on. As shown in the example, once the record with value 100 is inserted, the next value of the seed is set to 100+1.

SET IDENTITY_INSERT dbo.MemoryOptimizedTable ON
GO
INSERT INTO dbo.MemoryOptimizedTable (Id, CurrentDate) VALUES (100, DEFAULT);
GO
SET IDENTITY_INSERT dbo.MemoryOptimizedTable OFF
GO

INSERT INTO dbo.MemoryOptimizedTable (CurrentDate) VALUES (DEFAULT);
GO

SELECT * FROM dbo.MemoryOptimizedTable;


What if insert a lower value explicitly?
It is possible as long as it does not violate any rules. Look at the code below. It inserts a record with value 50 explicitly. But it does not mean that the seed is getting reset to 50+1. This does not happen because the last generated value is greater than 50+1. Therefore, the value of next record is 102, not 51.

SET IDENTITY_INSERT dbo.MemoryOptimizedTable ON
GO
INSERT INTO dbo.MemoryOptimizedTable (Id, CurrentDate) VALUES (50, DEFAULT);
GO
SET IDENTITY_INSERT dbo.MemoryOptimizedTable OFF
GO

INSERT INTO dbo.MemoryOptimizedTable (CurrentDate) VALUES (DEFAULT);
GO

SELECT * FROM dbo.MemoryOptimizedTable;


Tuesday, August 1, 2017

Script for populating Date Dimension with Financial Year

Once I published the code I used for creating date dimension tables and populating data. But it had not handled finance year related date elements. Generally we handle financial year elements with Date Table with Star Schema not with Snowflake Schema. And in most cases (or old days), we usually use OLAP Data Warehouse  as the source for reporting and Analysis Services can easily handle financial dates with Start Schema implemented in the Relational Data Warehouse. However, in modern world, we try to make the solution just using the Relation Data Warehouse without using Multi-dimensional models (or using Tabular Models), thought make a another script for handling financial dates with Snowflake Schema.

With this script, four tables are created: Year, Quarter, Month and Date. If need to see elements related to the financial year, the link between Year and Quarter should be set with Year.YearKey and Quarter.FinancialYearKey. For calendar date elements, the link should be between Year.YearKey and Quarter.YearKey.

This is how you make a search for financial dates;


This is how you make a search for Calendar dates;


Here is the script for creating tables;

CREATE TABLE dbo.DimYear
(
 YearKey smallint 
 , Year smallint not null
 , Constraint pk_DimYear Primary Key (YearKey)
);
GO

CREATE TABLE dbo.DimQuarter
(
 QuarterKey smallint 
 , YearKey smallint not null
 , FinancialYearKey smallint not null
 , Quarter smallint not null
 , YearQuater varchar(20) not null
 , QuarterName varchar(20) not null
 , FinancialQuarter smallint not null
 , FinancialYearQuarter varchar(20) not null
 , Constraint pk_DimQuarter Primary Key (QuarterKey)
 , Constraint fk_DimQuarter_DimYear Foreign Key (YearKey)
  References dbo.DimYear (YearKey)
 , Constraint fk_DimQuarter_DimYear_Financial Foreign Key (FinancialYearKey)
  References dbo.DimYear (YearKey)
);
GO

CREATE TABLE dbo.DimMonth
(
 MonthKey int 
 , QuarterKey smallint not null
 , MonthNumber smallint not null
 , MonthName varchar(20) not null
 , YearMonth varchar(20) not null
 , MonthShortName char(3) not null
 , FinancialMonthNumber smallint not null
 , Constraint pk_DimMonth Primary Key (MonthKey)
 , Constraint fk_DimMonth_DimQuarter Foreign Key (QuarterKey)
  References dbo.DimQuarter (QuarterKey)
);
GO

CREATE TABLE dbo.DimDate
(
 DateKey int 
 , MonthKey int not null
 , Date date not null
 , WeekDayNumber smallint not null
 , WeekDayName varchar(20) not null
 , DayOfMonth smallint not null
 , DayOfYear smallint not null
 , IsWeekend bit not null
 , IsHoliday bit not null
 , WeekNumberOfYear smallint not null
 , Constraint pk_DimDate Primary Key (DateKey)
 , Constraint fk_DimDate_DimMonth Foreign Key (MonthKey)
  References dbo.DimMonth (MonthKey)
);
GO

Here is the script for populating dates;

SET DATEFIRST 1;

DECLARE @StartDate date = '1990-01-01';
DECLARE @EndDate date = '2019-12-31';
DECLARE @FinancialYearStartingQuarter smallint = 3; -- Starting from July, If it starts from April then 2

DECLARE @YearKey smallint;
DECLARE @Quarter smallint, @QuarterKey smallint;
DECLARE @Month int, @MonthKey int;
DECLARE @Date int, @DateKey int;

DECLARE @FinancialStartingMonth smallint;
SET @FinancialStartingMonth = CASE @FinancialYearStartingQuarter WHEN 1 THEN 1 WHEN 2 THEN 4 WHEN 3 THEN 7 WHEN 4 THEN 10 END

INSERT INTO dbo.DimYear 
 (YearKey, [Year]) 
VALUES 
 (YEAR(@StartDate) - 1, YEAR(@StartDate) - 1);


WHILE (@StartDate <= @EndDate)
BEGIN

 -- Inserting years
 SET @YearKey = YEAR(@StartDate);
 IF NOT EXISTS (SELECT * FROM dbo.DimYear WHERE YearKey = @YearKey)
  INSERT INTO dbo.DimYear (YearKey, [Year]) 
  VALUES 
  (@YearKey, @YearKey);


 -- Inserting quarters
 SET @QuarterKey = Convert(smallint, Convert(varchar(4), YEAR(@StartDate)) + Convert(varchar(1), DATEPART(q, @StartDate)))
 SET @Quarter = DATEPART(q, @StartDate);

 IF NOT EXISTS (SELECT * FROM dbo.DimQuarter WHERE QuarterKey = @QuarterKey)
 INSERT INTO dbo.DimQuarter 
  (QuarterKey, YearKey
  , FinancialYearKey
  , [Quarter], YearQuater, QuarterName
  , FinancialQuarter, FinancialYearQuarter) 
 VALUES (@QuarterKey, @YearKey
  , CASE WHEN @Quarter < @FinancialYearStartingQuarter THEN @YearKey -1 ELSE @YearKey END
  , @Quarter, Convert(varchar(4), YEAR(@StartDate)) + 'Q' + Convert(varchar(4), @Quarter)
  , CASE @Quarter WHEN 1 THEN 'First Quarter' WHEN 2 THEN 'Second Quarter' WHEN 3 THEN 'Third Quarter' ELSE 'Forth Quarter' END
  , CASE @Quarter WHEN 1 THEN 3 WHEN 2 THEN 4 WHEN 3 THEN 1 WHEN 4 THEN 2 END 
  , CASE @Quarter WHEN 1 THEN Convert(varchar(4), YEAR(@StartDate) - 1) + 'Q3'
      WHEN 2 THEN Convert(varchar(4), YEAR(@StartDate) - 1) + 'Q4'
      WHEN 3 THEN Convert(varchar(4), YEAR(@StartDate)) + 'Q1'
      WHEN 4 THEN Convert(varchar(4), YEAR(@StartDate)) + 'Q2'END
  );
  

 ---- Inserting months
 SET @MonthKey = Convert(int, Convert(varchar(4), YEAR(@StartDate)) + (CASE LEN(MONTH(@StartDate)) WHEN 1 THEN '0' ELSE '' END) + Convert(varchar(2), MONTH(@StartDate)));
 SET @Month = MONTH(@StartDate)
 IF NOT EXISTS (SELECT * FROM dbo.DimMonth WHERE MonthKey = @MonthKey)
  INSERT INTO dbo.DimMonth 
   (MonthKey, QuarterKey, MonthNumber, MonthName, YearMonth, MonthShortName
   , FinancialMonthNumber) 
  VALUES 
   (@MonthKey, @QuarterKey, @Month, DATENAME(MONTH, @StartDate)
   , Convert(varchar(4), YEAR(@StartDate)) + ' ' + LEFT(DATENAME(MONTH, @StartDate), 3)
   , LEFT(DATENAME(MONTH, @StartDate), 3)
   , CASE 
     WHEN @FinancialStartingMonth = 1 THEN @Month
     WHEN @FinancialStartingMonth = 4 AND @Month  < @FinancialStartingMonth THEN @Month + 9
     WHEN @FinancialStartingMonth = 4 AND @Month  >= @FinancialStartingMonth THEN @Month - 3
     WHEN @FinancialStartingMonth = 7 AND @Month  < @FinancialStartingMonth THEN @Month + 6
     WHEN @FinancialStartingMonth = 7 AND @Month  >= @FinancialStartingMonth THEN @Month - 6
     WHEN @FinancialStartingMonth = 10 AND @Month  < @FinancialStartingMonth THEN @Month + 6
     WHEN @FinancialStartingMonth = 10 AND @Month  >= @FinancialStartingMonth THEN @Month + 3
    END
   );
  
  ---- Inserting dates
  SET @DateKey = Convert(int, Convert(varchar(8), @StartDate, 112))
  IF NOT EXISTS (SELECT * FROM dbo.DimDate WHERE DateKey = @DateKey)
   INSERT INTO dbo.DimDate (DateKey, MonthKey, Date, WeekDayNumber, WeekDayName, DayOfMonth
    , DayOfYear, IsWeekend, IsHoliday, WeekNumberOfYear) 
   VALUES (@DateKey, @MonthKey, @StartDate, DatePart(WEEKDAY, @StartDate), DateName(WEEKDAY, @StartDate), DatePart(DAY, @StartDate)
    , DatePart(DAYOFYEAR, @StartDate), CASE WHEN DatePart(WEEKDAY, @StartDate) IN (6, 7) THEN 1 ELSE 0 END, 0
    , DatePart(WEEK, @StartDate));

  SET @StartDate = DATEADD(dd, 1, @StartDate);
END;
GO