Monday, August 14, 2017

Azure SQL Data Warehouse - Part II - Analyzing Created Databases

I started writing on this a long time back but had no chance of completing it. My first post on this was Azure SQL Data Warehouse - Part I - Creating and Connecting, and here is the second part of it.

The first post discussed Azure SQL Data Warehouse and how to create a server and database using the portal. As the next step, let's talk a bit about its architecture and then see how Azure SQL Data Warehouse maintains and processes data.

Control Node

Azure SQL Data Warehouse is a distributed database, which means that data is distributed across multiple locations. However, once the data warehouse is created, we connect to a single component called the Control Node. It is not exactly a SQL Server database, but when connecting to it, it looks and feels like connecting to a SQL Server database. The Control Node handles all communication and computation: when we make a request to the data warehouse, the Control Node accepts it, determines how it should be distributed using a divide-and-conquer approach, gets it processed and finally sends the result back to us.

Compute Node

The Control Node gets the data processed in parallel using Compute Nodes. They are SQL Server databases and store all our records. Based on the number of DWUs configured, the data warehouse is set up with one or more Compute Nodes.

Distribution

Data related to the data warehouse is stored in Azure Blob Storage and distributed across multiple locations. Storage is independent from the Compute Nodes, hence they can be operated and scaled independently. These locations are called Distributions. The number of distributions for an Azure SQL Data Warehouse is fixed at sixty (60). The distributions are assigned dynamically to Compute Nodes, and when a query is executed, each distribution processes the data it holds. This is how the parallel execution happens.

If you need more compute power, you can increase the number of Compute Nodes by increasing DWUs. When the number of Compute Nodes changes, the number of distributions per Compute Node changes as well.
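If you prefer T-SQL over the portal for scaling, it can be done with ALTER DATABASE. A minimal sketch, assuming a data warehouse named MyDW (a hypothetical name), executed against the master database:

-- Scale the data warehouse to 400 DWU
ALTER DATABASE MyDW MODIFY (SERVICE_OBJECTIVE = 'DW400');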

Architecture

This image shows the architecture when you create an Azure SQL Data Warehouse with 100 DWU.


This image shows the architecture when you create an Azure SQL Data Warehouse with 400 DWU.


Let's create two databases and clarify this configuration.

I have discussed all steps related to server and database creation in my first post, Azure SQL Data Warehouse - Part I - Creating and Connecting, hence I am not going to repeat them here. Note the image below; that is how I have created two databases for checking the configurations.


As you see, the first data warehouse is created with 100 DWUs and the second one with 400 DWUs. Let's see how nodes have been created for these two databases. We can simply use the sys.dm_pdw_nodes DMV for getting this information.

SELECT * FROM sys.dm_pdw_nodes;

SELECT type, Count(*) NofNodes
FROM sys.dm_pdw_nodes
GROUP BY type;

Result with the first data warehouse that is created using 100 DWUs.


Result with the second data warehouse that is created using 400 DWUs.


Note the second data warehouse. Since we used more DWUs, it has been created with four Compute Nodes, which gives better performance than the first one. Since this is a sample database with tables, we can check one of the tables and see how its data is spread across distributions.

The following code shows the distributions created for one data warehouse. As mentioned above, it is always 60.
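A query along these lines, based on the sys.pdw_distributions and sys.dm_pdw_nodes DMVs, returns the sixty distributions and the Compute Node each one is assigned to:

-- List all distributions with the node each one is currently assigned to
SELECT d.distribution_id, d.name, d.pdw_node_id, n.type
FROM sys.pdw_distributions d
 INNER JOIN sys.dm_pdw_nodes n
  ON d.pdw_node_id = n.pdw_node_id
ORDER BY d.distribution_id;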


Here is the code for seeing how the rows of a table are spread across distributions in the second data warehouse. Note how the distributions are assigned to Compute Nodes.
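As a sketch, assuming one of the sample tables is named dbo.DimCustomer (a hypothetical name), DBCC PDW_SHOWSPACEUSED returns the row count and space used per distribution, together with the node each distribution is assigned to:

-- Row count and space used per distribution for the given table
DBCC PDW_SHOWSPACEUSED ("dbo.DimCustomer");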


Records are distributed based on the design of the table. Azure SQL Data Warehouse uses two table distribution methods: Round Robin and Hash. Let's talk about them in the next post.

Friday, August 4, 2017

bcp error - Invalid column type from bcp client for colid 1 - Solution

You might be experiencing the following error with bcp when loading text data into a SQL Server table;


It is possible to see this error when there is a data type mismatch between a column in the destination table and a field coming from the file. However, there can also be a situation where you have used a view for the bcp operation and the underlying table has been changed. If so, you may see the same error.

In my case, my destination table has an identity column, hence I created a view for loading data via bcp. I started seeing this error after making some modifications to the table. Once the view was dropped and recreated, the load worked without any issue.
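As a rough sketch, assuming the view is named dbo.vw_SalesLoad (a hypothetical name), instead of dropping and recreating it you can also refresh its metadata with sp_refreshview:

-- Refresh the view's metadata so it picks up the modified table definition
EXEC sp_refreshview N'dbo.vw_SalesLoad';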

Thursday, August 3, 2017

Script for creating DimTime Dimension Table and populating data

The posts I made on the date dimension table (Script for populating Date Dimension with Financial Year and Loading Data Script for Snowflake type Date Dimension) had a good response, and a few asked about the same for a time-related table. This is one of the old scripts I used for populating a Time Dimension; it maintains values down to the second, but if you need more granularity, the script can easily be changed.

Here is the code for creating the table and populating data.

CREATE TABLE dbo.Time
(
 TimeKey int identity(1,1) 
 , Constraint pk_Time primary key (TimeKey)
 , [Hour] smallint not null
 , [24Hour] smallint not null
 , [Minute] smallint not null
 , [Second] smallint not null
 , [Time] char(10) not null
 , [24Time] char(8) not null
 , HourBucket char(9)
);
GO

DECLARE @Hour smallint = 0
DECLARE @Minute smallint = 0
DECLARE @Second smallint = 0

WHILE (@Hour < 24)
BEGIN

 WHILE (@Minute < 60)
 BEGIN

  WHILE (@Second < 60)
  BEGIN

   INSERT INTO dbo.[Time]
    ([Hour], [24Hour], [Minute], [Second], [Time], [24Time], HourBucket)
   VALUES
    (CASE WHEN @Hour > 12 THEN @Hour - 12 ELSE @Hour END
    , @Hour, @Minute, @Second
    , REPLICATE('0', 2 - LEN(CASE WHEN @Hour > 12 THEN @Hour - 12 ELSE @Hour END)) + CONVERT(varchar(2), CASE WHEN @Hour > 12 THEN @Hour - 12 ELSE @Hour END) 
     + ':' + REPLICATE('0', 2 - LEN(@Minute)) + CONVERT(varchar(2), @Minute) 
     + ':' + REPLICATE('0', 2 - LEN(@Second)) + CONVERT(varchar(2), @Second) 
     + CASE WHEN @Hour >= 12 THEN 'PM' ELSE 'AM' END
    , REPLICATE('0', 2 - LEN(@Hour)) + CONVERT(varchar(2), @Hour) 
     + ':' + REPLICATE('0', 2 - LEN(@Minute)) + CONVERT(varchar(2), @Minute) 
     + ':' + REPLICATE('0', 2 - LEN(@Second)) + CONVERT(varchar(2), @Second) 
    , '[' + REPLICATE('0', 2 - LEN(@Hour)) + CONVERT(varchar(2), @Hour) 
     + ' - ' 
     + REPLICATE('0', 2 - LEN(@Hour+1)) + CONVERT(varchar(2), @Hour+1) + ']')
   SET @Second += 1;
  END
  SET @Second = 0;
  SET @Minute += 1;
 END
 SET @Minute = 0;
 SET @Hour += 1;
END
GO

SELECT * FROM dbo.[Time];


Wednesday, August 2, 2017

How to reset IDENTITY in Memory-Optimized Tables

Even though the SEQUENCE object is available with much more flexibility, we still use the IDENTITY property for adding sequence values to tables, specifically when we need to introduce a surrogate key. However, if you need the same with Memory-Optimized tables, there are certain things you need to know.

Can I add the IDENTITY property to Memory-Optimized tables?
Yes, it is possible, but it must always be IDENTITY(1,1). You cannot use different values for the seed and increment; they must both be set to 1.

CREATE TABLE dbo.MemoryOptimizedTable
(
 Id int IDENTITY(1,1) NOT NULL primary key nonclustered,
 CurrentDate datetime NULL default (getdate())
)WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_AND_DATA )
GO

INSERT INTO dbo.MemoryOptimizedTable (CurrentDate) VALUES (DEFAULT);
GO 5

Can I reset the IDENTITY seed using DBCC CHECKIDENT?
No, this is not supported. If you need to reset the IDENTITY seed, the only way is to insert a new value explicitly by turning SET IDENTITY_INSERT on. As shown in the example, once the record with value 100 is inserted, the next generated value becomes 100+1.

SET IDENTITY_INSERT dbo.MemoryOptimizedTable ON
GO
INSERT INTO dbo.MemoryOptimizedTable (Id, CurrentDate) VALUES (100, DEFAULT);
GO
SET IDENTITY_INSERT dbo.MemoryOptimizedTable OFF
GO

INSERT INTO dbo.MemoryOptimizedTable (CurrentDate) VALUES (DEFAULT);
GO

SELECT * FROM dbo.MemoryOptimizedTable;


What if we insert a lower value explicitly?
It is possible as long as it does not violate any rules. Look at the code below: it inserts a record with value 50 explicitly, but that does not mean the seed is reset to 50+1. That does not happen because the last generated value is greater than 50+1. Therefore, the value of the next record is 102, not 51.

SET IDENTITY_INSERT dbo.MemoryOptimizedTable ON
GO
INSERT INTO dbo.MemoryOptimizedTable (Id, CurrentDate) VALUES (50, DEFAULT);
GO
SET IDENTITY_INSERT dbo.MemoryOptimizedTable OFF
GO

INSERT INTO dbo.MemoryOptimizedTable (CurrentDate) VALUES (DEFAULT);
GO

SELECT * FROM dbo.MemoryOptimizedTable;


Tuesday, August 1, 2017

Script for populating Date Dimension with Financial Year

I previously published the code I used for creating date dimension tables and populating data, but it did not handle financial-year related date elements. Generally, we handle financial-year elements with a Date table in a Star Schema rather than a Snowflake Schema. In most cases (or in the old days), we used an OLAP Data Warehouse as the source for reporting, and Analysis Services can easily handle financial dates with a Star Schema implemented in the Relational Data Warehouse. However, in the modern world, we often build the solution using just the Relational Data Warehouse without Multi-dimensional models (or using Tabular Models), so I thought to make another script for handling financial dates with a Snowflake Schema.

With this script, four tables are created: Year, Quarter, Month and Date. If you need to see elements related to the financial year, the link between Year and Quarter should be made with Year.YearKey and Quarter.FinancialYearKey. For calendar date elements, the link should be between Year.YearKey and Quarter.YearKey.

This is how you make a search for financial dates;
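Based on the table scripts below, a query for financial-year elements would look something like this sketch; note the join on FinancialYearKey:

SELECT y.[Year] AS FinancialYear, q.FinancialQuarter, m.FinancialMonthNumber, d.Date
FROM dbo.DimYear y
 INNER JOIN dbo.DimQuarter q ON y.YearKey = q.FinancialYearKey
 INNER JOIN dbo.DimMonth m ON q.QuarterKey = m.QuarterKey
 INNER JOIN dbo.DimDate d ON m.MonthKey = d.MonthKey
WHERE y.[Year] = 2017;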


This is how you make a search for Calendar dates;
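For calendar elements, the same sketch changes only the join column between DimYear and DimQuarter:

SELECT y.[Year] AS CalendarYear, q.[Quarter], m.MonthNumber, d.Date
FROM dbo.DimYear y
 INNER JOIN dbo.DimQuarter q ON y.YearKey = q.YearKey
 INNER JOIN dbo.DimMonth m ON q.QuarterKey = m.QuarterKey
 INNER JOIN dbo.DimDate d ON m.MonthKey = d.MonthKey
WHERE y.[Year] = 2017;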


Here is the script for creating tables;

CREATE TABLE dbo.DimYear
(
 YearKey smallint 
 , Year smallint not null
 , Constraint pk_DimYear Primary Key (YearKey)
);
GO

CREATE TABLE dbo.DimQuarter
(
 QuarterKey smallint 
 , YearKey smallint not null
 , FinancialYearKey smallint not null
 , Quarter smallint not null
 , YearQuarter varchar(20) not null
 , QuarterName varchar(20) not null
 , FinancialQuarter smallint not null
 , FinancialYearQuarter varchar(20) not null
 , Constraint pk_DimQuarter Primary Key (QuarterKey)
 , Constraint fk_DimQuarter_DimYear Foreign Key (YearKey)
  References dbo.DimYear (YearKey)
 , Constraint fk_DimQuarter_DimYear_Financial Foreign Key (FinancialYearKey)
  References dbo.DimYear (YearKey)
);
GO

CREATE TABLE dbo.DimMonth
(
 MonthKey int 
 , QuarterKey smallint not null
 , MonthNumber smallint not null
 , MonthName varchar(20) not null
 , YearMonth varchar(20) not null
 , MonthShortName char(3) not null
 , FinancialMonthNumber smallint not null
 , Constraint pk_DimMonth Primary Key (MonthKey)
 , Constraint fk_DimMonth_DimQuarter Foreign Key (QuarterKey)
  References dbo.DimQuarter (QuarterKey)
);
GO

CREATE TABLE dbo.DimDate
(
 DateKey int 
 , MonthKey int not null
 , Date date not null
 , WeekDayNumber smallint not null
 , WeekDayName varchar(20) not null
 , DayOfMonth smallint not null
 , DayOfYear smallint not null
 , IsWeekend bit not null
 , IsHoliday bit not null
 , WeekNumberOfYear smallint not null
 , Constraint pk_DimDate Primary Key (DateKey)
 , Constraint fk_DimDate_DimMonth Foreign Key (MonthKey)
  References dbo.DimMonth (MonthKey)
);
GO

Here is the script for populating dates;

SET DATEFIRST 1;

DECLARE @StartDate date = '1990-01-01';
DECLARE @EndDate date = '2019-12-31';
DECLARE @FinancialYearStartingQuarter smallint = 3; -- Starting from July, If it starts from April then 2

DECLARE @YearKey smallint;
DECLARE @Quarter smallint, @QuarterKey smallint;
DECLARE @Month int, @MonthKey int;
DECLARE @Date int, @DateKey int;

DECLARE @FinancialStartingMonth smallint;
SET @FinancialStartingMonth = CASE @FinancialYearStartingQuarter WHEN 1 THEN 1 WHEN 2 THEN 4 WHEN 3 THEN 7 WHEN 4 THEN 10 END

INSERT INTO dbo.DimYear 
 (YearKey, [Year]) 
VALUES 
 (YEAR(@StartDate) - 1, YEAR(@StartDate) - 1);


WHILE (@StartDate <= @EndDate)
BEGIN

 -- Inserting years
 SET @YearKey = YEAR(@StartDate);
 IF NOT EXISTS (SELECT * FROM dbo.DimYear WHERE YearKey = @YearKey)
  INSERT INTO dbo.DimYear (YearKey, [Year]) 
  VALUES 
  (@YearKey, @YearKey);


 -- Inserting quarters
 SET @QuarterKey = Convert(smallint, Convert(varchar(4), YEAR(@StartDate)) + Convert(varchar(1), DATEPART(q, @StartDate)))
 SET @Quarter = DATEPART(q, @StartDate);

 IF NOT EXISTS (SELECT * FROM dbo.DimQuarter WHERE QuarterKey = @QuarterKey)
 INSERT INTO dbo.DimQuarter 
  (QuarterKey, YearKey
  , FinancialYearKey
  , [Quarter], YearQuarter, QuarterName
  , FinancialQuarter, FinancialYearQuarter) 
 VALUES (@QuarterKey, @YearKey
  , CASE WHEN @Quarter < @FinancialYearStartingQuarter THEN @YearKey -1 ELSE @YearKey END
  , @Quarter, Convert(varchar(4), YEAR(@StartDate)) + 'Q' + Convert(varchar(4), @Quarter)
  , CASE @Quarter WHEN 1 THEN 'First Quarter' WHEN 2 THEN 'Second Quarter' WHEN 3 THEN 'Third Quarter' ELSE 'Fourth Quarter' END
  , CASE @Quarter WHEN 1 THEN 3 WHEN 2 THEN 4 WHEN 3 THEN 1 WHEN 4 THEN 2 END 
  , CASE @Quarter WHEN 1 THEN Convert(varchar(4), YEAR(@StartDate) - 1) + 'Q3'
      WHEN 2 THEN Convert(varchar(4), YEAR(@StartDate) - 1) + 'Q4'
      WHEN 3 THEN Convert(varchar(4), YEAR(@StartDate)) + 'Q1'
      WHEN 4 THEN Convert(varchar(4), YEAR(@StartDate)) + 'Q2' END
  );
  

 ---- Inserting months
 SET @MonthKey = Convert(int, Convert(varchar(4), YEAR(@StartDate)) + (CASE LEN(MONTH(@StartDate)) WHEN 1 THEN '0' ELSE '' END) + Convert(varchar(2), MONTH(@StartDate)));
 SET @Month = MONTH(@StartDate)
 IF NOT EXISTS (SELECT * FROM dbo.DimMonth WHERE MonthKey = @MonthKey)
  INSERT INTO dbo.DimMonth 
   (MonthKey, QuarterKey, MonthNumber, MonthName, YearMonth, MonthShortName
   , FinancialMonthNumber) 
  VALUES 
   (@MonthKey, @QuarterKey, @Month, DATENAME(MONTH, @StartDate)
   , Convert(varchar(4), YEAR(@StartDate)) + ' ' + LEFT(DATENAME(MONTH, @StartDate), 3)
   , LEFT(DATENAME(MONTH, @StartDate), 3)
   , CASE 
     WHEN @FinancialStartingMonth = 1 THEN @Month
     WHEN @FinancialStartingMonth = 4 AND @Month  < @FinancialStartingMonth THEN @Month + 9
     WHEN @FinancialStartingMonth = 4 AND @Month  >= @FinancialStartingMonth THEN @Month - 3
     WHEN @FinancialStartingMonth = 7 AND @Month  < @FinancialStartingMonth THEN @Month + 6
     WHEN @FinancialStartingMonth = 7 AND @Month  >= @FinancialStartingMonth THEN @Month - 6
     WHEN @FinancialStartingMonth = 10 AND @Month  < @FinancialStartingMonth THEN @Month + 3
     WHEN @FinancialStartingMonth = 10 AND @Month  >= @FinancialStartingMonth THEN @Month - 9
    END
   );
  
  ---- Inserting dates
  SET @DateKey = Convert(int, Convert(varchar(8), @StartDate, 112))
  IF NOT EXISTS (SELECT * FROM dbo.DimDate WHERE DateKey = @DateKey)
   INSERT INTO dbo.DimDate (DateKey, MonthKey, Date, WeekDayNumber, WeekDayName, DayOfMonth
    , DayOfYear, IsWeekend, IsHoliday, WeekNumberOfYear) 
   VALUES (@DateKey, @MonthKey, @StartDate, DatePart(WEEKDAY, @StartDate), DateName(WEEKDAY, @StartDate), DatePart(DAY, @StartDate)
    , DatePart(DAYOFYEAR, @StartDate), CASE WHEN DatePart(WEEKDAY, @StartDate) IN (6, 7) THEN 1 ELSE 0 END, 0
    , DatePart(WEEK, @StartDate));

  SET @StartDate = DATEADD(dd, 1, @StartDate);
END;
GO

Monday, July 31, 2017

SQL Server bcp error - String data, right truncation

We generally expect a proper row terminator in files loaded using bcp, and with all switches accurately defined, data should load without any issue. But sometimes you may see issues like the one below even if you try switches like -r "\r\n" or -r "\n".


In most cases, the reason is the row terminator. If you open the file to be loaded using something like Notepad++ and enable Show All Characters (the icon is on the toolbar), you should see the exact row terminator.


If you see something like the above, then you need to specify the row terminator as a hexadecimal value. In this case, if I add -r "0x0a", my data loading works without any issue.
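For example, a command along these lines would load the file once the row terminator is given as a hexadecimal value; the server, database, table and file names here are just placeholders:

bcp MyDatabase.dbo.SalesStaging in "D:\Data\Sales.txt" -S MyServer -T -c -t "," -r "0x0a"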

Tuesday, July 11, 2017

Power BI supports numeric range slicer now

This is not something released this month; it was released in March 2017. This new capability of the Power BI slicer is available, but many do not use it because it is still in Preview and has some limitations. Let's see how useful it is and what sort of limitations it has.

As you see below, I have created a Power BI report connecting to a local AdventureWorks2014 database using DirectQuery mode. This is the view I used as the source;

USE AdventureWorks2014;
GO

CREATE VIEW dbo.SalesByCustomer
AS
SELECT 
 p.LastName + ' ' + p.FirstName Customer
 , t.Name
 , SUM([SubTotal]) Amount
FROM [Sales].[SalesOrderHeader] s
 INNER JOIN Sales.Customer c
  ON s.CustomerID = c.CustomerID
 INNER JOIN Person.Person p
  ON p.BusinessEntityID = c.PersonID
 INNER JOIN Sales.SalesTerritory t
  ON s.TerritoryID = t.TerritoryID
GROUP BY p.LastName + ' ' + p.FirstName, t.Name;

The table in the report is created with Customer and Amount. Then two slicers are added, one using Amount and the other using Name (Territory).


You will not see the numeric range slicer unless you have enabled it. To enable it, go to File -> Options and settings -> Options -> Preview features and check Numeric range slicer.


Once it is enabled, whenever a numeric value is dragged into a slicer, the range slicer appears automatically.


Multiple options such as Less than or equal to and Greater than or equal to are provided for filtering based on the values in the input boxes.

We can use this without any issue but you will face a limitation if you publish this to Power BI Service and view it.


As you see, this feature is still not available in Power BI Service. There are two more things to remember about this feature: 1) measures created in the model or measures in Analysis Services models cannot be used with it, and 2) it filters the rows that come from the source, not the aggregated data shown in visuals.

Sunday, June 25, 2017

Database Normalization - 1NF, 2NF, 3NF, BCNF, 4NF and 5NF with examples

Normalization is a process of identifying the optimal grouping (relations, at the end) of attributes that satisfies the data requirements of an organization. It is a database design technique we use after completing ER modeling. The process identifies relationships between attributes (called functional dependencies) and applies a series of tests described as normal forms.

There are many articles written on this, and you can find examples for almost all normal forms. However, many articles explain the theory with the same scenario, hence I thought to make a post with a different set of examples that I use for my lectures.

As explained above, the purpose of normalization is to identify the best grouping of attributes that ultimately forms relations. Some of the characteristics of the relations formed are;
  • Support of data requirements with minimal number of attributes
  • Relation holds attributes with a close logical relationship (functional dependency)
  • Relation holds minimal redundancy with each attribute (except foreign keys)
    • increasing the performance of updates
    • reducing the storage consumption
    • avoiding update anomalies (insertion, modification and deletion)

Functional Dependency

Let's try to understand functional dependency first. It speaks about the relationship between attributes in a relation. For example, if EmployeeCode and FirstName are attributes of the Employee relation, we can say that FirstName is functionally dependent on EmployeeCode. This means each EmployeeCode is associated with exactly one value of FirstName. We denote this as;

EmployeeCode  -> FirstName

Remember, in the above relationship, we call EmployeeCode the determinant. Basically, the left-hand side of the arrow is considered the determinant. The relationship from left to right is always one to one (1:1).

If the right-hand attribute is fully dependent on the left-hand side, we call this a full functional dependency. If the left-hand side is a composite one (two or more attributes) and the right-hand side can be determined by part of the left-hand side, then the dependency is considered a partial dependency (you will see an example of it later).

Why do we need to identify functional dependencies? One of the reasons is identifying the best candidate for the primary key. Once functional dependencies are identified, we can analyze them all and select the most suitable determinant as the primary key.

First Normal Form (1NF)

The definition goes as: A relation in which the intersection of each row and column contains one and only one value. Let's try to understand this with an example. The following table shows an identified relation for Student Registration for courses. As you see, a tuple represents a registration done on a particular date.


In order to make sure that the relation is normalized for 1NF, we need to make sure that;
  • No multiple values in intersection of each row and column
  • No repeatable groups in attributes (like Course1, Course2, Course3... columns)
  • The order of attributes and tuples is insignificant
  • No duplicate tuples in the relation.
You can see that the Course attribute has multiple values, which violates 1NF. There are multiple ways of addressing this, but if I need to handle it without decomposing the relation, I can organize my tuples as below.


Since the relation has no multiple values in intersections and no repeatable groups, it is now a 1NF relation.

Second Normal Form (2NF)

The definition of second normal form is: A relation that is in First Normal Form and every non-primary-key attribute is fully functionally dependent on the primary key. What it says is that there should not be a partial dependency between the primary key and non-primary-key attributes.

Let's try to set the primary key for above table. For that, let's list out some functional dependencies;

StudentCode, Course  ->  Name, Town, Province, DateRegistered
StudentCode  ->  Name
Town  ->  Province

Considering the identified functional dependencies above, I can easily pick the first one, that is StudentCode, Course, as my primary key because the combination can be used to identify a tuple uniquely.


Okay, now the primary key is StudentCode+Course. However, we know that the StudentCode -> Name relationship still exists. This means that Name can be determined by part of the primary key, which is a partial dependency. This violates the second normal form.

We can now decompose the relation into two, making sure that the resulting relations do not violate 2NF.
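As a rough SQL sketch of the decomposition (data types are assumed), the two resulting relations could look like this:

CREATE TABLE dbo.Student
(
 StudentCode varchar(10) not null
 , Name varchar(50) not null
 , Town varchar(50) not null
 , Province varchar(50) not null
 , Constraint pk_Student Primary Key (StudentCode)
);

CREATE TABLE dbo.Registration
(
 StudentCode varchar(10) not null
 , Course varchar(50) not null
 , DateRegistered date not null
 , Constraint pk_Registration Primary Key (StudentCode, Course)
 , Constraint fk_Registration_Student Foreign Key (StudentCode)
  References dbo.Student (StudentCode)
);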


Note that you will not see violation of 2NF if the primary key is based on just one attribute.


Third Normal Form (3NF)

This normal form speaks about transitive dependency. The definition goes as A relation that is in First and Second Normal form and in which no non-primary-key attribute is transitively dependent on the primary key.

This says that we should remove transitive dependencies if they exist. What is a transitive dependency? It is a condition such as this: in the Student relation, StudentCode determines the Town (StudentCode -> Town; there is only one Town associated with a given StudentCode) and Town determines the Province (Town -> Province), therefore StudentCode determines Province (note that, as per this relation, StudentCode determines Province, but the issue is that it can also be determined by Town). This is a transitive dependency. In other words, if you see that attribute A determines B (A -> B) and B determines C (B -> C), then A determines C (A -> C).

To remove the transitive dependency, we need to decompose the relation.
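Continuing the rough sketch, the Student relation above splits further, with Town becoming its own relation (types are assumed):

CREATE TABLE dbo.Town
(
 Town varchar(50) not null
 , Province varchar(50) not null
 , Constraint pk_Town Primary Key (Town)
);

CREATE TABLE dbo.Student
(
 StudentCode varchar(10) not null
 , Name varchar(50) not null
 , Town varchar(50) not null
 , Constraint pk_Student Primary Key (StudentCode)
 , Constraint fk_Student_Town Foreign Key (Town)
  References dbo.Town (Town)
);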


Boyce-Codd Normal Form (BCNF / 3.5NF)

This is an extension of 3NF, and it is sometimes treated as 3.5NF. It makes 3NF stronger by making sure that every determinant among the identified functional dependencies is a candidate key. The definition goes as: A relation is in BCNF if, and only if, every determinant is a candidate key.

What does it exactly mean? You have already seen that we can identify many functional dependencies in a relation, and we pick one for defining the primary key. The determinants of the other identified functional dependencies may or may not be qualified as candidate keys for the primary key. If all determinants are qualified, meaning any of them could be marked as the primary key if needed, then your relation (table) is in BCNF.

Take this example.


Assume that business rules related to this relation are as follows;
  1. Course has one or more subjects.
  2. Course is managed by one or more lecturers.
  3. Subject is taught by one or more lecturers.
  4. Lecturer teaches only one subject.
If you consider the primary key of this table to be Course + Subject, then there is no violation of 1NF, 2NF or 3NF. Let's list out all possible functional dependencies.
  1. Course, Subject  ->  Lecturer
  2. Course, Lecturer  ->  Subject
  3. Lecturer  ->  Subject
Now, based on the identified functional dependencies, see whether you can make the determinants candidate keys. If you take the first one, we can clearly say that Course + Subject is a candidate key. The second one, Course + Lecturer, is also a candidate key as we can identify tuples uniquely using it. However, the determinant of the third one cannot be used as a candidate key because it has duplicates; you cannot make Lecturer a primary key. Now you have a determinant that cannot be set as a primary key, hence the relation violates BCNF.

In order to make this a BCNF table, we need to decompose it as below.
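A rough SQL sketch of the decomposition (types assumed), keeping the Lecturer -> Subject dependency in its own relation:

CREATE TABLE dbo.LecturerSubject
(
 Lecturer varchar(50) not null
 , Subject varchar(50) not null
 , Constraint pk_LecturerSubject Primary Key (Lecturer)
);

CREATE TABLE dbo.CourseLecturer
(
 Course varchar(50) not null
 , Lecturer varchar(50) not null
 , Constraint pk_CourseLecturer Primary Key (Course, Lecturer)
 , Constraint fk_CourseLecturer_LecturerSubject Foreign Key (Lecturer)
  References dbo.LecturerSubject (Lecturer)
);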


Fourth Normal Form (4NF)

This normal form handles multi-valued dependencies caused by 1NF. When we see repeated groups or multiple values in an intersection, we add additional tuples to remove the multiple values; that is what we do for 1NF. When there are two multi-valued attributes in a relation, each value of one attribute has to be repeated with every value of the other attribute. This situation is referred to as a multi-valued dependency. See the relation below;


If we apply 1NF to this relation;


The definition of a multi-valued dependency goes as: a dependency between attributes in a relation such that, for each value of A, there is a set of values for B and a set of values for C; however, the sets of values for B and C are independent of each other. This dependency is denoted as A ->> B.

See the CustomerContacts table. CustomerCode determines multiple Telephone values (CustomerCode ->> Telephone) and CustomerCode determines multiple Address values (CustomerCode ->> Address).

The fourth normal form is described as: A relation that is in Boyce-Codd Normal Form and does not contain nontrivial multi-valued dependencies. This talks about one type of multi-valued dependency, the nontrivial one. A multi-valued dependency A ->> B is trivial if B is a subset of A or A U B = R; otherwise it is nontrivial. As you see, CustomerContacts contains nontrivial dependencies, hence we need to decompose the table as below.
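A rough SQL sketch of the decomposition (types assumed):

CREATE TABLE dbo.CustomerTelephone
(
 CustomerCode varchar(10) not null
 , Telephone varchar(20) not null
 , Constraint pk_CustomerTelephone Primary Key (CustomerCode, Telephone)
);

CREATE TABLE dbo.CustomerAddress
(
 CustomerCode varchar(10) not null
 , Address varchar(200) not null
 , Constraint pk_CustomerAddress Primary Key (CustomerCode, Address)
);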


Fifth Normal Form (5NF)

In order to normalize relations, we decompose them into multiple relations. Although the divided relations optimize transactions and avoid anomalies, they add a cost to data retrieval as the relations have to be rejoined. The biggest risk with rejoining is producing inaccurate output under certain conditions.

When we decompose a relation into two relations, the resulting relations have a property called lossless-join that makes sure rejoining the two relations produces the original relation. The definition of lossless-join is: A property of decomposition which ensures that no spurious tuples are generated when relations are reunited through a natural join operation.

Now let's try to understand Fifth Normal Form. Decomposing a relation into multiple relations to minimize redundancy might introduce a join dependency, which can create spurious tuples when the relations are reunited. The definition of a join dependency goes as: for a relation R with subsets of the attributes of R denoted as A, B, ..., Z, the relation R satisfies a join dependency if and only if every legal value of R is equal to the join of its projections on A, B, ..., Z. Considering this, the definition of Fifth Normal Form goes as: A relation that has no join dependency.

Since this is very rare to see in database design, let's try to understand it with an example. See the following table that shows how Lecturers teach Subjects related to Courses.


Assume that the tuples are formed based on the following scenario;
  • Whenever Lecturer L1 teaches Subject S1,
  • and Course C1 has Subject S1,
  • and Lecturer L1 teaches at least one subject in Course C1,
  • then Lecturer L1 will be teaching Subject S1 in Course C1.
Note that we use this scenario only for explaining 5NF; otherwise you will not see the issue properly.

Now, if I try to decompose this relation into two relations to minimize redundant data, I will have these two tables (sequence numbers are added only for understanding the joins);



Now, if I rejoin these with a Natural Join, this will be the result.


See the highlighted one. It is the result of the join dependency: a spurious tuple which is not valid. In order to avoid it, and to make the relations 5NF relations, let's introduce another relation like the one below;
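As a rough SQL sketch, the additional relation would look like this (types assumed):

CREATE TABLE dbo.CourseSubject
(
 Course varchar(50) not null
 , Subject varchar(50) not null
 , Constraint pk_CourseSubject Primary Key (Course, Subject)
);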


Now, if we rejoin all three, we will not see any issue.


Therefore, in order to minimize redundant data and avoid join dependencies, make sure relations are formed as 5NF relations.

Wednesday, June 21, 2017

Power BI Parameters - What are they and when they can be used?

Microsoft Power BI supports adding parameters to queries and using them in various areas. However, since parameters have some limitations and cannot be used in all areas, they have not been noticed much. Let's talk about Power BI parameters and see how to use them.

Let's take a scenario and see how parameters can help. I have a set of CSV files that show monthly sales. We need to create a report using one of the files and then publish it. Whenever the report needs to be refreshed with a different file, I need to change the data source by changing many things. What if I could just open the report using Power BI Desktop, change the source using a menu and get the report refreshed rather than changing many things?

It can be done with Power BI Parameters. We can create a parameter with predefined values (in this case, source file names along with paths) and use it for changing the source.

Power BI Parameter can be created in three ways;
  • Parameter that accepts a value via an input box
  • Parameter that is formed using a predefined list
  • Parameter that is formed using another query
Parameter with a predefined list
Let's focus on the second one first. Here are the steps for creating a Power BI report with a parameter;

1. Open Power BI and connect with a CSV file. In my case, the file name is Sales_201501.csv.


2. Once the file is loaded, click Edit Queries in the Home ribbon to open the Query Editor. You should see one query under Queries.

Parameters can be created only with Query Editor. However, created parameters can be accessed in both Query Editor and Data Model.

3. Click Manage Parameters in Home ribbon. Once the dialog box is opened, click New to create a new Parameter.

4. Name the parameter as SourceFile

5. Select Text for Type.

6. Select List of values for Suggested values.

7. Enter File Names along with paths.

8. Set Default Value as you need. This is the default value for the designer.

9. Set Current Value as you need. This setting is for the end user, or to use with the report level.


Now you should see the parameter in Queries. If you want to open it as another table in the data model, you can right click on it and select Enable Load. We do not need to enable this option for using the parameter at report level.

Now we need to make sure that the source of our file is set with the created parameter. There are two ways of doing it: one is to open the Advanced Editor and change the M code; the other way is using Data source settings. Let's use the second option. If you want to know how the M code gets changed, open the Advanced Editor and check it after the change.

10. Click Data source settings in the Home ribbon. This opens the Data Source Setting dialog box.


11. Select the source and click Change Source.... This opens a dialog box that allows you to change settings related to your source.

12. You should notice that the button given for File Path has an option for creating parameters and selecting a parameter.


13. Select Parameter as the option and select the parameter created earlier.


14. Click OK and close the Data source settings.

15. All set. Click Close & Apply for saving the queries and closing Query Editor.

16. Create a visual so you can distinguish the results when a different file is selected. This is what I created from my file.


Let's change the source using the parameter now.

17. Click the down-arrow on Edit Queries in the Home ribbon. You should see three options; select the last one, Edit Parameter.


18. You should see the Enter Parameters dialog box that shows all parameters. Change the source file to a different one.


19. Once you click OK, you should be prompted to apply the changes. Apply them and you should see the report with the new source.


In the same way, we can use the parameter in other areas such as filtering rows in the Query Editor, replacing values, etc. We still cannot use parameters with DAX, and that is one of the limitations.

Saturday, June 17, 2017

Relational Algebra - Joins - Theta Join, Equijoin, Natural Join, Outer Join, Semijoin

I wrote a post on Relational Algebra that discusses most of the operations related to it. This is the continuation of it, and it explains the Join operations of Relational Algebra. You may find that they are different from the joins we do now, but they are the foundation for all types of joins we do with our relational databases.

Join Operations

Join is one of the main operations we perform against relations (tables) for retrieving data. It is done over the Cartesian Product of the two operand relations, using a Selection with a predicate. We are familiar with joins like INNER JOIN, OUTER JOIN and CROSS JOIN, but initially joins were introduced with types such as Theta Join, Equijoin, Natural Join, Outer Join and Semijoin. Modern DBMSs have enhanced these and have different implementations, and that is why we do not see these types under the mentioned names. Let's try to understand each of these types and how they are represented in Relational Algebra. Note that I have used particular SQL statements, but each can be implemented using many techniques. Remember, almost all joins are based on Cartesian Products.

Theta Join

This is based on a predicate added to a Cartesian Product. In simple terms, if you have joined two tables using CROSS JOIN, then you can add a filter to the result using one of the comparison operators. See the example given. Note that it can be expressed as a Selection over a Cartesian Product as well.
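As an illustration in SQL, using hypothetical Employee and Grade tables, a Theta Join is a Cartesian Product restricted by a comparison predicate:

-- Theta join: Cartesian product restricted by a non-equality comparison
SELECT e.EmployeeCode, e.Salary, g.GradeName
FROM dbo.Employee e
 CROSS JOIN dbo.Grade g
WHERE e.Salary >= g.MinimumSalary;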


Equijoin

This is the same as a Theta Join, but the comparison operator is equality. Generally, if the operator of the Theta Join is the equality operator (=), then the join is called an Equijoin instead of a Theta Join. Here are two examples;
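Sketched with hypothetical Employee and Department tables, the only difference is that the predicate uses the equality operator:

-- Equijoin: the predicate is an equality comparison on DepartmentCode
SELECT e.EmployeeCode, e.Name, d.DepartmentName
FROM dbo.Employee e
 INNER JOIN dbo.Department d
  ON e.DepartmentCode = d.DepartmentCode;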



Natural Join

A Natural Join is an Equijoin of two relations over all common attributes. In other words, when joining two tables, the join is done using all common columns; therefore, an explicit predicate is not required. See the sample given. I have used NATURAL JOIN, which is not available in some DBMSs. Note that the common attributes are not duplicated.
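A sketch in standard SQL (NATURAL JOIN is not supported by SQL Server), again with hypothetical Employee and Department tables; the join is made over all common columns, here DepartmentCode, and the common column appears only once in the result:

-- Natural join: joins over all common attributes without an explicit predicate
SELECT EmployeeCode, Name, DepartmentCode, DepartmentName
FROM Employee
 NATURAL JOIN Department;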


Outer Join

This join type includes both matching and non-matching tuples from one relation and matching tuples from the other relation when two relations are joined. The relation that returns all its tuples is determined by the symbol used for the operation: if the symbol opens towards the left relation, that relation is the one that returns all tuples. This is implemented using either LEFT or RIGHT OUTER JOIN in SQL.
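A sketch with the same hypothetical tables; the left relation returns all of its tuples whether or not a match exists:

-- Left outer join: every employee is returned, with NULLs where no matching department exists
SELECT e.EmployeeCode, e.Name, d.DepartmentName
FROM dbo.Employee e
 LEFT OUTER JOIN dbo.Department d
  ON e.DepartmentCode = d.DepartmentCode;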


Semijoin

Here is the last one. This join performs a join operation over two relations and projects over the attributes of the first operand (relation). With this join, the tuples participating in the join can be limited by adding a predicate, increasing the performance of the join operation.
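A semijoin can be sketched in SQL with EXISTS, projecting only the attributes of the first relation (hypothetical Employee and Department tables again):

-- Semijoin: returns attributes of Employee only, for employees that have a matching department
SELECT e.EmployeeCode, e.Name
FROM dbo.Employee e
WHERE EXISTS (SELECT * FROM dbo.Department d WHERE d.DepartmentCode = e.DepartmentCode);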

 

Wednesday, June 14, 2017

Power BI - Scheduled Refresh section is not available in Settings

Power BI Desktop allows us to import data from files such as CSV and Excel. Once data is imported, it automatically creates a Model that can be used for creating reports. The good thing is, when we publish the report, the Dataset created with the model is also published. This means we can create a schedule and refresh the published Dataset to see the latest data.

See the image below. It shows the Gateway connection and Scheduled refresh settings for an uploaded text file. The Dataset created for the file is PowerBI Test Report.


Now if I get the settings related to the Dataset called Sales_201501;


I will not see the same settings. What is the reason? It is also based on a text file.

This happens when the uploaded Dataset is not a Model. This file has been uploaded using Power BI Service (app.powerbi.com), not using Power BI Desktop. Unlike Power BI Desktop, Power BI Service does not create a Model when a file is uploaded; the uploaded file is still treated as a Dataset but not as a Model.

If you need to get your Dataset continuously refreshed by creating a schedule, then make sure that it is created using Power BI Desktop.

You can get the uploaded file refreshed automatically if the file is in either SharePoint or OneDrive.