Sunday, June 25, 2017

Database Normalization - 1NF, 2NF, 3NF, BCNF, 4NF and 5NF with examples

Normalization is a process of identifying the optimal grouping (relations, at the end) for attributes that satisfies the data requirements of an organization. It is a database design technique we use after completing ER modeling. This process identifies relationships between attributes (called functional dependencies) and applies a series of tests described as normal forms.

There are many articles written on this and you can find examples for almost all normal forms. However, many articles explain the theory with the same scenario, hence I thought to make a post with a different set of examples that I use for my lectures.

As explained above, the purpose of normalization is to identify the best grouping for attributes that ultimately forms relations. Some of the characteristics of relations formed are;
  • Support of data requirements with a minimal number of attributes
  • Relation holds attributes with a close logical relationship (functional dependency)
  • Relation holds minimal redundancy in each attribute (except foreign keys)
    • increasing the performance of updates
    • reducing the storage consumption
    • avoiding update anomalies (insertion, modification and deletion)

Functional Dependency

Let's try to understand functional dependency first. This speaks about the relationship between attributes in a relation. For example, if EmployeeCode and FirstName are attributes of the Employee relation, we can say that FirstName is functionally dependent on EmployeeCode. This means each EmployeeCode is associated with exactly one value of FirstName. We denote this like;

EmployeeCode  -> FirstName

Remember, in the above relationship, we call EmployeeCode the determinant. Basically, the left-hand side of the arrow is considered the determinant. The relationship from left to right is always one to one (1:1).

If the right-hand attribute is fully dependent on the left-hand side, we call this dependency a full functional dependency. If the left-hand side is a composite one (two or more attributes) and the right-hand side can be determined by part of the left-hand side, then the dependency is considered a partial dependency (you will see an example of it later).

Why do we need to identify functional dependencies? One of the reasons is identifying the best candidate for the primary key. Once functional dependencies are identified, we can analyze them all and select the most suitable determinant as the primary key.
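Before moving on, note that a suspected functional dependency can be sanity-checked against existing data. Here is a minimal sketch in T-SQL, assuming a hypothetical Employee table; any row returned shows an EmployeeCode associated with more than one FirstName, meaning the dependency does not hold.

-- Checking whether EmployeeCode -> FirstName holds in existing data.
-- The Employee table here is hypothetical, used only for illustration.
SELECT EmployeeCode
FROM dbo.Employee
GROUP BY EmployeeCode
HAVING COUNT(DISTINCT FirstName) > 1;

Remember that data can only disprove a suspected dependency; the dependencies themselves come from business rules.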

First Normal Form (1NF)

The definition of this goes as A relation in which the intersection of each row and column contains one and only one value. Let's try to understand this with an example. The following table shows an identified relation for Student Registrations for courses. As you see, a tuple represents a registration done on a date.


In order to make sure that the relation is in 1NF, we need to check that;
  • No multiple values in the intersection of each row and column
  • No repeating groups of attributes (like Course1, Course2, Course3... columns)
  • Order of attributes and tuples is insignificant
  • No duplicate tuples in the relation.
You can see that the Course attribute has multiple values, which violates 1NF. There are multiple ways of addressing this, but if I need to handle it without decomposing the relation, I can organize my tuples as below.


Since the relation has no multiple values in intersections and no repeating groups, it is now a 1NF relation.
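If we were to implement this 1NF relation as a table, it could look like the following sketch (column names are from the example; the data types are my assumptions).

-- A sketch of the 1NF relation; data types are assumed for illustration.
-- Each tuple holds exactly one Course value, so no intersection has multiple values.
CREATE TABLE dbo.StudentRegistration
(
 StudentCode varchar(10) NOT NULL,
 Name varchar(100) NOT NULL,
 Town varchar(50) NOT NULL,
 Province varchar(50) NOT NULL,
 Course varchar(50) NOT NULL,
 DateRegistered date NOT NULL
);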

Second Normal Form (2NF)

The definition of second normal form is A relation that is in First Normal Form and every non-primary-key attribute is fully dependent on the primary key. What it says is, there should not be a partial dependency between the primary key and non-primary-key attributes.

Let's try to set the primary key for the above table. For that, let's list out some functional dependencies;

StudentCode, Course  ->  Name, Town, Province, DateRegistered
StudentCode  ->  Name
Town  ->  Province

Considering the identified functional dependencies, I can easily pick the first one, StudentCode + Course, as my primary key because the combination can be used for identifying a tuple easily.


Okay, now the primary key is StudentCode + Course. However, we know that the StudentCode -> Name relationship still exists. This means that Name can be determined by part of the primary key; that is a partial dependency. This is a violation of second normal form.

We can now decompose the relation into two for making sure that the relations do not violate 2NF.
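As a sketch (data types are my assumptions), Student holds the attributes that depend on StudentCode alone, and Registration holds the attributes that depend on the whole key.

-- Sketch of the 2NF decomposition; data types are assumed.
CREATE TABLE dbo.Student
(
 StudentCode varchar(10) NOT NULL PRIMARY KEY,
 Name varchar(100) NOT NULL,
 Town varchar(50) NOT NULL,
 Province varchar(50) NOT NULL
);

CREATE TABLE dbo.Registration
(
 StudentCode varchar(10) NOT NULL REFERENCES dbo.Student (StudentCode),
 Course varchar(50) NOT NULL,
 DateRegistered date NOT NULL,
 PRIMARY KEY (StudentCode, Course)
);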


Note that you will not see a violation of 2NF if the primary key is based on just one attribute.


Third Normal Form (3NF)

This normal form speaks about transitive dependency. The definition goes as A relation that is in First and Second Normal form and in which no non-primary-key attribute is transitively dependent on the primary key.

This says that we should remove transitive dependencies if they exist. What is a transitive dependency? It is a condition such as this in the Student relation: StudentCode determines Town (StudentCode -> Town - there is only one Town associated with a given StudentCode) and Town determines Province (Town -> Province), therefore StudentCode determines Province (note that, as per this relation, StudentCode determines Province, but the issue is that it can be determined by Town too). This is a transitive dependency. In other words, if you see that attribute A determines B (A -> B) and B determines C (B -> C), then A determines C (A -> C).

For removing transitive dependency, we need to decompose the relation.
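As a sketch (data types are my assumptions), the Town -> Province dependency moves to its own relation, and Student keeps Town as a foreign key; this refines the Student relation from the 2NF step.

-- Sketch of the 3NF decomposition; data types are assumed.
CREATE TABLE dbo.Town
(
 Town varchar(50) NOT NULL PRIMARY KEY,
 Province varchar(50) NOT NULL
);

CREATE TABLE dbo.Student
(
 StudentCode varchar(10) NOT NULL PRIMARY KEY,
 Name varchar(100) NOT NULL,
 Town varchar(50) NOT NULL REFERENCES dbo.Town (Town)
);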


Boyce-Codd Normal Form (BCNF / 3.5NF)

This is an extension of 3NF and it is sometimes treated as 3.5NF. It makes 3NF stronger by making sure that every determinant among the identified functional dependencies is a candidate key. The definition goes as A relation is in BCNF, if and only if, every determinant is a candidate key.

What does it exactly mean? You have already seen that we can identify many functional dependencies in a relation and we pick one for defining the primary key. The determinants of the other identified functional dependencies may be candidate keys for the primary key, or they might not be qualified for the primary key. If you see that all determinants are qualified, meaning you could mark any of them as the primary key if needed, then your relation (table) is in BCNF.

Take this example.


Assume that business rules related to this relation are as follows;
  1. Course has one or more subjects.
  2. Course is managed by one or more lecturers.
  3. Subject is taught by one or more lecturers.
  4. Lecturer teaches only one subject.
If you consider the primary key of this table to be Course + Subject, then there is no violation of 1NF, 2NF or 3NF. Let's list out all possible functional dependencies.
  1. Course, Subject  ->  Lecturer
  2. Course, Lecturer  ->  Subject
  3. Lecturer  ->  Subject
Now, based on the identified functional dependencies, see whether you can make the determinants candidate keys. If you take the first one, we can clearly say that Course + Subject is a candidate key. The second one, Course + Lecturer, is also a candidate key as we can identify tuples uniquely using it. However, the determinant of the third one cannot be used as a candidate key because it has duplicate values; you cannot make Lecturer a primary key. Now you have a determinant that cannot be set as a key, hence the relation violates BCNF.

In order to make the table a BCNF table, we need to decompose it as below.
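A sketch of the decomposition could look like below (data types are my assumptions). Lecturer -> Subject now has its determinant as the key of its own relation, and the course assignment is kept separately.

-- Sketch of the BCNF decomposition; data types are assumed.
CREATE TABLE dbo.LecturerSubject
(
 Lecturer varchar(50) NOT NULL PRIMARY KEY,
 Subject varchar(50) NOT NULL
);

CREATE TABLE dbo.CourseLecturer
(
 Course varchar(50) NOT NULL,
 Lecturer varchar(50) NOT NULL REFERENCES dbo.LecturerSubject (Lecturer),
 PRIMARY KEY (Course, Lecturer)
);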


Fourth Normal Form (4NF)

This normal form handles multi-valued dependencies introduced by 1NF. When we see repeating groups or multiple values in an intersection, we add additional tuples, removing the multiple values; that is what we do with 1NF. When there are two multi-valued attributes in a relation, each value in one of the attributes has to be repeated with every value of the other attribute. This situation is referred to as a multi-valued dependency. See the relation below;


If we apply 1NF to this relation;


The definition of the multi-valued dependency goes as Represents a dependency between attributes in a relation, such that for each value of A there is a set of values for B and a set of values for C. However, the sets of values for B and C are independent of each other. This dependency is denoted as A ->> B.

See the CustomerContacts table. CustomerCode determines multiple Telephones (CustomerCode ->> Telephone) and CustomerCode determines multiple Addresses (CustomerCode ->> Address).

The fourth normal form is described as A relation that is in Boyce-Codd normal form and does not contain nontrivial multi-valued dependencies. This talks about one type of multi-valued dependency, the nontrivial one. A dependency A ->> B is trivial if B is a subset of A or A U B = R; else it is nontrivial. As you see, CustomerContacts contains nontrivial dependencies, hence we need to decompose the table as below.
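As a sketch (data types are my assumptions), each nontrivial multi-valued dependency gets its own relation.

-- Sketch of the 4NF decomposition; data types are assumed.
CREATE TABLE dbo.CustomerTelephone
(
 CustomerCode varchar(10) NOT NULL,
 Telephone varchar(20) NOT NULL,
 PRIMARY KEY (CustomerCode, Telephone)
);

CREATE TABLE dbo.CustomerAddress
(
 CustomerCode varchar(10) NOT NULL,
 Address varchar(200) NOT NULL,
 PRIMARY KEY (CustomerCode, Address)
);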


Fifth Normal Form (5NF)

In order to normalize relations, we decompose them into multiple relations. Although multiple divided relations optimize transactions and avoid anomalies, they add a cost for data retrieval as the relations have to be rejoined. The biggest risk with rejoining is producing inaccurate outputs in certain conditions.

When we decompose a relation into two relations, the resulting relations have a property called lossless-join, which makes sure that rejoining the two relations produces the original relation. The definition of lossless-join is, A property of decomposition, which ensures that no spurious tuples are generated when relations are reunited through a natural join operation.

Now let's try to understand Fifth Normal Form. When decomposing a relation into multiple relations for minimizing redundancy, it might introduce a join dependency, which might create spurious tuples when the relations are reunited. The definition of Join Dependency goes as for a relation R with subsets of the attributes of R denoted as A, B, ..., Z, the relation R satisfies a join dependency if and only if every legal value of R is equal to the join of its projections on A, B, ..., Z. Considering this, the definition of Fifth Normal Form goes as A relation that has no join dependency.

Since this is very rare to see in database design, let's try to understand it with an example. See the following table that contains how Lecturers teach Subjects related to Courses.


Assume that the tuples are formed based on the following scenario;
  • Whenever Lecturer L1 teaches Subject S1,
  • and Course C1 has Subject S1,
  • and Lecturer L1 teaches at least one subject in Course C1,
  • then Lecturer L1 will be teaching Subject S1 in Course C1.
Note that we have this scenario for explaining 5NF; otherwise you will not see the issue properly.

Now if I try to decompose this relation into two relations for minimizing redundant data, I will have these two tables (sequence numbers are added only for understanding the joins);



Now, if I need to rejoin these with a Natural Join, this will be the result.


See the highlighted one. It is the result of the join dependency; it is a spurious tuple which is not valid. In order to avoid it, and to make the relations 5NF relations, let's introduce another relation like below;
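As a sketch (data types are my assumptions), the three resulting relations could be implemented like below; note that, unlike the BCNF example earlier, a lecturer in this scenario may teach many subjects.

-- Sketch of the 5NF decomposition into three projections; data types are assumed.
CREATE TABLE dbo.LecturerTeachesSubject
(
 Lecturer varchar(50) NOT NULL,
 Subject varchar(50) NOT NULL,
 PRIMARY KEY (Lecturer, Subject)
);

CREATE TABLE dbo.CourseHasSubject
(
 Course varchar(50) NOT NULL,
 Subject varchar(50) NOT NULL,
 PRIMARY KEY (Course, Subject)
);

CREATE TABLE dbo.LecturerTeachesInCourse
(
 Course varchar(50) NOT NULL,
 Lecturer varchar(50) NOT NULL,
 PRIMARY KEY (Course, Lecturer)
);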


Now, if we rejoin all three, we will not see any issue.


Therefore, in order to minimize redundant data and make sure there is no join dependency, make sure relations are formed as 5NF relations.

Wednesday, June 21, 2017

Power BI Parameters - What are they and when they can be used?

Microsoft Power BI supports adding parameters for queries and using them in various areas. However, since it has some limitations and cannot be used in all areas, it has not been noticed much. Let's talk about Power BI parameters and see the way of using them.

Let's take a scenario and see how parameters can help with it. I have a set of CSV files that show monthly sales. We need to create a report using one of the files and then publish it. Whenever the report needs to be refreshed for a different file, I need to change the data source by changing many things. What if I could just open the report using Power BI Desktop and change the source using a menu and get the report refreshed, rather than changing many things?

It can be done with Power BI Parameters. We can create a parameter with predefined values (in this case, source file names along with paths) and use it for changing the source.

A Power BI Parameter can be created in three ways;
  • Parameter that accepts a value via an input box
  • Parameter that is formed using a predefined list
  • Parameter that is formed using another query
Parameter with a predefined list
Let's focus on the second one first. Here are the steps for creating a Power BI report with a parameter;

1. Open Power BI and connect with a CSV file. In my case, the file name is Sales_201501.csv.


2. Once the file is loaded, click Edit Queries in the Home ribbon for opening the Query Editor. You should see one query under Queries.

Parameters can be created only with Query Editor. However, created parameters can be accessed in both Query Editor and Data Model.

3. Click Manage Parameters in the Home ribbon. Once the dialog box is opened, click New to create a new parameter.

4. Name the parameter SourceFile.

5. Select Text for Type.

6. Select List of values for Suggested values.

7. Enter the file names along with their paths.

8. Set Default Value as you need. This is the default value for the designer.

9. Set Current Value as you need. This setting is for the end user, or to use at the report level.


Now you should see the parameter under Queries. If you want to open it as another table in the data model, you can right-click on it and select Enable Load. We do not need to enable this option for using the parameter at the report level.

Now we need to make sure that the source of our file is set with the created parameter. There are two ways of doing it. One is opening the Advanced Editor and changing the M code. The other way is using Data source settings. Let's use the second option. If you need to know how the M code gets changed, open it and see after the change.

10. Click Data source settings in the Home ribbon. This opens the Data Source Settings dialog box.


11. Select the source and click Change Source.... This opens a dialog box that allows you to change settings related to your source.

12. You should notice that the button given for File Path has an option for creating parameters and selecting a parameter.


13. Select Parameter as the option and select the parameter created.


14. Click OK and close the Data source settings.

15. All set. Click Close & Apply for saving the queries and closing Query Editor.

16. Create a visual so that you can distinguish the results when a different file is selected. This is what I created from my file.


Let's change the source using the parameter now.

17. Click the down-arrow on Edit Queries in the Home ribbon. You should see three options. Select the last one, which is Edit Parameters.


18. You should see the Enter Parameters dialog box that shows all parameters. Change the source file to a different one.


19. Once you click OK, you should be prompted to apply the changes. Apply the settings and you should see the report with the new source.


In the same way, we can use parameters in other areas like filtering rows in Query Editor, replacing values, etc. We still cannot use parameters with DAX; that is one of the limitations.

Saturday, June 17, 2017

Relational Algebra - Joins - Theta Join, Equijoin, Natural Join, Outer Join, Semijoin

I wrote a post on Relational Algebra that discusses most of the operations related to it. This is the continuation of it, and it explains join operations related to Relational Algebra. You may find that they are different from the joins we do now, but they are the foundation for all types of joins we do with our relational databases.

Join Operations

Join is one of the main operations we perform against relations (tables) for retrieving data. It is done over the Cartesian Product of the two operand relations, using a Selection with a predicate. We are familiar with joins like INNER JOIN, OUTER JOIN and CROSS JOIN, but initially they were introduced with types like Theta Join, Equijoin, Natural Join, Outer Join and Semijoin. Modern DBMSs have enhanced these and have different implementations, and that is why we do not see these types with the mentioned names. But let's try to understand each of these types and how they are represented with Relational Algebra. Note that I have used different SQL statements, but each can be implemented using many techniques. Remember, almost all joins are based on Cartesian Products.

Theta Join

This is based on a predicate added to a Cartesian Product. In simple terms, if you have joined two tables using CROSS JOIN, then you can add a filter to the result using one of the comparison operators. See the example given below. Note that it can be implemented using a Selection over a Cartesian Product as well.
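Here is a sketch in SQL, assuming two hypothetical tables, Employee and SalaryGrade, with the columns used below.

-- Theta join sketch: a Cartesian Product restricted by comparison predicates.
-- Employee and SalaryGrade are hypothetical tables used for illustration.
SELECT e.EmployeeCode, e.Name, g.Grade
FROM dbo.Employee e
 CROSS JOIN dbo.SalaryGrade g
WHERE e.Salary >= g.MinimumSalary
 AND e.Salary <= g.MaximumSalary;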


Equijoin

This is the same as the Theta Join but the comparison operator is equals. Generally, if the operator of the Theta Join is the equal operator (=), then the join is called an Equijoin instead of a Theta Join. Here are two examples;
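As a sketch, assuming hypothetical Employee and Department tables, here is the same equijoin written over a Cartesian Product and then with explicit join syntax.

-- Equijoin sketch: the predicate uses the equal (=) operator.
SELECT e.EmployeeCode, e.Name, d.DepartmentName
FROM dbo.Employee e
 CROSS JOIN dbo.Department d
WHERE e.DepartmentCode = d.DepartmentCode;

-- The same equijoin using explicit join syntax.
SELECT e.EmployeeCode, e.Name, d.DepartmentName
FROM dbo.Employee e
 INNER JOIN dbo.Department d
  ON e.DepartmentCode = d.DepartmentCode;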



Natural Join

Natural Join is an Equijoin of two relations over all common attributes. In other words, when joining two tables, the join is done using all common columns; therefore, an explicit predicate is not required. See the sample given below. I have used NATURAL JOIN, which is not available in some DBMSs. Note that common attributes are not duplicated in the result.
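A sketch, assuming the same hypothetical tables where DepartmentCode is the only common column.

-- Natural join sketch: the join is made over all common columns implicitly.
-- NATURAL JOIN works in DBMSs such as MySQL and PostgreSQL, but not in SQL Server.
SELECT *
FROM Employee
 NATURAL JOIN Department;

-- A SQL Server equivalent, projecting the common column only once.
SELECT e.DepartmentCode, e.EmployeeCode, e.Name, d.DepartmentName
FROM dbo.Employee e
 INNER JOIN dbo.Department d
  ON e.DepartmentCode = d.DepartmentCode;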


Outer Join

This join type includes both matching and non-matching tuples from one relation and matching tuples from the other relation when two relations are joined. The relation that returns all tuples is determined by the symbol used for the operation: if the symbol is opened toward the left relation, the left relation is the one that returns all tuples. This is implemented using either LEFT or RIGHT OUTER JOIN in SQL.
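A sketch with the same hypothetical tables; every Employee tuple is returned, with NULLs for Department attributes when there is no match.

-- Left outer join sketch: all tuples from Employee, matched tuples from Department.
SELECT e.EmployeeCode, e.Name, d.DepartmentName
FROM dbo.Employee e
 LEFT OUTER JOIN dbo.Department d
  ON e.DepartmentCode = d.DepartmentCode;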


Semijoin

Here is the last one. This join performs a join operation over two relations and projects over the attributes of the first operand (relation). With this join, tuples can be limited for the join operation by adding a predicate, increasing the performance of the join operation.
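A sketch with the same hypothetical tables; EXISTS keeps the projection limited to the first operand and returns each Employee tuple at most once, which is how a semijoin behaves. The DepartmentName predicate shows how tuples can be limited.

-- Semijoin sketch: join with Department but project only Employee attributes.
SELECT e.EmployeeCode, e.Name
FROM dbo.Employee e
WHERE EXISTS (SELECT *
              FROM dbo.Department d
              WHERE d.DepartmentCode = e.DepartmentCode
               AND d.DepartmentName = 'Sales');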

 

Wednesday, June 14, 2017

Power BI - Scheduled Refresh section is not available in Settings

Power BI Desktop allows us to import data from files such as CSV and Excel. Once imported, it automatically creates a Model that can be used for creating reports. The good thing is, when we publish the report, the Dataset that is created with the Model also gets published. This makes sure that we can create a schedule and refresh the published Dataset for seeing the latest data.

See the image below. It shows the Gateway connection and Scheduled refresh settings for an uploaded text file. The Dataset created for the file is PowerBI Test Report.


Now if I get the settings related to the Dataset called Sales_201501;


I will not see the same settings. What is the reason? It is also based on a text file.

This will happen when the uploaded Dataset is not a Model. This file has been uploaded using the Power BI Service (app.powerbi.com), not using Power BI Desktop. Unlike Power BI Desktop, the Power BI Service does not create a Model when a file is uploaded, and the uploaded file will still be treated as a Dataset but not as a Model.

If you need to get your Dataset continuously refreshed by creating a schedule, then make sure that it is created using Power BI Desktop.

You can get the uploaded file refreshed automatically if the file is in either SharePoint or OneDrive.

Saturday, June 10, 2017

Power BI - Cannot highlight or pin visuals when grouped with shapes

Power BI allows us to add shapes like rectangles or ovals for marking a specific area, showing the covered visuals as related and operating together. This is really useful when you need to highlight a group of visuals that have the same behavior or are related to the same business process.


When you have visuals, you can add the required shape and change its properties as you need. The image below shows an added rectangle with a specific Title and fill color, with Transparency set to 75%.


Once added and set, it can be nicely seen with the report. But you may notice that you have lost some functionality: you cannot highlight a slice of the Pie Chart and make other visuals filtered, and you cannot pin the visuals to Dashboards when they are published to the online service.


However, this can be handled. The reason for this is that the added rectangle appears on top of the visuals. If you send the rectangle to the back, the issue gets sorted.


As you see, now you can highlight slices in the Pie Chart and you can pin visuals to a Dashboard.


Wednesday, June 7, 2017

Naming columns in the View

How do you name the columns returned from a view you have created? Generally, you do not need to name them explicitly, as the names of view columns are derived from the base tables. But if there is any ambiguity in the name of a column, or it is a calculated column, then it must be named explicitly.

What are the ways of naming columns?

There are two ways of naming columns in a view. One way is mentioning the names along with the CREATE VIEW statement. This requires names for all columns; you cannot name just a set of columns in the view. Here is an example of it;
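As a minimal sketch against AdventureWorks (the view name and the calculation are my own, for illustration):

USE AdventureWorks2014;
GO

-- Naming all columns in the CREATE VIEW statement; every column must be named.
CREATE VIEW dbo.ProductPriceList (ProductNumber, ProductName, PriceWithTax)
AS
SELECT ProductNumber, Name, ListPrice * 1.15
FROM Production.Product;
GO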


The other way is adding aliases to columns. Most of us prefer this way because we do not need to name all columns explicitly; in most cases, we only need to name columns that have ambiguity in the name or are calculated columns. Remember, if the above technique is used for naming columns, any aliases added will be overridden.
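A sketch of the same view using aliases; only the calculated column really needs naming here.

USE AdventureWorks2014;
GO

-- Naming only the columns that need it, using aliases.
CREATE VIEW dbo.ProductPriceList2
AS
SELECT ProductNumber, Name, ListPrice * 1.15 AS PriceWithTax
FROM Production.Product;
GO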



Tuesday, June 6, 2017

Creating Horizontal Views and Vertical Views

Everyone has either used or created views for getting or creating a dynamic result using one or multiple tables. And everyone knows the usage of views and the purpose of creating them. But many do not know how they are categorized, such as Horizontal and Vertical. Here is a simple explanation of them.

Horizontal Views

If you think about the standard definition of a view, it is a virtual relation (table) that is the result of one or more relational operations on one or more relations (tables). In simple terms, it is a logical table defined with a SELECT statement based on one or more tables. Now, what is a Horizontal View? It is a view that limits the number of records to be produced. We create Horizontal Views by adding a WHERE clause.

USE AdventureWorks2014;
GO

CREATE VIEW dbo.Sales_2012
AS
SELECT * FROM Sales.SalesOrderHeader
WHERE YEAR(OrderDate) = 2012;
GO

Vertical Views

This view results in a set of selected columns from one or more tables. We implement this by selecting the required columns, optionally adding a WHERE clause.

USE AdventureWorks2014;
GO

CREATE VIEW dbo.Sales_2012_Summary
AS
SELECT SalesOrderNumber, OrderDate, SubTotal 
FROM Sales.SalesOrderHeader
WHERE YEAR(OrderDate) = 2012;
GO

In addition to these, there is another type called Grouped and joined views, which is actually either a Horizontal View or a Vertical View with aggregations and multiple joins, hence I do not think it has to be taken as a major type.


Monday, June 5, 2017

What should be considered on Visualization?

Creating a report with stunning, meaningful and appropriate visuals is the key to delivering information quickly, hence special consideration has to be made when designing reports. Otherwise, the created report will not be attractive and users will not see what they want, the way they want it.


There are a few things that need to be considered when designing reports;
  • Placement - You need to make sure that visuals added to the report are placed or positioned at the right place. How do we decide? Generally, the most important visuals (a KPI, for example) should be placed in the top-left corner of the report because it is the place everyone looks at immediately. Not only that, when the report is opened using a mobile device, this makes sure that the important ones are displayed first. In addition to that, make sure the space between visuals is consistent and sized appropriately.
  • Appropriate Visuals - Just because something can be displayed using a Column Chart does not mean that it is the best visual for showing it. Therefore, the right visual should be selected for displaying the information. For example, if just one value has to be shown, something like Revenue for the month, a visual like a KPI or Card can be used. If values need to be shown against a set of parameters, such as revenue by month, a Column Chart is appropriate. If something needs to be shown against a set of parameters using percentages rather than values, a Pie or Donut chart will be appropriate.
  • Story Telling - Make sure that your report contains relevant and related information only; it should not be cluttered. If the report is created for the Sales Department, make sure it shows information related to it. Not only that, if the report contains a set of visuals related to another area, for example a process, highlight it or group them. You can add an unfilled rectangle covering the related visuals.
  • Formatting - The size of the visual, the font used, font size, color and labels all matter. Make sure that the title is properly displayed for visuals, and you can make some visuals large and some small based on their importance. The same goes for font size: if it is a KPI and it is the most important one, increasing its font size will definitely add value to the report. Another important thing in formatting is showing numerals properly. There is no need to show the whole number unless the visual is something like a table; generally, if the number is really large, consider showing it in thousands or millions.
  • Coloring - There are two things to consider with coloring: the background color and the colors on visuals. Generally, a light background color will do a lot, as visuals can be highlighted with most colors. Therefore it is recommended to make the background light, something like white. However, making the background dark has become a trend because it makes the dashboard elegant. But when selecting colors for visuals on a dark background, you need to be very careful, as dark colors in visuals may not blend properly with a dark background. Another consideration is the company theme. Your company may have a color theme for all web applications and you might have to stick to it. If so, select appropriate colors for visuals based on the theme.
    What about colors used with visuals? One thing you need to remember is the standard color set used with BI. Generally, we use Green for excellent, Amber for neutral and Red for worse. Therefore, you should not use a color like Red for indicating the success of something. Another thing is, do not use all three of these colors with all types of visuals. For example, when a KPI is shown, one color is used for coloring either the font or the background of the visual, which is fine. Similarly, these three colors can be used with a tabular visual in all possible cells, which is something you need to consider: your table may look busy when many cells are colored with all three colors, hence it is better to color only one type of cell with one color.
There are many more tiny things to consider, but the above are the most important considerations. List anything if you think that I have missed something.

Sunday, June 4, 2017

Calculate the average value against a large table - SQL Server Brain Basher of the Week #068

Let's talk about another Interview Question related to SQL Server development. This is about one of the aggregate functions that is AVG. Here is the question;

You have a table called SalesOrderHeader and it has millions of records. It has a column called Freight and you need to calculate the average of it. You need to make sure that only 10% of the records are used for calculating the average.

What would be the best way? The standard answer is: write a query to get 10% of the records using TOP and calculate the average.

SELECT AVG(t.Freight) AverageFreight
FROM (SELECT TOP 10 PERCENT Freight
      FROM Sales.SalesOrderHeader) t;

But this might not give you an accurate average as you consider only the first set of records. It will be more accurate if you pick records randomly and then calculate the average. You may add ORDER BY to your statement with the NEWID() function.

SELECT AVG(t.Freight) AverageFreight
FROM (SELECT TOP 10 PERCENT Freight
      FROM Sales.SalesOrderHeader
      ORDER BY NEWID()) t;

The second method gives the most accurate value as it picks records randomly. However, the cost of the statement is high. There is another way to achieve the same: using the TABLESAMPLE operator.

SELECT AVG(Freight) AverageFreight
FROM Sales.SalesOrderHeader
TABLESAMPLE (10 PERCENT)

Here is the comparison between both methods. Notice the total cost.


As you see, TABLESAMPLE gives better performance than the first method.

Since it picks records randomly, the average it returns differs at each execution. If you need the same value for all your executions, the REPEATABLE option has to be used with a repeat_seed.

SELECT AVG(Freight) AverageFreight
FROM Sales.SalesOrderHeader
TABLESAMPLE (10 PERCENT) REPEATABLE (1)

You will get the same average as long as the repeat_seed is the same.

Saturday, June 3, 2017

Power BI - No page to enter SQL Server credentials when connecting - Here is the reason

Have you experienced Power BI just connecting to your SQL Server without requesting your login credentials? Or you might need to connect to your SQL Server using someone else's credentials, but Power BI continues to the next page without prompting the page related to credentials.

This is not a bug or issue, this is because of the cache maintained by Power BI Desktop.

When I try to connect to my SQL Server using Power BI Desktop, I get the initial page where I can enter the server name and database name. But when I click OK, it straightaway opens the page for table selection instead of showing the page related to authentication.


I know that this happens because I have connected to the same server before and my connection is cached. What if I want to connect to the same server using a different account?

Clearing the data source cache

It is possible to clear the saved information related to previously connected sources. You can open this setting using File menu -> Options and settings -> Data source settings. This page shows all connections you have made previously; Data sources in current file shows all connections made with the opened file and Global permissions shows all connections made with all files.


You can clear cached permissions by selecting the connection and clicking Clear Permissions. Once cleared, you should see the page that requests credentials for connecting with the source.


Friday, June 2, 2017

Power BI does not show all columns in SQL Server View

I experienced a strange issue with Power BI today; I might have missed something or there is something unknown to me. This is what I experienced: Power BI does not show all columns when trying to import records from a SQL Server view.

Just for simulating the issue, I created a view in AdventureWorks2014, using three tables, Product, ProductSubcategory, and ProductCategory, and named it vProduct.

USE [AdventureWorks2014]
GO

CREATE OR ALTER VIEW dbo.vProduct
AS
 SELECT p.ProductID, p.Name Product, s.Name SubCategory, c.Name Category
 FROM Production.Product p
  INNER JOIN Production.ProductSubcategory s
   ON p.ProductSubcategoryID = s.ProductSubcategoryID
  INNER JOIN Production.ProductCategory c
   ON s.ProductCategoryID = c.ProductCategoryID;
GO

Then I granted SELECT permission to the user Jack.

GRANT SELECT ON dbo.vProduct TO Jack;
GO

If Jack executes a SELECT against the view, he sees records with all columns.


Now, if Jack connects to the database with Power BI, this is what he sees.


Has anyone experienced this? I made a search but could not find a solution. I might have made a mistake or missed something, but could not figure it out. I would appreciate it if you could share your thoughts, or a solution if you have already seen this and sorted it out.

I used the latest Power BI version, which is May 2017.

Note that I see the same result for both Import and Direct Query.