Dinesh's Blog :::: Being Compiled ::::: Table design

Showing posts with label Table design. Show all posts

Monday, September 11, 2017

SQL Server - Merging Partitions - Does it physically move data from a data file to another?

Partitioning is a common practice when maintaining large number of records in tables regardless of the database type: whether the database is an OLTP database or a data warehouse based on dimensional modeling. Partitioning requires a Partition Function that describes boundary values and number of partitions required, and Partition Scheme that assigns File Groups to partitions.

Here are few posts I have made on partitioning;

Fact Table: Clustered index on date key is enough or partitioning should be done with it?

Truncating Partitions - SQL Server 2016 TSQL Enhancements

Fact Loading: Switching a loaded temporary table into an empty partition of the fact table

Once a table is partitioned, based on the workloads, we add new boundary values for introducing new partitions to the table and we remove boundary values for combining partitions (or removing partitions). These two operations are done with two partition related functions: SPLIT and MERGE. Now the question is, when we remove a partition using the MERGE function, if partitions are distributed with different file groups (means different data files), does SQL Server moves data from one file to another? Do we really have to consider it?

Let's make a database and do a simple test. Let's create a database called AdventureWorksTest and add multiple file groups and data files.

-- Creating the test database
USE master;
GO

CREATE DATABASE AdventureWorksTest
GO

-- Add multiple file group and files
-- files are distributed
-- Add filegroup 1
ALTER DATABASE AdventureWorksTest
ADD FILEGROUP FileGroup1;
GO
ALTER DATABASE AdventureWorksTest
ADD FILE 
(
    NAME = AdventureWorksTest_Data_01,
    FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\AdventureWorksTest_Data_01.ndf',
    SIZE = 5MB,
    MAXSIZE = 100MB,
    FILEGROWTH = 5MB
)
TO FILEGROUP FileGroup1;

-- Add filegroup 2
ALTER DATABASE AdventureWorksTest
ADD FILEGROUP FileGroup2;
GO
ALTER DATABASE AdventureWorksTest
ADD FILE 
(
    NAME = AdventureWorksTest_Data_02,
    FILENAME = 'C:\Databases\AdventureWorksTest_Data_02.ndf',
    SIZE = 5MB,
    MAXSIZE = 100MB,
    FILEGROWTH = 5MB
)
TO FILEGROUP FileGroup2;

-- Add filegroup 3
ALTER DATABASE AdventureWorksTest
ADD FILEGROUP FileGroup3;
GO
ALTER DATABASE AdventureWorksTest
ADD FILE 
(
    NAME = AdventureWorksTest_Data_03,
    FILENAME = 'D:\Databases\AdventureWorksTest_Data_03.ndf',
    SIZE = 5MB,
    MAXSIZE = 100MB,
    FILEGROWTH = 5MB
)
TO FILEGROUP FileGroup3;

-- Add filegroup 4
ALTER DATABASE AdventureWorksTest
ADD FILEGROUP FileGroup4;
GO
ALTER DATABASE AdventureWorksTest
ADD FILE 
(
    NAME = AdventureWorksTest_Data_04,
    FILENAME = 'E:\Databases\AdventureWorksTest_Data_04.ndf',
    SIZE = 5MB,
    MAXSIZE = 100MB,
    FILEGROWTH = 5MB
)
TO FILEGROUP FileGroup4;
GO

As you see, we have five file groups now.

Let's create a Partitioned Table and load it using AdventureWorks2014 Sales.SalesOrderHeader table. Note that, below code creates a Partition Function with 2014, 2015, 2016 and 2017 as boundary values. And it creates the Partition Scheme setting with multiple file groups.

-- Creating parition function and scheme
USE AdventureWorksTest;
GO

CREATE PARTITION FUNCTION pf_Year (datetime) 
AS RANGE right
FOR VALUES ('2014-01-01 00:00:00.000', '2015-01-01 00:00:00.000',  '2016-01-01 00:00:00.000',  '2017-01-01 00:00:00.000');
GO

CREATE PARTITION SCHEME ps_Year 
AS PARTITION pf_Year 
TO ([PRIMARY], FileGroup1, FileGroup2, FileGroup3, FileGroup4);
GO



-- Creating a partitioned table and loading from AdventureWorks2014 database
-- Note that sales dates of AdventureWorks2014 database have been updated for year 2014 to 2017
CREATE TABLE Sales
(
 SalesOrderID int NOT NULL,
 RevisionNumber tinyint NOT NULL DEFAULT (0),
 OrderDate datetime NOT NULL DEFAULT GETDATE(),
 DueDate datetime NOT NULL,
 ShipDate datetime NULL,
 [Status] tinyint NOT NULL DEFAULT(1),
 SubTotal money NOT NULL DEFAULT(0.00),
 TaxAmt money NOT NULL DEFAULT (0.00),
 Freight money NOT NULL DEFAULT (0.00),
 TotalDue money,
 Comment nvarchar(128) NULL,
 rowguid uniqueidentifier ROWGUIDCOL NOT NULL DEFAULT NEWID(),
 ModifiedDate datetime NOT NULL DEFAULT GETDATE()
) 
ON ps_Year(OrderDate);
GO

INSERT INTO Sales
  (SalesOrderID, RevisionNumber, OrderDate, DueDate, ShipDate, [Status]
  , SubTotal, TaxAmt, Freight, TotalDue, Comment, rowguid, ModifiedDate)
SELECT SalesOrderID, RevisionNumber, OrderDate, DueDate, ShipDate, [Status]
  , SubTotal, TaxAmt, Freight, TotalDue, Comment, rowguid, ModifiedDate
FROM AdventureWorks2014.Sales.SalesOrderHeader;

-- Checking records;
SELECT  $PARTITION.pf_Year(OrderDate) PartitionNo, *
FROM Sales;

Here is the result of the query.

Here is another query to see how data is distributed with each file group.

-- Get the record count for each partition/file group
SELECT t.name AS TableName,
  p.partition_number PartitionNumber,
  COALESCE(f.name, d.name) FileGroupName, 
  SUM(p.rows) NumberOfRecords,
  SUM(a.total_pages) NumberOfDataPages
FROM sys.tables AS t
 INNER JOIN sys.indexes i 
  ON i.object_id = t.object_id
 INNER JOIN sys.partitions p 
  ON p.object_id = t.object_id 
   AND p.index_id = i.index_id
 INNER JOIN sys.allocation_units a 
  ON a.container_id = p.partition_id
 LEFT OUTER JOIN sys.filegroups f 
  ON f.data_space_id = i.data_space_id
 LEFT OUTER JOIN sys.destination_data_spaces AS dds 
  ON dds.partition_scheme_id = i.data_space_id 
   AND dds.destination_id = p.partition_number
 LEFT OUTER JOIN sys.filegroups d 
  ON d.data_space_id = dds.data_space_id
WHERE t.[type] = 'U' AND i.index_id IN (0, 1) AND t.name LIKE 'Sales'
GROUP BY t.name, p.partition_number, COALESCE(f.name, d.name) 
ORDER BY t.name, p.partition_number;

If I illustrate the same in different manner;

This is how the data is distributed with my system now. If I try to delete a data file, example, AdventureWorksTest_Data_03 that is belong to FileGroup3, SQL Server will not allow me to do as it holds data.

If I decide to combine 2015 and 2016, in other words, if I want to remove the 4th partition that holds 2016 data, what will happen. Let's do and see. The below codes merge the partition;

ALTER PARTITION FUNCTION pf_Year()
MERGE RANGE ('2016-01-01 00:00:00.000');

Once delete, let me run the same query to see how data is partitioned now.

As you see, FileGroup3 is no longer used by the table and number of records in the FileGroup2 has been increased. The partition which was 5 is now 4. This means, data that was maintained with FileGroup3 (or AdventureWorksTest_Data_03) has been moved to FileGroup2 (or AdventureWorksTest_Data_02). This is what has happened;

This clearly shows that data is moved when partitions are merged. Okay, do we have to really consider about this? Yes, we have to, when this is a small table, we do not see any performance issue but if the table is large, moving data from one file to another will surely take time, hence, think twice before merging, specifically when they are distributed in multiple files.

** Now, if you need, you can delete the AdventureWorksTest_Data_03 file and SQL Server will allow you to delete as it is empty.

Friday, September 8, 2017

SQL Server - Dropping Primary Key will drop the Clustered Index?

Everyone knows that SQL Server creates a Clustered Index when we add a Primary Key if there is no Clustered Index already exist in the table. It adds a Non-Clustered Key for the Primary Key if we have already added a Clustered Index. However they are two different objects; one is a Constraint and other is an Index. What if I drop one object? Will it drop the other as well?

Let's make a test and see. If I create a table Customer like below, making Customer Key as the Primary Key;

As you see, it will create both Key and the Index.

Now if I drop either one, it will drop the other one as well. For an example, if I drop the Primary Key, it will drop the Index as well.

If I need to make sure that it does not happen, I can create them separately, first create the Clustered Index on CustomerKey and then make the CustomerKey as the Primary Key. However, that will add another index specifically for the Primary Key.

The reason for above behavior is, Primary Key needs an Index. It is always associated with an index therefor if one is getting dropped, the associate also getting dropped.

Thursday, April 13, 2017

SQL Server Splits the page when I change an Integer Column Value

I have written few posts on Index Fragmentation and how they can be checked. External Fragmentation occurs as a result of Page-Split and it happens when there is no space in the page to accommodate the change (new record or an update). When no space, it moves approximately 50% of records to a new page, accommodating the change to relevant page.

Read Understanding Index Fragmentation – SS SLUG Aug 2013 – Demo I post for understanding fragmentation.

Read SQL Server - Checking Index Fragmentation - Understanding LIMITED, SAMPLED and DETAILED for understanding various modes that can be used for checking fragmentation.

For understanding page-split, lets take an example. Have a look on below table.

-- creating test table
CREATE TABLE dbo.TestTable
(
    Id int PRIMARY KEY
    , Value char(4000) NOT NULL
)
GO
 
-- inserting 2 records
INSERT INTO dbo.TestTable
    VALUES 
    (2, replicate('a', 4000)), (5, replicate('b', 4000));

It has two records. Since the approximate size of a record is 4004 bytes (4 for int column and 4000 for char column), SQL Server can use one page for holding both records. The size of a data page is 8KB. The below query proves that both records are in the same page.

SELECT Id, Value, ph.*
FROM dbo.TestTable
CROSS APPLY fn_PhysLocCracker(%%PHYSLOC%%) ph;

If I make a change to Value column, SQL Server does not need to split the page as the current value uses 4000 bytes. Example, if I change the first record value as REPLICATE('x', 4000), it can simply replace the old value, hence no page split. See below code and the result.

UPDATE dbo.TestTable
    SET Value = replicate('x', 4000)
WHERE id = 2
 
SELECT Id, Value, ph.*
FROM dbo.TestTable
CROSS APPLY fn_PhysLocCracker(%%PHYSLOC%%) ph;

Just like that, if we make a change to the integer value, we should not see any change on pages used. Let's make a small change and see.

UPDATE dbo.TestTable
    SET Id = 3
WHERE id = 2;
 
SELECT Id, Value, ph.*
FROM dbo.TestTable
CROSS APPLY fn_PhysLocCracker(%%PHYSLOC%%) ph;

Did you notice the change? Why the second record has been moved to a new page? Mean Page-Split has been occurred with this changed?

Is it because of the data type? It is not. Note that this is not just an integer column. It is the PRIMARY KEY; The clustered key. What happened was, SQL Server updated the Value column as an in-place update. It does not require additional space. However SQL Server does not perform in-place update for key columns. It performs delete-plus-insert operation for key column updates. It requires space for the INSERT. That is the reason for the page split. Understand that this behavior is not only for integer, it is for all key columns.

This is why we discourage you to select a column as the key column if it can be changed by a business operation. It is always recommended to use a non-volatile column as the key column that do not expect changes on values.

Monday, February 27, 2017

NULL values consume storage in SQL Server Tables?

While we were discussing on data types to be used and null-able columns, a question was raised, asking whether the space is used for NULLs as SQL Server uses for other data types. My answer was Yes and No because it depends on the data type we have used.

NULL is not exactly a value. It indicates that the value is unknown hence it requires some bits/bytes to maintain it. However, if I set one of my columns value as NULL, can I assume that it will not use the space that data type suppose to use? For example, if I have a column with data type int that uses 4 bytes per value and I inserted a record with NULL for the column, will SQL Server still uses 4 bytes or few bits for the NULL?

It is always better to write some codes for testing and come to a conclusion. Therefore, let's test this with four tables. The below code creates;

Customer_Without_NullValues_FixedWidthType table
Customer_With_NullValues_FixedWidthType table
Customer_Without_NullValues_VaryWidthType table
Customer_With_NullValues_VaryWidthType table

The last three columns of First and Second tables are set with date data type and all are null-able. And last three columns of Third and Forth tables are set with varchar(4000) data type and all are null-able.

USE tempdb;
GO

CREATE TABLE dbo.Customer_Without_NullValues_FixedWidthType
(
 Id int identity(1,1) PRIMARY KEY
 , Name varchar(100) NOT NULL
 , DateOfBirth date NULL
 , RegisteredDate date NULL
 , LastPurchasedDate date NULL
);
GO

CREATE TABLE dbo.Customer_With_NullValues_FixedWidthType
(
 Id int identity(1,1) PRIMARY KEY
 , Name varchar(100) NOT NULL
 , DateOfBirth date NULL
 , RegisteredDate date NULL
 , LastPurchasedDate date NULL
);
GO


CREATE TABLE dbo.Customer_Without_NullValues_VaryWidthType
(
 Id int identity(1,1) PRIMARY KEY
 , Name varchar(100) NOT NULL
 , Details1 varchar(4000) NULL
 , Details2 varchar(4000) NULL
 , Details3 varchar(4000)  NULL
);
GO

CREATE TABLE dbo.Customer_With_NullValues_VaryWidthType
(
 Id int identity(1,1) PRIMARY KEY
 , Name varchar(100) NOT NULL
 , Details1 varchar(4000) NULL
 , Details2 varchar(4000) NULL
 , Details3 varchar(4000)  NULL
);
GO

Next code insert 100,000 records for each table. However, last three columns of Second and Forth tables are filled with NULLs instead of known values.

INSERT INTO dbo.Customer_Without_NullValues_FixedWidthType
 (Name, DateOfBirth, RegisteredDate, LastPurchasedDate)
 VALUES
 ('a', getdate(), getdate(), getdate());

INSERT INTO dbo.Customer_With_NullValues_FixedWidthType
 (Name, DateOfBirth, RegisteredDate, LastPurchasedDate)
 VALUES
 ('a', null, null, null);

INSERT INTO dbo.Customer_Without_NullValues_VaryWidthType
 (Name, Details1, Details2, Details3)
 VALUES
 ('a', REPLICATE('a', 4000), REPLICATE('a', 4000), REPLICATE('a', 4000));

INSERT INTO dbo.Customer_With_NullValues_VaryWidthType
 (Name, Details1, Details2, Details3)
 VALUES
 ('a', null, null, null);

GO 100000

In order to see the space usage, easiest way is, check number of pages read for data retrieval.

As you see, space usage of First and Second table is same regardless of the value stored. I means, Fixed data types need the defined space whether the value is null or not. However, Third and Forth clearly shows that it is not the same with data type with vary length. When the data type is vary in length, it does not need the space defined with the type.

Sunday, February 12, 2017

Converting Attributes to Columns - SQL Server Brain Basher of the Week #065

Let's talk about something common for all database management systems without talking something specific to SQL Server. This is the interview question of the week and let me start it with a conversation I had with a interviewee.

image taken from: https://www.pinterest.com/thepotentmix/recruitment-cartoons-and-other-entertainments/

Me: Let's assume that you have been given a document that contains identified attributes for a particular entity, for an example, Customer Entity. Business Analyst has mentioned that it needs an attribute called Customer Name. How do you take this entity and design your database table?

Interviewee: All I have to do is, understand given attributes and create a table call Customer with relevant columns. We need to make sure that the right data type is selected for each every attribute or the column.

Me: How do you decide the data type? Let's talk about this specific attribute: Customer Name.

Interviewee: Yes, I will be setting varchar for this attribute, probably with the size as 200. Another thing, if the solution is a multilingual application, have to use nvarchar instead of varchar.

Me: Good, anything else to be considered on that?

Interviewee: Regarding Customer Name, I think that is all I have to consider. We might have to change the size of it but 200 is reasonable.

Okay, what do you think? He is not a senior person, that is what I started with basic but I expected something additional, something extra.

First thing you need to understand is, BA does not know or does not think about database design. It is all about the business. It is our duty to covert the business identified to logical and physical database design. So, even though BA has identified Customer Name as an attribute, or as a single attribute, it does not mean that we need to stick into it. This is where we apply normalization rules. Not only that, we need to think and see;

How this attribute is going to be filled
How this attributed is going to be read
How often the entity is searched based on this attribute.

For an example, Customer Entity (or the table) might be searched by Customer Last Name. Have we addressed this requirement. Do not say that BA has not mentioned it, if not mentioned, it is always good to get it clarified but it is something we need to consider.

If we consider other common things whether BA has mentioned it or not. Customer Name is not going to be a single column. You will surely have at least two columns called First Name and Last Name. This makes sure that your table is designed properly and it is ready for standard or most common analysis as well as holding data efficiently. Remember, breaking an attribute to multiple columns is not only for known attributes such as Employee Name, Location. We might break attributes like Product Code, Invoice Number as well. For example, Product Code might be forming using two elements such as Product Type Code and Unique Number. In that case, it is better to have two columns for Product Type Code and Unique Number, and another Computed Column for Product Code (if required).

Make sense? Although this is a simple thing, many miss it, hence DO NOT FORGET SIMPLE THINGS AND FUNDAMENTAL RELATED TO DATABASE DESIGN :).

Wednesday, January 18, 2017

SQL Server Identity value suddenly became 1000+

I noticed that values of a column related to newly inserted records which is an identity property enabled column in one of the tables, suddenly started from 1000+. The last values before this sudden jump were 27, 28, 29 but suddenly it started from 1000+ values. How did it happen without resetting the seed or increment of the identity property?

This is not a bug or something related to SQL Server 2016. It a result of a new implementation related to SQL Server 2012. As per my readings, since SQL Server 2012, it caches 1000 values for an identity property if the data type is int and it caches 10,000 values for big int and numeric. If an unexpected restart happens (in my case, I believe that an update caused the restart), there is a possibility to lose the cache, hence it starts from next available value. This is normal and you do not need to worry if you experience it.

Enabling Trace flag 272, you can get the old behavior or you can use Sequence instead. Read this thread for understanding these two options: http://stackoverflow.com/questions/17587094/identity-column-value-suddenly-jumps-to-1001-in-sql-server

Sunday, January 8, 2017

Table Design Considerations - SQL Server Brain Basher of the Week #063

He is the Brain Basher or the interview question of the week. This is something every database engineer should know and it is not limited to database administrators.

What would be the considerations you make when you are asked to create a table for holding data?

Let me elaborate the question more. How do you start designing of the table when you are asked to design it on a particular entity. If you are an ordinary engineer who thinks traditionally, your answer could be;

"Need to check the requirements and create a script with relevant columns, and apply normalization as need."

Do you think that the answer is similar to cartoon shown above?

Being an interviewer, I prefer to hear something more, something that explains the interviewee's experience. Let's point out some valuable considerations we have to make when a table has to be designed.

Read the attributes identified by the BA and understand the business requirement - yes, you can think that this not something we should concern but trust me, you need to know.
Analyze and see whether this needs as a normalized table or denormalized table. Based on the requirement, decide the level like how far we need to go through normalization rules.
Decided the data types of the attributes and size of them. Need to pick the best, thinking present and future too. Should we think about past data as well. Yes, we should if we need to load legacy data.
Consider the datetime format and whether it needs to maintain Unicode characters.
Need to analyze whether it needs partitions and partitions need to be distributed among multiple file groups (preferably multiple volumes).
Constraints required. In addition to the Primary Key, we may add additional constraints such as Foreign Key, Unique Key and some rules.
Whether Artificial or Surrogate Key is required.
Number of indexes required, whether we need Columnstore or not, order of them, whether they need any filters.
Whether compression is required.

There are few other tiny things as well but these things show that the Interviewee is smart and knowledgeable. Remember, it is not just these points, knowing details of them is a must.

Monday, December 28, 2015

GUID or Uniqueidentifier as the key - II - Issues with it

My first post related to GUID discussed the main issue we face when using GUID as the key, specifically as the Clustered Key. Here is another concern we should consider when maintaining a table with GUID as the key. This issue does need to be considered as a critical issue if the table contains low number of records. But if it contains millions of records, this is something we need to consider for sure.

Let's create two tables just as the first post. The below code creates two tables named dbo.Sales_WithGUID and dbo.Sales_WithoutGUID and loads same data set for both tables.

USE tempdb;
GO

IF OBJECT_ID('dbo.Sales_WithGUID') IS NOT NULL
 DROP TABLE dbo.Sales_WithGUID;
GO

-- Creating the table
CREATE TABLE dbo.Sales_WithGUID
(
 SalesKey uniqueidentifier PRIMARY KEY DEFAULT(NewID()),
 ProductKey int NOT NULL,
 OrderDateKey int NOT NULL,
 DueDateKey int NOT NULL,
 ShipDateKey int NOT NULL,
 CustomerKey int NOT NULL,
 PromotionKey int NOT NULL,
 CurrencyKey int NOT NULL,
 SalesTerritoryKey int NOT NULL,
 SalesOrderNumber nvarchar(20) NOT NULL,
 SalesOrderLineNumber tinyint NOT NULL,
 RevisionNumber tinyint NOT NULL,
 OrderQuantity smallint NOT NULL,
 UnitPrice money NOT NULL,
 ExtendedAmount money NOT NULL,
 UnitPriceDiscountPct float NOT NULL,
 DiscountAmount float NOT NULL,
 ProductStandardCost money NOT NULL,
 TotalProductCost money NOT NULL,
 SalesAmount money NOT NULL,
 TaxAmt money NOT NULL,
 Freight money NOT NULL,
 CarrierTrackingNumber nvarchar(25) NULL,
 CustomerPONumber nvarchar(25) NULL,
 OrderDate datetime NULL,
 DueDate datetime NULL,
 ShipDate datetime NULL
);
GO

INSERT INTO dbo.Sales_WithGUID
  (
   ProductKey, OrderDateKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey
    ,CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber
    ,OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount
    ,ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt, Freight
    ,CarrierTrackingNumber, CustomerPONumber, OrderDate, DueDate, ShipDate
  )
SELECT ProductKey, OrderDateKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey
      ,CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber
      ,OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount
      ,ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt, Freight
      ,CarrierTrackingNumber, CustomerPONumber, OrderDate, DueDate, ShipDate
FROM AdventureWorksDW2014.dbo.FactInternetSales;


IF OBJECT_ID('dbo.Sales_WithoutGUID') IS NOT NULL
 DROP TABLE dbo.Sales_WithoutGUID;
GO

-- Create the second table
CREATE TABLE dbo.Sales_WithoutGUID
(
 SalesKey int identity(1,1),
 CompanyCode char(2) Default ('SL'),
 constraint pk_Sales_WithoutGUID primary key (SalesKey, CompanyCode),
 ProductKey int NOT NULL,
 OrderDateKey int NOT NULL,
 DueDateKey int NOT NULL,
 ShipDateKey int NOT NULL,
 CustomerKey int NOT NULL,
 PromotionKey int NOT NULL,
 CurrencyKey int NOT NULL,
 SalesTerritoryKey int NOT NULL,
 SalesOrderNumber nvarchar(20) NOT NULL,
 SalesOrderLineNumber tinyint NOT NULL,
 RevisionNumber tinyint NOT NULL,
 OrderQuantity smallint NOT NULL,
 UnitPrice money NOT NULL,
 ExtendedAmount money NOT NULL,
 UnitPriceDiscountPct float NOT NULL,
 DiscountAmount float NOT NULL,
 ProductStandardCost money NOT NULL,
 TotalProductCost money NOT NULL,
 SalesAmount money NOT NULL,
 TaxAmt money NOT NULL,
 Freight money NOT NULL,
 CarrierTrackingNumber nvarchar(25) NULL,
 CustomerPONumber nvarchar(25) NULL,
 OrderDate datetime NULL,
 DueDate datetime NULL,
 ShipDate datetime NULL
);
GO

INSERT INTO dbo.Sales_WithoutGUID
  (
   ProductKey, OrderDateKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey
    ,CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber
    ,OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount
    ,ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt, Freight
    ,CarrierTrackingNumber, CustomerPONumber, OrderDate, DueDate, ShipDate
  )
SELECT ProductKey, OrderDateKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey
      ,CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber
      ,OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount
      ,ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt, Freight
      ,CarrierTrackingNumber, CustomerPONumber, OrderDate, DueDate, ShipDate
FROM AdventureWorksDW2014.dbo.FactInternetSales;

Let's check the space used by both tables.

As you see, first table which is based on GUID consumes 10.7MB for data and 80KB for indexes. Second table which is based on int+char(2) as the key consumes 10.0MB for data and 56KB for indexes. Now, let's additional non-clustered indexes for both tables.

CREATE INDEX IX_Sales_WithGUID_ShipDateKey ON dbo.Sales_WithGUID (ShipDateKey);
CREATE INDEX IX_Sales_WithGUID_CurrencyKey ON dbo.Sales_WithGUID (CurrencyKey);
CREATE INDEX IX_Sales_WithGUID_CustomerKey ON dbo.Sales_WithGUID (CustomerKey);
CREATE INDEX IX_Sales_WithGUID_DueDateKey ON dbo.Sales_WithGUID (DueDateKey);
CREATE INDEX IX_Sales_WithGUID_OrderDateKey ON dbo.Sales_WithGUID (OrderDateKey);
CREATE INDEX IX_Sales_WithGUID_ProductKey ON dbo.Sales_WithGUID (ProductKey);
CREATE INDEX IX_Sales_WithGUID_PromotionKey ON dbo.Sales_WithGUID (PromotionKey);

CREATE INDEX IX_Sales_WithoutGUID_ShipDateKey ON dbo.Sales_WithoutGUID (ShipDateKey);
CREATE INDEX IX_Sales_WithoutGUID_CurrencyKey ON dbo.Sales_WithoutGUID (CurrencyKey);
CREATE INDEX IX_Sales_WithoutGUID_CustomerKey ON dbo.Sales_WithoutGUID (CustomerKey);
CREATE INDEX IX_Sales_WithoutGUID_DueDateKey ON dbo.Sales_WithoutGUID (DueDateKey);
CREATE INDEX IX_Sales_WithoutGUID_OrderDateKey ON dbo.Sales_WithoutGUID (OrderDateKey);
CREATE INDEX IX_Sales_WithoutGUID_ProductKey ON dbo.Sales_WithoutGUID (ProductKey);
CREATE INDEX IX_Sales_WithoutGUID_PromotionKey ON dbo.Sales_WithoutGUID (PromotionKey);

Here is the result now.

Focus on the space used for indexes. First table uses 11MB while second table uses only 7MB. Note that this is only for 60,000 records but if the table has more records, gap will be significantly huge.

How can it be happened when we have same number of records and same types of indexes. This is because of every non-clustered key value is maintained with the clustered key. When it comes to the first table, since it is 16 bytes, the size of the non-clustered key value will be 16+4. But with the second table, it is 6+4 bytes. This is the reason for high usage of the spave for non-clustered indexes.

Here is the code that shows the number of data pages used by each index. Large number of pages increases IO operations and usage of resources.

SELECT OBJECT_NAME (i.object_id) AS TableName 
 , i.name AS IndexName
 , ps.index_id AS IndexID 
 , ps.alloc_unit_type_desc AS DataStructure
 , ps.page_count AS Pages
 , ps.record_count AS Rows
FROM sys.indexes AS i
 CROSS APPLY sys.dm_db_index_physical_stats 
  (DB_ID ()
  , i.object_id
  , NULL
  , NULL
  , N'DETAILED') AS ps
WHERE i.object_id = ps.object_id
  AND i.index_id = ps.index_id
  AND i.object_id 
   IN (OBJECT_ID (N'Sales_WithGUID')
    , OBJECT_ID (N'Sales_WithoutGUID'))
  AND ps.index_level = 0 -- only lef level
ORDER BY TableName, IndexID;
GO

Now you see issue we may face when using GUIDs, hence use it knowing these issues and manage it properly if it is really required.

Saturday, December 26, 2015

GUID or Uniqueidentifier as the key - I - Issues with it

While analyzing a database related to one of the companies, I noticed that the key of most tables is based on GUID (uniqueidentifier) that drastically hinders the performance unless it is managed properly. Unfortunately, as per my experience in consultancy, 90% of databases that use GUID as the key for tables are not properly managed and many are unaware of the damages it does though they experience poor performance on data manipulation and retrieval.

Most reason out that they need GUID as the key because of a need of an uniqueness among records specifically when records are joined from multiple companies or regions. But with many cases, it can be handled without using Unqiueidentifier but using a composite key that is formed using an integer value and a code that represents the company or region. Understanding the cost of it and understanding how it slows down your queries will surely guide you to take the right decision, hence here is a post on it.

Let's try to understand the most common issue related to GUID as the key. Following code creates a table called Sales_WithGUID and loads data from AdventureWorks. The key of the table, which is SalesKey is a GUID. Note that I do not load records as a batch, but load one by one simulating the real operation.

USE tempdb;
GO

IF OBJECT_ID('dbo.Sales_WithGUID') IS NOT NULL
 DROP TABLE dbo.Sales_WithGUID;
GO

-- Creating the table
CREATE TABLE dbo.Sales_WithGUID
(
 SalesKey uniqueidentifier PRIMARY KEY DEFAULT(NewID()),
 ProductKey int NOT NULL,
 OrderDateKey int NOT NULL,
 DueDateKey int NOT NULL,
 ShipDateKey int NOT NULL,
 CustomerKey int NOT NULL,
 PromotionKey int NOT NULL,
 CurrencyKey int NOT NULL,
 SalesTerritoryKey int NOT NULL,
 SalesOrderNumber nvarchar(20) NOT NULL,
 SalesOrderLineNumber tinyint NOT NULL,
 RevisionNumber tinyint NOT NULL,
 OrderQuantity smallint NOT NULL,
 UnitPrice money NOT NULL,
 ExtendedAmount money NOT NULL,
 UnitPriceDiscountPct float NOT NULL,
 DiscountAmount float NOT NULL,
 ProductStandardCost money NOT NULL,
 TotalProductCost money NOT NULL,
 SalesAmount money NOT NULL,
 TaxAmt money NOT NULL,
 Freight money NOT NULL,
 CarrierTrackingNumber nvarchar(25) NULL,
 CustomerPONumber nvarchar(25) NULL,
 OrderDate datetime NULL,
 DueDate datetime NULL,
 ShipDate datetime NULL
);
GO

-- Loading records one by one
DECLARE cur CURSOR 
FOR
SELECT ProductKey, OrderDateKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey
      ,CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber
      ,OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount
      ,ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt, Freight
      ,CarrierTrackingNumber, CustomerPONumber, OrderDate, DueDate, ShipDate
  FROM AdventureWorksDW2014.dbo.FactInternetSales;

DECLARE @ProductKey int 
  , @OrderDateKey int
  , @DueDateKey int
  , @ShipDateKey int 
  , @CustomerKey int 
  , @PromotionKey int 
  , @CurrencyKey int 
  , @SalesTerritoryKey int 
  , @SalesOrderNumber nvarchar(20) 
  , @SalesOrderLineNumber tinyint 
  , @RevisionNumber tinyint 
  , @OrderQuantity smallint 
  , @UnitPrice money 
  , @ExtendedAmount money 
  , @UnitPriceDiscountPct float 
  , @DiscountAmount float 
  , @ProductStandardCost money 
  , @TotalProductCost money 
  , @SalesAmount money 
  , @TaxAmt money 
  , @Freight money 
  , @CarrierTrackingNumber nvarchar(25) 
  , @CustomerPONumber nvarchar(25)  
  , @OrderDate datetime  
  , @DueDate datetime  
  , @ShipDate datetime

OPEN cur

FETCH NEXT FROM cur INTO @ProductKey, @OrderDateKey, @DueDateKey, @ShipDateKey, @CustomerKey, @PromotionKey
      , @CurrencyKey, @SalesTerritoryKey, @SalesOrderNumber, @SalesOrderLineNumber, @RevisionNumber
      , @OrderQuantity, @UnitPrice, @ExtendedAmount, @UnitPriceDiscountPct, @DiscountAmount
      , @ProductStandardCost, @TotalProductCost, @SalesAmount, @TaxAmt, @Freight
      , @CarrierTrackingNumber, @CustomerPONumber, @OrderDate, @DueDate, @ShipDate

WHILE (@@FETCH_STATUS = 0)
BEGIN

 INSERT INTO dbo.Sales_WithGUID
  (
   ProductKey, OrderDateKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey
    ,CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber
    ,OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount
    ,ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt, Freight
    ,CarrierTrackingNumber, CustomerPONumber, OrderDate, DueDate, ShipDate
  )
 VALUES
  (@ProductKey, @OrderDateKey, @DueDateKey, @ShipDateKey, @CustomerKey, @PromotionKey
    , @CurrencyKey, @SalesTerritoryKey, @SalesOrderNumber, @SalesOrderLineNumber, @RevisionNumber
    , @OrderQuantity, @UnitPrice, @ExtendedAmount, @UnitPriceDiscountPct, @DiscountAmount
    , @ProductStandardCost, @TotalProductCost, @SalesAmount, @TaxAmt, @Freight
    , @CarrierTrackingNumber, @CustomerPONumber, @OrderDate, @DueDate, @ShipDate)

 FETCH NEXT FROM cur INTO @ProductKey, @OrderDateKey, @DueDateKey, @ShipDateKey, @CustomerKey, @PromotionKey
      , @CurrencyKey, @SalesTerritoryKey, @SalesOrderNumber, @SalesOrderLineNumber, @RevisionNumber
      , @OrderQuantity, @UnitPrice, @ExtendedAmount, @UnitPriceDiscountPct, @DiscountAmount
      , @ProductStandardCost, @TotalProductCost, @SalesAmount, @TaxAmt, @Freight
      , @CarrierTrackingNumber, @CustomerPONumber, @OrderDate, @DueDate, @ShipDate
END

-- Cleaning
CLOSE cur
DEALLOCATE cur

Let's create another table called Sales_WithoutGUID and loads the same. Note that its primary key is a combination of int and char column. This is what I explained above and suggested to the company: using two columns as a composite primary key rather than using uniqueidentifier. Using this composite key, company can be identified and records can be uniquely identified.

USE tempdb;
GO

IF OBJECT_ID('dbo.Sales_WithoutGUID') IS NOT NULL
 DROP TABLE dbo.Sales_WithoutGUID;
GO

-- Create the second table
CREATE TABLE dbo.Sales_WithoutGUID
(
 SalesKey int identity(1,1),
 CompanyCode char(2) Default ('SL'),
 constraint pk_Sales_WithoutGUID primary key (SalesKey, CompanyCode),
 ProductKey int NOT NULL,
 OrderDateKey int NOT NULL,
 DueDateKey int NOT NULL,
 ShipDateKey int NOT NULL,
 CustomerKey int NOT NULL,
 PromotionKey int NOT NULL,
 CurrencyKey int NOT NULL,
 SalesTerritoryKey int NOT NULL,
 SalesOrderNumber nvarchar(20) NOT NULL,
 SalesOrderLineNumber tinyint NOT NULL,
 RevisionNumber tinyint NOT NULL,
 OrderQuantity smallint NOT NULL,
 UnitPrice money NOT NULL,
 ExtendedAmount money NOT NULL,
 UnitPriceDiscountPct float NOT NULL,
 DiscountAmount float NOT NULL,
 ProductStandardCost money NOT NULL,
 TotalProductCost money NOT NULL,
 SalesAmount money NOT NULL,
 TaxAmt money NOT NULL,
 Freight money NOT NULL,
 CarrierTrackingNumber nvarchar(25) NULL,
 CustomerPONumber nvarchar(25) NULL,
 OrderDate datetime NULL,
 DueDate datetime NULL,
 ShipDate datetime NULL
);
GO

-- Loading
DECLARE cur CURSOR 
FOR
SELECT ProductKey, OrderDateKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey
      ,CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber
      ,OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount
      ,ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt, Freight
      ,CarrierTrackingNumber, CustomerPONumber, OrderDate, DueDate, ShipDate
  FROM AdventureWorksDW2014.dbo.FactInternetSales;

DECLARE @ProductKey int 
  , @OrderDateKey int
  , @DueDateKey int
  , @ShipDateKey int 
  , @CustomerKey int 
  , @PromotionKey int 
  , @CurrencyKey int 
  , @SalesTerritoryKey int 
  , @SalesOrderNumber nvarchar(20) 
  , @SalesOrderLineNumber tinyint 
  , @RevisionNumber tinyint 
  , @OrderQuantity smallint 
  , @UnitPrice money 
  , @ExtendedAmount money 
  , @UnitPriceDiscountPct float 
  , @DiscountAmount float 
  , @ProductStandardCost money 
  , @TotalProductCost money 
  , @SalesAmount money 
  , @TaxAmt money 
  , @Freight money 
  , @CarrierTrackingNumber nvarchar(25) 
  , @CustomerPONumber nvarchar(25)  
  , @OrderDate datetime  
  , @DueDate datetime  
  , @ShipDate datetime

OPEN cur

FETCH NEXT FROM cur INTO @ProductKey, @OrderDateKey, @DueDateKey, @ShipDateKey, @CustomerKey, @PromotionKey
      , @CurrencyKey, @SalesTerritoryKey, @SalesOrderNumber, @SalesOrderLineNumber, @RevisionNumber
      , @OrderQuantity, @UnitPrice, @ExtendedAmount, @UnitPriceDiscountPct, @DiscountAmount
      , @ProductStandardCost, @TotalProductCost, @SalesAmount, @TaxAmt, @Freight
      , @CarrierTrackingNumber, @CustomerPONumber, @OrderDate, @DueDate, @ShipDate

WHILE (@@FETCH_STATUS = 0)
BEGIN

 INSERT INTO dbo.Sales_WithoutGUID
  (
   ProductKey, OrderDateKey, DueDateKey, ShipDateKey, CustomerKey, PromotionKey
    ,CurrencyKey, SalesTerritoryKey, SalesOrderNumber, SalesOrderLineNumber, RevisionNumber
    ,OrderQuantity, UnitPrice, ExtendedAmount, UnitPriceDiscountPct, DiscountAmount
    ,ProductStandardCost, TotalProductCost, SalesAmount, TaxAmt, Freight
    ,CarrierTrackingNumber, CustomerPONumber, OrderDate, DueDate, ShipDate
  )
 VALUES
  (@ProductKey, @OrderDateKey, @DueDateKey, @ShipDateKey, @CustomerKey, @PromotionKey
    , @CurrencyKey, @SalesTerritoryKey, @SalesOrderNumber, @SalesOrderLineNumber, @RevisionNumber
    , @OrderQuantity, @UnitPrice, @ExtendedAmount, @UnitPriceDiscountPct, @DiscountAmount
    , @ProductStandardCost, @TotalProductCost, @SalesAmount, @TaxAmt, @Freight
    , @CarrierTrackingNumber, @CustomerPONumber, @OrderDate, @DueDate, @ShipDate)

 FETCH NEXT FROM cur INTO @ProductKey, @OrderDateKey, @DueDateKey, @ShipDateKey, @CustomerKey, @PromotionKey
      , @CurrencyKey, @SalesTerritoryKey, @SalesOrderNumber, @SalesOrderLineNumber, @RevisionNumber
      , @OrderQuantity, @UnitPrice, @ExtendedAmount, @UnitPriceDiscountPct, @DiscountAmount
      , @ProductStandardCost, @TotalProductCost, @SalesAmount, @TaxAmt, @Freight
      , @CarrierTrackingNumber, @CustomerPONumber, @OrderDate, @DueDate, @ShipDate
END

-- Cleaning
CLOSE cur
DEALLOCATE cur

Now let's analyze and see. The below code has two SELECT statements that check the fragmentation of both tables. As you see, first table is Externally Fragmented and second table is not fragmented. This is the biggest and most common issue with GUIDs. Since there is no sequential values when generating, when inserting records, in order to maintain the clustered key order, SQL Server splits data pages for moving records to new page, causing External Fragmentation (Read more on fragmentation at: http://dinesql.blogspot.com/2013/08/understanding-index-fragmentation-ss.html). Because of this, reading causes accessing larger number of data pages, slowing down the query and consuming resource.

Other issue is the space required. As you see, second table requires only 10MB for maintaining records while first table requires 15MB.

There is another considerable impact on non-clustered indexes if you, will make another post on it.

As you see the disadvantages with GUID now, try to minimize the usage of it if possible. There is a workaround if the order of the key is only the issue. The function NEWSEQUENTIALID can be used instead of NEWID for generating GUIDs but still the second issue discussed is remained unsorted.

SELECT * FROM sys.dm_db_index_physical_stats 
 (DB_ID(), OBJECT_ID(N'dbo.Sales_WithGUID', N'U'), 1, NULL, 'DETAILED')
SELECT * FROM sys.dm_db_index_physical_stats 
 (DB_ID(), OBJECT_ID(N'dbo.Sales_WithoutGUID', N'U'), 1, NULL, 'DETAILED')

EXEC sp_spaceused N'Sales_WithGUID';
EXEC sp_spaceused N'Sales_WithoutGUID';