Saturday, May 23, 2015

Fact Table: Clustered index on date key is enough or partitioning should be done with it?

Fact table in a data warehouse (or a large table in an OLTP database) always holds millions of records and we keep on adding records largely. Generally, it is discouraged to add indexes and constraints to Fact Tables but with modern way of implementing data warehouse / BI solutions, if the table is open for business users for analysis, having indexes on Fact Tables would be beneficial. However, whether it is a traditional implementation or modern implementation, it is still recommended to have a clustered index on most frequent key used for analysis and loading which is Date Key.

While discussing Design Considerations of Fact Tables, a question came up: Clustered index on date key is enough or should we partition the table based on date key too? Do we get additional performance benefits by partitioning on the same?

In terms of performance, if the table is not significantly large, I doubt that clustered index on date key and partitioning on the same will give extra performance benefits. But if it is a very large table, performance of data retrieval will be surely improved. Does this mean that Partitioning is not required for Fact Tables?

Let's analyze with a slightly large table. Below code shows a structure of InternetSales table that contains 7 millions records. It has no indexes, not even a clustered index.

-- Table structure
CREATE TABLE dbo.InternetSales(
 OrderDateKey int NOT NULL,
 ProductKey int NOT NULL,
 DueDateKey int NOT NULL,
 ShipDateKey int NOT NULL,
 CustomerKey int NOT NULL,
 PromotionKey int NOT NULL,
 CurrencyKey int NOT NULL,
 SalesTerritoryKey int NOT NULL,
 SalesOrderNumber nvarchar(20) NOT NULL,
 SalesOrderLineNumber tinyint NOT NULL,
 RevisionNumber tinyint NOT NULL,
 OrderQuantity smallint NOT NULL,
 UnitPrice money NOT NULL,
 ExtendedAmount money NOT NULL,
 UnitPriceDiscountPct float NOT NULL,
 DiscountAmount float NOT NULL,
 ProductStandardCost money NOT NULL,
 TotalProductCost money NOT NULL,
 SalesAmount money NOT NULL,
 TaxAmt money NOT NULL,
 Freight money NOT NULL,
 CarrierTrackingNumber nvarchar(25) NULL,
 CustomerPONumber nvarchar(25) NULL,
 OrderDate datetime NULL,
 DueDate datetime NULL,
 ShipDate datetime NULL
) ON PRIMARY;
GO

If business user runs a query like below against the table, you can witness the number of pages (IO) have to be read for satisfying the query.

SET STATISTICS IO ON
GO

SELECT 
 DATENAME(m,[OrderDate])
 , p.EnglishProductName Product
  , SUM(SalesAmount) AS TotalSales
  , SUM(TaxAmt) AS TotalTaxAmount
FROM dbo.InternetSales f
 INNER JOIN dbo.DimProduct p
  ON f.ProductKey = p.ProductKey
 INNER JOIN dbo.DimDate d
  ON f.OrderDateKey = d.DateKey
WHERE d.CalendarYear = 2013 
 AND d.EnglishDayNameOfWeek = 'Monday'
GROUP BY DATENAME(m,[OrderDate])
 , p.EnglishProductName;


If we add a clustered index on Date Key and then test the same;

CREATE CLUSTERED INDEX ix_InternetSales
ON InternetSales (OrderDateKey);
GO

SELECT 
 DATENAME(m,[OrderDate])
 , p.EnglishProductName Product
  , SUM(SalesAmount) AS TotalSales
  , SUM(TaxAmt) AS TotalTaxAmount
FROM dbo.InternetSales f
 INNER JOIN dbo.DimProduct p
  ON f.ProductKey = p.ProductKey
 INNER JOIN dbo.DimDate d
  ON f.OrderDateKey = d.DateKey
WHERE d.CalendarYear = 2013 
 AND d.EnglishDayNameOfWeek = 'Monday'
GROUP BY DATENAME(m,[OrderDate])
 , p.EnglishProductName;

We will see a significant difference between previous number of pages read and number of pages read now.



Now the question is, partitioning this table will give more benefits or not. Let's partition and see.

-- Create a partition function
CREATE PARTITION FUNCTION pf_InternetSalesDate (int)
AS RANGE RIGHT
FOR VALUES ('20100101', '20110101', '20120101', '20130101', '20140101')
GO

-- Create the scheme
CREATE PARTITION SCHEME ps_InternetSalesDate
AS PARTITION pf_InternetSalesDate
TO (HistoryFileGroup, HistoryFileGroup, HistoryFileGroup, HistoryFileGroup, HistoryFileGroup, CurrentFileGroup)
GO

-- Dropping clustered index for partitioning
DROP INDEX ix_InternetSales ON InternetSales;
GO

-- Recreating with Partitioning scheme
CREATE CLUSTERED INDEX ix_InternetSales
ON InternetSales (OrderDateKey)
ON ps_InternetSalesDate (OrderDateKey);
go

-- And executing the same Business user's query
SELECT 
 DATENAME(m,[OrderDate])
 , p.EnglishProductName Product
  , SUM(SalesAmount) AS TotalSales
  , SUM(TaxAmt) AS TotalTaxAmount
FROM dbo.InternetSales f
 INNER JOIN dbo.DimProduct p
  ON f.ProductKey = p.ProductKey
 INNER JOIN dbo.DimDate d
  ON f.OrderDateKey = d.DateKey
WHERE d.CalendarYear = 2013 
 AND d.EnglishDayNameOfWeek = 'Monday'
GROUP BY DATENAME(m,[OrderDate])
 , p.EnglishProductName;

Here is the result of it;


This result does not show much improvement when compare with previous result, hence one can come to a conclusion saying that partitioning is not required when a clustered index is created on date key. But even with this table, we see a slight improvement, means that a large table will definitely benefits with partitioning.

Okay, now what about this table? should we keep partitioning or not? I prefer to have both implementations because Partitioning gives few more benefits in addition to the performance benefit it offers. It offers More Granular Manageability, letting us perform some operation such as indexing, compression and backing up at partition level instead of on the whole table. It improves Data Loading since loading can be managed at partition levels. Considering all of these, should not we create the clustered index on date key as well as partition the table using date key?


No comments: