Tuesday, February 23, 2010

Understanding the GROUPING SETS

SQL Server 2008 introduced GROUPING SETS as an extension to the GROUP BY, facilitating us to combine more than one group into a single result set. We can think it is as, combining multiple grouped aggregations into a single result set by using UNION ALL.

This is almost similar to CUBE and ROLLUP operators but easy to implement and offers best performance. Not only that, it is ANSI SQL 2006 compliant too.

Let’s understand this. The below code creates a SalesTransactions table and inserts 6 records. Then query to see the Sales Amount by (1.) Year, (2.) Customer, (3.) Year, Customer and Product. Result is shown too.

   1: USE tempdb
   2: GO
   3:  
   4: CREATE TABLE SalesTransactions
   5: (    [Year] smallint
   6:     , Customer varchar(30)
   7:     , Product varchar(30)
   8:     , Amount money
   9: )
  10: GO
  11: INSERT INTO SalesTransactions VALUES
  12:     (2001, 'Hendergart', 'Mountain-100 Black, 42', 100)
  13:     ,(2001, 'Hendergart', 'AWC Logo Cap, 42', 200)
  14:     ,(2001, 'Hendergart', 'Long-Sleeve Logo Jersey, M', 600)
  15:     ,(2001, 'Collins', 'Long-Sleeve Logo Jersey, M', 400)
  16:     ,(2002, 'Collins', 'Mountain-100 Black, 42', 300)
  17:     ,(2002, 'Collins', 'Mountain-100 Black, 42', 100)
  18:  
  19: SELECT [Year], SUM(Amount) Amount
  20: FROM SalesTransactions
  21: GROUP BY [Year]
  22:  
  23: SELECT Customer, SUM(Amount) Amount
  24: FROM SalesTransactions
  25: GROUP BY Customer
  26:  
  27: SELECT [Year], Customer, Product, SUM(Amount) Amount
  28: FROM SalesTransactions
  29: GROUP BY [Year], Customer, Product

2
If all grouped aggregations need to be compiled into a single result set, we can use UNION ALL as below.

   1: SELECT [Year], NULL AS Customer, NULL AS Product, SUM(Amount) AS Amount
   2: FROM SalesTransactions
   3: GROUP BY [Year]
   4: UNION ALL
   5: SELECT NULL AS [Year], Customer, NULL AS Product, SUM(Amount) AS Amount
   6: FROM SalesTransactions
   7: GROUP BY Customer
   8: UNION ALL
   9: SELECT [Year], Customer, Product, SUM(Amount) AS Amount
  10: FROM SalesTransactions
  11: GROUP BY [Year], Customer, Product
union

Okay, that’s the old technique, see how easily we can generate the same result set by using GROUPING SET. All you have to do is, add required grouping in GROUPING SET.

   1: SELECT [Year], Customer, Product, SUM(Amount) AS Amount
   2: FROM SalesTransactions
   3: GROUP BY GROUPING SETS ([YEAR], (Customer), ([YEAR], Customer, Product))
grouping

I see a small issue with this result set. Too much of NULLs. Do you see it too? These NULLs are not real NULLs, how can we identify whether the NULLs in the result set are real or not? There are two ways, you can use either GROUPING function that accepts one parameter as the column name or GROUPING_ID that accepts multiple columns as a parameter. First function returns 0 or 1 that indicates whether it is a real NULL or result of the GROUPING respectively. The second function returns a bitmask that shows whether passed columns contains NULLs or not.

   1: SELECT [Year], Customer, Product, SUM(Amount) AS Amount
   2:     , GROUPING_ID([Year], Customer, Product) AS Bitmask
   3:     , GROUPING([Year]) AS YearGrouping
   4:     , GROUPING(Customer) AS CustomerGrouping
   5:     , GROUPING(Product) AS ProductGrouping
   6: FROM SalesTransactions
   7: GROUP BY GROUPING SETS ([YEAR], (Customer), ([YEAR], Customer, Product))

Monday, February 22, 2010

Understanding the TOP and TABLESAMPLE operators

Result set can be filtered in many ways. Most common ways is, using the 3rd primary property of the SQL SELECT statement which is the WHERE condition which filters specific set of rows from the result set.

This post is not about WHERE condition, it is all about TOP and TABLESAMPLE.

The TOP operator gives you top n number of records as you need. The “top n” can be changed by changing the order of the result set by using ORDER BY clause.

The TABLESAMPLE operator allows you to randomly pick data from the table. You can instruct to SQL Server to return specific number of records or percent of rows.

Where we can use this? I think that this is commonly used for getting the average from a large result set. If you have a table with millions of records and average has to be calculated on one of the columns, it is worthwhile to use TABLESAMPLE (10 PERCENT) rather than using all records.

   1: -- use all rows in the table for calculating
   2: SELECT AVG(Freight)
   3: FROM Sales.SalesOrderHeader
   4: -- use approximately 10% of rows
   5: SELECT AVG(Freight)
   6: FROM Sales.SalesOrderHeader
   7: TABLESAMPLE (10 PERCENT)

REPEATABLE OPTION
If you execute the above two codes again and again, the average returns from first statement is always same but second statement. This is because of the way it picks records. If you need the same average for all the execution of second statement, REPEATABLE option should be used. The REPEATABLE option has to be used with a repeat_seed, and as long as the repeat_seed is same and no records have been change, same average is returned.

   1: SELECT AVG(Freight)
   2: FROM Sales.SalesOrderHeader
   3: TABLESAMPLE (10 PERCENT) REPEATABLE (1)

What is SYSTEM option?
The SYSTEM option is optional, but it is used by default though you do not use it in your query.

   1: SELECT AVG(Freight)
   2: FROM Sales.SalesOrderHeader
   3: TABLESAMPLE SYSTEM (10 PERCENT)

The SYSTEM option returns approximate percentage of rows and generates a random value for each data page. SQL Server decides which data pages to be included for the sample based on the random value generated and the percentage specified in the query. If a page is decided to to be included, all rows in the page is included for sampling, else the page is excluded. If you use set the TABLESAMPLE with number of rows, instead of a percentage, number of rows will be converted to a percent and process same way.

Monday, February 15, 2010

Are you calculating Average correctly?

Almost all engineers have used the AVG aggregate function, but have you ever checked whether the result is correct? This was discussed while I was conducting my new SQL class, here is an example for explaining it, where you might make a mistake;

   1: CREATE TABLE dbo.EmployeeCommision
   2: (
   3:     EmployeeId int PRIMARY KEY,
   4:     CommisionGiven money NULL
   5: );
   6:  
   7: -- Inserting five employees' commison records
   8: -- In real implementation, two tables 
   9: -- will maintain for holding employees records
  10: -- and commison records
  11: INSERT INTO dbo.EmployeeCommison 
  12:     (EmployeeId, CommisionGiven)
  13: VALUES
  14:     (1, 50), (2, 50), (3, NULL), (4, 50), (5,50)
  15:   
  16: -- Calculating the average commision given 
  17: -- to an employee, You may use LEFT OUTER JOIN 
  18: -- if two tables are maintained
  19: SELECT AVG(CommisonGiven) 
  20: FROM dbo.EmployeeCommision
  21: -- Result of above is 50, which is wrong in 
  22: -- this case to correct the issue, use 
  23: -- ISNULL function with AVG function
  24:  
  25: SELECT AVG(ISNULL(CommisonGiven, 0)) 
  26: FROM dbo.EmployeeCommision
  27: -- Now the result is 40, which is correct.

The reason for this is because, all aggregate functions ignore NULL values. Only exception is COUNT(*) that counts NULL values.

Wednesday, February 10, 2010

Developers Training Kit for SQL Server 2008 R2

Microsoft has published "SQL Server 2008 R2 - Training Kit for Developers" that contains presentations, demos, videos and hands-on lab related to new features of SQL Server 2008 R2. It is free, all can download. Here is the link: http://www.microsoft.com/downloads/details.aspx?displaylang=en&FamilyID=fffaad6a-0153-4d41-b289-a3ed1d637c0d This requires new AdventureWorks database, it is available here: http://msftdbprodsamples.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=24854

Tuesday, February 9, 2010

Hands-On Labs: Authoring Reports with Reporting Services 2008 R2

I will be doing a Hands-On labs session tomorrow (10th Wednesday) at TechEd Sri Lanka 2010, on Authoring Reports with Reporting Services 2008 R2. This session includes some of the new features of R2 like Maps, Sparklines, DataBars, Shared DataSets, Shared Report Items, and more if time is permitted. If you interest, make sure you are at the lab after lunch, it has been scheduled from 1.15pm to 3.15pm.

Monday, February 8, 2010

TechEd Sri Lanka 2010

It is happening..... It started yesterday and continues till Wednesday. It seems that a lot of good sessions have been scheduled, will blog about sessions later. If you guys come, visit our MVP stall, we have arranged a kind of raffle... you can try out an easy puzzle and grab a portable hard-disk :). If you guys interest in Business Intelligence, visit my company stall, IronOne stall and see what we have as Business Intelligence products. for more info about TechEd, visit http://www.teched.lk.