Sunday, June 4, 2017

Calculate the average value against a large table - SQL Server Brain Basher of the Week #068

Let's talk about another Interview Question related to SQL Server development. This is about one of the aggregate functions that is AVG. Here is the question;

You have a table called SalesOrderHeader and it has millions of records. It has a column called Freight and you need to the average of it. You need to make sure that only 10% of records is used for calculating the average.

What would be the best way? The standard answer is, write a query to get 10% of records using TOP and calculate the average.

SELECT AVG(t.Freight) AverageFreight
FROM (SELECT TOP 10 PERCENT Freight
 FROM Sales.SalesOrderHeader) t

But this might not give you the accurate average as you consider only first set of records. It will be more accurate if you pick records randomly and then calculate the average. You may add ORDER BY to your statement with NEW_ID function.

SELECT AVG(t.Freight) AverageFreight
FROM (SELECT TOP 10 PERCENT Freight
 FROM Sales.SalesOrderHeader
 ORDER BY NEWID()) t

Second method gives the most accurate value as it picks records randomly. However the cost is high with the statement. There is another way to achieve the same. It is using TABLESAMPLE operator.

SELECT AVG(Freight) AverageFreight
FROM Sales.SalesOrderHeader
TABLESAMPLE (10 PERCENT)

Here is the comparison between both methods. Notice the total cost.


As you see, TABLESAMPLE gives better performance than the first method.

Since it picks records randomly, the average it returns different at each execution. If you need to same value for all your executions, REPEATABLE option has to be used with repeat_seed

SELECT AVG(Freight) AverageFreight
FROM Sales.SalesOrderHeader
TABLESAMPLE (10 PERCENT) REPEATABLE (1)

You will get the same average as long as the repeat_seed is same.

No comments: