In this example I compare the execution of two possible options for aggregating multiple columns. I have data columns A, B and C, and I need to show their average.
- “CASE” uses a formula ((A + B + C) / n), where n is the number of non-NULL columns (1, 2 or 3), to handle the scenarios when one or more of the columns A, B, C is NULL.
- “Normalization and Aggregation” uses a subquery to normalize the columns (A, B, C) into rows and calculates AVG().
In both cases, logic is applied to replace 0 with NULL so that zeros are excluded from the aggregation.
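As a small illustration of that logic (the sample values here are mine, not from the test table): with A = 10, B = 0 and C = NULL, NULLIF turns the 0 into NULL, and AVG then ignores both NULLs, so only A counts:

```sql
-- A = 10, B = 0, C = NULL: NULLIF(value, 0) maps 0 to NULL,
-- and AVG ignores NULLs, so only the value 10 is averaged
SELECT AVG(T.Col) AS AvgABC
FROM (
    SELECT NULLIF(CAST(10 AS DECIMAL(6, 2)), 0) AS Col
    UNION ALL SELECT NULLIF(CAST(0 AS DECIMAL(6, 2)), 0)
    UNION ALL SELECT NULLIF(CAST(NULL AS DECIMAL(6, 2)), 0)
) AS T;
-- Returns 10.00
```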
I will execute each query five times to check the execution times manually.
Create test database, test table and insert 1,000,000 random rows.
Create database.
USE master;
GO
IF (DB_ID('PL') IS NOT NULL)
BEGIN
    ALTER DATABASE PL SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DROP DATABASE PL;
END
CREATE DATABASE PL;
GO
USE PL;
GO
Create SourceTable.
IF (OBJECT_ID('PL.dbo.SourceTable') IS NOT NULL)
BEGIN
    DROP TABLE PL.dbo.SourceTable;
END
GO
CREATE TABLE PL.dbo.SourceTable (
    ID INT IDENTITY(1, 1) PRIMARY KEY
    , Structure1 NVARCHAR(16)
    , Structure2 NVARCHAR(32)
    , Data1 DECIMAL(6, 2)
    , Data2 DECIMAL(6, 2)
    , Data3 DECIMAL(6, 2)
);
GO
CREATE INDEX IX_Structure1Structure2 ON PL.dbo.SourceTable (Structure1, Structure2);
GO
-- Populate with random values - Start
WITH CTE AS (
    SELECT TOP (1000000)
        CHAR(65 + FLOOR(RAND(CONVERT(VARBINARY, NEWID())) * 26)) AS Structure1
        , CHAR(65 + FLOOR(RAND(CONVERT(VARBINARY, NEWID())) * 26)) AS Structure2
        , ROUND(101 * RAND(CONVERT(VARBINARY, NEWID())), 2) AS Data1
        , ROUND(101 * RAND(CONVERT(VARBINARY, NEWID())), 2) AS Data2
        , ROUND(101 * RAND(CONVERT(VARBINARY, NEWID())), 2) AS Data3
    FROM AdventureWorks2008R2.sys.all_objects AS T1
    CROSS JOIN AdventureWorks2008R2.sys.all_objects AS T2
)
INSERT PL.dbo.SourceTable (Structure1, Structure2, Data1, Data2, Data3)
SELECT Structure1, Structure2, Data1, Data2, Data3
FROM CTE;
GO
-- NULL
UPDATE PL.dbo.SourceTable SET Data1 = NULL WHERE Data1 BETWEEN 46 AND 52;
GO
UPDATE PL.dbo.SourceTable SET Data2 = NULL WHERE Data2 BETWEEN 23 AND 25;
GO
UPDATE PL.dbo.SourceTable SET Data3 = NULL WHERE Data3 BETWEEN 12 AND 16;
GO
UPDATE PL.dbo.SourceTable
SET Data1 = NULL
    , Data2 = NULL
    , Data3 = NULL
WHERE Data1 + Data2 + Data3 BETWEEN 73 AND 85;
GO
-- Populate with random values - End
SELECT * FROM PL.dbo.SourceTable;
GO
-- Execution time (seconds): 22, 21, 21, 21, 21
Aggregate Data1, Data2 and Data3 by using a CASE expression.
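The original query for this option is not reproduced above; a minimal sketch of the CASE-based formula, assuming the SourceTable created earlier (the alias Avg123 and the exact CASE shape are mine), could look like this:

```sql
SELECT ID
    , Structure1, Structure2
    , Data1, Data2, Data3
    -- Sum of the non-NULL, non-zero values divided by their count;
    -- the outer NULLIF avoids divide-by-zero when all three values are NULL or 0
    , (ISNULL(NULLIF(Data1, 0), 0) + ISNULL(NULLIF(Data2, 0), 0) + ISNULL(NULLIF(Data3, 0), 0))
      / NULLIF(  CASE WHEN NULLIF(Data1, 0) IS NULL THEN 0 ELSE 1 END
               + CASE WHEN NULLIF(Data2, 0) IS NULL THEN 0 ELSE 1 END
               + CASE WHEN NULLIF(Data3, 0) IS NULL THEN 0 ELSE 1 END, 0) AS Avg123
FROM PL.dbo.SourceTable;
GO
```

Note how many CASE branches are needed just for three columns; this is the "make sure all scenarios are covered" cost mentioned in the conclusion.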
In both the estimated and actual execution plans we see that 1,000,000 rows are selected and calculated, with no joins.
Estimated Execution Plan.
Actual Execution Plan.
Enable STATISTICS IO and STATISTICS TIME to capture the reads and execution times (set them to OFF when done).
SET STATISTICS IO ON; -- OFF
SET STATISTICS TIME ON; -- OFF
GO
Aggregate Data1, Data2 and Data3 by normalizing them with a subquery and applying AVG().
SELECT ID
    , Structure1, Structure2
    , Data1, Data2, Data3
    , '<Src | 12>' AS '<->'
    , (ISNULL(Data1, 0) + ISNULL(Data2, 0)) AS 'Data1 + Data2'
    , (
        SELECT AVG(AB.Col)
        FROM (
            SELECT NULLIF(Data1, 0) AS Col -- exclude 0 from the aggregation
            UNION ALL
            SELECT NULLIF(Data2, 0)
        ) AS AB
      ) AS Avg12
    , '<12 | 123>' AS '<->'
    , Data1, Data2, Data3
    , (ISNULL(Data1, 0) + ISNULL(Data2, 0) + ISNULL(Data3, 0)) AS 'Data1 + Data2 + Data3'
    , (
        SELECT AVG(ABC.Col)
        FROM (
            SELECT NULLIF(Data1, 0) AS Col
            UNION ALL
            SELECT NULLIF(Data2, 0)
            UNION ALL
            SELECT NULLIF(Data3, 0)
        ) AS ABC
      ) AS Avg123
FROM PL.dbo.SourceTable;
GO
-- Execution time (seconds): 31, 30, 30, 31, 30
The estimated execution plan shows the aggregations performed per row (3 rows looped for every source row).
The actual execution plan shows that the aggregations are done once over the whole set (1,000,000 rows * 3 columns = 3,000,000 rows). The query optimizer decides to use a set-based operation instead of aggregating on every row.
Comparing the reads shows they are the same, and the times are close. Normalization is slightly slower because of the join over the unpivoted data.
Delete database.
USE master;
GO
IF (DB_ID('PL') IS NOT NULL)
BEGIN
    ALTER DATABASE PL SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DROP DATABASE PL;
END
GO
Conclusion:
- The query optimizer gave me the answer: use the “Normalization and Aggregation” option.
- When using the “CASE” code, make sure that all the possible scenarios are covered.
- In both cases, check whether the table uses 0 to mean NULL and decide whether to exclude 0 from the aggregation.
- This exact example acts like a table-valued function, which is executed once; a scalar function would be executed for every row.
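To illustrate that last point, here is a minimal sketch of the same calculation wrapped in an inline table-valued function (the name dbo.fn_Avg123 is hypothetical, not part of the original example). Because the function is inline, the optimizer expands it into the outer query and evaluates it set-based, unlike a scalar UDF that is invoked once per row:

```sql
-- Hypothetical inline table-valued function for the normalized average
CREATE FUNCTION dbo.fn_Avg123 (@Data1 DECIMAL(6, 2), @Data2 DECIMAL(6, 2), @Data3 DECIMAL(6, 2))
RETURNS TABLE
AS
RETURN (
    SELECT AVG(T.Col) AS AvgValue
    FROM (
        SELECT NULLIF(@Data1, 0) AS Col -- exclude 0 from the aggregation
        UNION ALL SELECT NULLIF(@Data2, 0)
        UNION ALL SELECT NULLIF(@Data3, 0)
    ) AS T
);
GO
-- The function body is inlined into this query's plan
SELECT S.ID, F.AvgValue
FROM PL.dbo.SourceTable AS S
CROSS APPLY dbo.fn_Avg123(S.Data1, S.Data2, S.Data3) AS F;
GO
```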
Keep it simple :-)