T-SQL: Aggregate Columns (CASE or Normalization + Aggregation)


In this example I will compare the execution of two possible options for aggregating multiple columns.
I have data columns A, B and C, and I need to show their average.

  • “CASE” uses a formula ((A + B + C) / (1, 2 or 3)), handling the scenarios where one or more of the columns (A, B, C) are NULL by dividing by the count of the remaining columns (1, 2 or 3).
  • “Normalization and Aggregation” uses a subquery to normalize the columns (A, B, C) into rows and calculates AVG().

In both cases, logic is applied to replace 0 with NULL so that it is not included in the aggregation.
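
As a minimal illustration of that logic (not part of the test below), NULLIF() turns 0 into NULL and AVG() then ignores it:

-- AVG() skips NULLs, so NULLIF(value, 0) excludes zeros from the average.
SELECT AVG(V.Col) AS AvgNonZero -- returns 15: only 10 and 20 are averaged
FROM (VALUES (NULLIF(10, 0)), (NULLIF(20, 0)), (NULLIF(0, 0))) AS V (Col);
GO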

I will execute the queries five times to check the execution times manually.

Create test database, test table and insert 1,000,000 random rows.

Create database.

USE master;
GO

IF (DB_ID('PL') IS NOT NULL)
	BEGIN
		ALTER DATABASE PL SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
		DROP DATABASE PL;
	END

CREATE DATABASE PL;
GO

USE PL;
GO

Create SourceTable.
IF (OBJECT_ID('PL.dbo.SourceTable') IS NOT NULL)
	BEGIN DROP TABLE PL.dbo.SourceTable; END
GO

CREATE TABLE PL.dbo.SourceTable
	(
		ID INT IDENTITY(1, 1) PRIMARY KEY
		, Structure1 NVARCHAR(16)
		, Structure2 NVARCHAR(32)
		, Data1 DECIMAL(6, 2)
		, Data2 DECIMAL(6, 2)
		, Data3 DECIMAL(6, 2)
	);
GO

CREATE INDEX IX_Structure1Structure2 ON PL.dbo.SourceTable (Structure1, Structure2);
GO

-- Populate with random values - Start
WITH CTE AS
	(
		SELECT TOP (1000000)
			CHAR(65 + FLOOR(RAND(CONVERT(VARBINARY, NEWID())) * 26)) AS Structure1
			, CHAR(65 + FLOOR(RAND(CONVERT(VARBINARY, NEWID())) * 26)) AS Structure2
			, ROUND(101 * RAND(CONVERT(VARBINARY, NEWID())), 2) AS Data1
			, ROUND(101 * RAND(CONVERT(VARBINARY, NEWID())), 2) AS Data2
			, ROUND(101 * RAND(CONVERT(VARBINARY, NEWID())), 2) AS Data3
		FROM
			AdventureWorks2008R2.sys.all_objects AS T1
			CROSS JOIN AdventureWorks2008R2.sys.all_objects AS T2
	)
INSERT PL.dbo.SourceTable (Structure1, Structure2, Data1, Data2, Data3)
SELECT Structure1, Structure2, Data1, Data2, Data3
FROM CTE;
GO

-- NULL
UPDATE PL.dbo.SourceTable
SET Data1 = NULL
WHERE Data1 BETWEEN 46 AND 52;
GO

UPDATE PL.dbo.SourceTable
SET Data2 = NULL
WHERE Data2 BETWEEN 23 AND 25;
GO

UPDATE PL.dbo.SourceTable
SET Data3 = NULL
WHERE Data3 BETWEEN 12 AND 16;
GO

UPDATE PL.dbo.SourceTable
SET
	Data1 = NULL
	, Data2 = NULL
	, Data3 = NULL
WHERE Data1 + Data2 + Data3 BETWEEN 73 AND 85;
GO
-- Populate with random values - End

SELECT *
FROM PL.dbo.SourceTable;
GO
-- Execution time (seconds): 22, 21, 21, 21, 21

Source Table

Aggregate Data1, Data2 and Data3 by using a CASE expression.
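
A minimal sketch of a CASE-based query of this kind, assuming the formula described above (sum the non-NULL, non-zero values and divide by their count):

SELECT
	ID
	, Structure1, Structure2
	, Data1, Data2, Data3
	, CASE
		WHEN COALESCE(NULLIF(Data1, 0), NULLIF(Data2, 0), NULLIF(Data3, 0)) IS NULL
			THEN NULL -- all three values are NULL or 0, nothing to average
		ELSE
			(ISNULL(NULLIF(Data1, 0), 0) + ISNULL(NULLIF(Data2, 0), 0) + ISNULL(NULLIF(Data3, 0), 0))
			/ (CASE WHEN NULLIF(Data1, 0) IS NULL THEN 0 ELSE 1 END
				+ CASE WHEN NULLIF(Data2, 0) IS NULL THEN 0 ELSE 1 END
				+ CASE WHEN NULLIF(Data3, 0) IS NULL THEN 0 ELSE 1 END)
	  END AS Avg123
FROM PL.dbo.SourceTable;
GO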

CASE Result

In the estimated and actual execution plans we see that 1,000,000 rows are selected and calculated with no joins.

CASE Estimated Execution Plan

CASE Actual Execution Plan

SET STATISTICS IO ON; -- OFF
SET STATISTICS TIME ON; -- OFF
GO

CASE STATISTICS

Aggregate Data1, Data2 and Data3 by normalizing with a subquery and AVG().

SELECT
	ID
	, Structure1, Structure2
	, Data1, Data2, Data3
	, '<Src | 12>' AS '<->'
	, (ISNULL(Data1, 0) + ISNULL(Data2, 0)) AS 'Data1 + Data2'
	, (
			SELECT AVG(AB.Col)
			FROM  
				(
					SELECT NULLIF(Data1, 0) AS Col -- exclude 0 from the aggregation
					UNION ALL SELECT NULLIF(Data2, 0)
				) AS AB
	) AS Avg12
	, '<12 | 123>' AS '<->'
	, Data1, Data2, Data3
	, (ISNULL(Data1, 0) + ISNULL(Data2, 0) + ISNULL(Data3, 0)) AS 'Data1 + Data2 + Data3'
	, (
			SELECT AVG(ABC.Col)
			FROM  
				(
					SELECT NULLIF(Data1, 0) AS Col
					UNION ALL SELECT NULLIF(Data2, 0)
					UNION ALL SELECT NULLIF(Data3, 0)
				) AS ABC
	) AS Avg123
FROM PL.dbo.SourceTable;
GO
-- Execution time (seconds): 31, 30, 30, 31, 30

Normalization and Aggregation Result

The estimated execution plan shows that the aggregations are performed per row (3 rows looped for every source row).

Normalization and Aggregation Estimated Execution Plan

The actual execution plan shows that the aggregation is done once over the whole set (1,000,000 rows * 3 columns = 3,000,000 rows). The query optimizer decides to use a set-based operation instead of aggregating on every row.

Normalization and Aggregation Actual Execution Plan

Comparing the reads shows they are the same, and the times are close. Normalization is a little slower because of the join of the data.

Normalization and Aggregation STATISTICS

Drop the database.

USE master;
GO

IF (DB_ID('PL') IS NOT NULL)
	BEGIN
		ALTER DATABASE PL SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
		DROP DATABASE PL;
	END
GO

Conclusion:

  • The query optimizer gave me the answer: use the “Normalization and Aggregation” option.
  • When using the “CASE” code, make sure that all the possible scenarios are covered.
  • In both cases, check whether the data table uses 0 in place of NULL and decide whether to exclude 0 from the aggregation.
  • This exact example acts like an inline table-valued function, which is executed once as a set; a scalar function would be executed for every row (see the sketch below).
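
A minimal sketch of that idea as an inline table-valued function (hypothetical name dbo.AvgNonZero, not part of the test above; it would have to run while the PL database still exists), using the same normalization of the three columns into rows:

CREATE FUNCTION dbo.AvgNonZero (@Data1 DECIMAL(6, 2), @Data2 DECIMAL(6, 2), @Data3 DECIMAL(6, 2))
RETURNS TABLE
AS
RETURN
(
	-- Pivot the three values into rows, turn 0 into NULL and let AVG() skip it.
	SELECT AVG(V.Col) AS AvgValue
	FROM (VALUES (NULLIF(@Data1, 0)), (NULLIF(@Data2, 0)), (NULLIF(@Data3, 0))) AS V (Col)
);
GO

-- CROSS APPLY keeps the function inline, so the optimizer expands it into the outer query's plan.
SELECT S.ID, A.AvgValue
FROM PL.dbo.SourceTable AS S
CROSS APPLY dbo.AvgNonZero(S.Data1, S.Data2, S.Data3) AS A;
GO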

Keep it simple :-)
