T-SQL Optimization: Nonclustered Index and GROUP BY


In this example I play with the performance of a query that groups the data.

Clone database.

USE [master];
GO

IF (DB_ID(N'PL_AdventureWorks2008R2') IS NOT NULL)
	BEGIN
		ALTER DATABASE [PL_AdventureWorks2008R2] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
		DROP DATABASE [PL_AdventureWorks2008R2];
	END
GO

SELECT [name]
FROM sys.databases
WHERE [name] = N'PL_AdventureWorks2008R2';
GO

BACKUP DATABASE [AdventureWorks2008R2]
	TO DISK = N'C:\Temp\Backups\PL_AdventureWorks2008R2_Data.bak'
	WITH INIT;
GO

RESTORE DATABASE [PL_AdventureWorks2008R2]
FROM DISK = N'C:\Temp\Backups\PL_AdventureWorks2008R2_Data.bak'
	WITH RECOVERY
		, MOVE 'AdventureWorks2008R2_Data' TO 'C:\Temp\Databases\PL_AdventureWorks2008R2_Data.mdf'
		, MOVE 'AdventureWorks2008R2_Log' TO 'C:\Temp\Databases\PL_AdventureWorks2008R2_Log.ldf';
GO

ALTER DATABASE [PL_AdventureWorks2008R2]
MODIFY FILE
 (
	 NAME = N'AdventureWorks2008R2_Data'
	 , NEWNAME = N'PL_AdventureWorks2008R2_Data'
 );
GO
 
ALTER DATABASE [PL_AdventureWorks2008R2]
MODIFY FILE
 (
	 NAME = N'AdventureWorks2008R2_Log'
	 , NEWNAME = N'PL_AdventureWorks2008R2_Log'
 );
GO

SELECT [name], [physical_name], [type_desc]
FROM [PL_AdventureWorks2008R2].sys.database_files;
GO

Create source table.
USE PL_AdventureWorks2008R2;
GO

CREATE SCHEMA pl;
GO

IF (OBJECT_ID(N'PL_AdventureWorks2008R2.pl.TestTableSource', N'U') IS NOT NULL)
	BEGIN DROP TABLE PL_AdventureWorks2008R2.pl.TestTableSource; END
GO

SELECT
	SOH.CustomerID, SOD.ProductID
	, SOD.OrderQty, SOD.LineTotal
INTO PL_AdventureWorks2008R2.pl.TestTableSource
FROM
	Sales.SalesOrderHeader AS SOH
	JOIN Sales.SalesOrderDetail As SOD
		ON SOH.SalesOrderID = SOD.SalesOrderID;
GO
-- (121317 row(s) affected)

-- Add a few fake rows
INSERT PL_AdventureWorks2008R2.pl.TestTableSource
	(
		CustomerID, ProductID
		, OrderQty, LineTotal
	)
SELECT
	CustomerID, ProductID
	, (OrderQty + (101 * RAND(CONVERT(VARBINARY, NEWID()))))
	, (LineTotal + (101 * RAND(CONVERT(VARBINARY, NEWID()))))
FROM PL_AdventureWorks2008R2.pl.TestTableSource;
GO 6
/*
Beginning execution loop

(121317 row(s) affected)

(242634 row(s) affected)

(485268 row(s) affected)

(970536 row(s) affected)

(1941072 row(s) affected)

(3882144 row(s) affected)
Batch execution completed 6 times.
*/

-- http://sqlperformance.com/2014/10/t-sql-queries/bad-habits-count-the-hard-way
SELECT SUM(P.[rows])
FROM
	PL_AdventureWorks2008R2.sys.partitions AS P
	JOIN PL_AdventureWorks2008R2.sys.tables AS T
		ON P.[object_id] = T.[object_id]
	JOIN PL_AdventureWorks2008R2.sys.schemas AS S
		ON T.[schema_id] = S.[schema_id]
WHERE
	P.[index_id] IN (0, 1) -- heap or clustered index
	AND T.[name] = N'TestTableSource'
	AND S.[name] = N'pl';
GO
-- 7764288

-- Space used
EXEC sp_spaceused 'PL_AdventureWorks2008R2.pl.TestTableSource';
GO

Space Used 1

Turn on the statistics and the execution plan to know what is going on under the hood.
Turn them off while testing manually the execution time.

SET STATISTICS IO ON; -- OFF
SET STATISTICS TIME ON; -- OFF
GO

-- Ctrl + M (Include Actual Execution Plan)

Test query.
IF (OBJECT_ID(N'tempdb.dbo.#TestTableResult', N'U') IS NOT NULL)
	BEGIN DROP TABLE #TestTableResult; END

SELECT
	CustomerID, ProductID
	, SUM(OrderQty) AS Sum_OrderQty
	, SUM(LineTotal) AS Sum_LineTotal
INTO #TestTableResult
FROM PL_AdventureWorks2008R2.pl.TestTableSource
GROUP BY CustomerID, ProductID;
GO

Turn off the statistics and the actual execution plan and run the tests multiple times to observe the execution time.

Test 1: Heap
Primary key column = No, Indexes = No.
(79433 row(s) affected)
Execution Time (sec): 7, 7, 7, 7, 7

Test 1 Execution Plan

Test 1 Statistics

Test 2: Primary Key (clustered index)
Add column ID – identity, primary key
(79433 row(s) affected)
Execution Time (sec): 7, 7, 7, 7, 7

ALTER TABLE PL_AdventureWorks2008R2.pl.TestTableSource
	ADD ID INT IDENTITY(1, 1);

ALTER TABLE PL_AdventureWorks2008R2.pl.TestTableSource
	ADD CONSTRAINT PK_TestTableSource PRIMARY KEY (ID);
GO

Test 2 Execution Plan

Test 2 Statistics

Test 3: PK = Yes; Nonclustered Index = Yes
Key columns are the columns that will be used for grouping.
(79433 row(s) affected)
Execution Time (sec): 7, 7, 7, 7, 7

CREATE NONCLUSTERED INDEX IX_Test3
	ON PL_AdventureWorks2008R2.pl.TestTableSource (CustomerID, ProductID);
GO

Test 3 Execution Plan

Test 3 Statistics

Test 4: PK = Yes; Nonclustered Indes with Included Columns = Yes
Delete the previously created index and create a new one.
Key columns are the columns that will be used for grouping.
Included columns are the ones that will be aggregated.
(79433 row(s) affected)
Execution Time (sec): 4, 3, 3, 3, 3

DROP NONCLUSTERED INDEX IX_Test3
	ON PL_AdventureWorks2008R2.pl.TestTableSource;
GO

CREATE NONCLUSTERED INDEX IX_Test4
	ON PL_AdventureWorks2008R2.pl.TestTableSource (CustomerID, ProductID)
	INCLUDE (OrderQty, LineTotal);
GO

Test 4 Execution Plan

Test 4 Statistics

Test 5: PK = No; Nonclustered Indes with Included Columns = Yes

Delete column ID
(79433 row(s) affected)
Execution Time (sec): 3, 3, 3, 3, 3

ALTER TABLE PL_AdventureWorks2008R2.pl.TestTableSource
	DROP CONSTRAINT PK_TestTableSource;

ALTER TABLE PL_AdventureWorks2008R2.pl.TestTableSource
	DROP COLUMN ID;
GO

Test 5 Execution Plan

Test 5 Statistics

Space used by the nonclustered index is The Price I Pay.

-- Space used
EXEC sp_spaceused 'PL_AdventureWorks2008R2.pl.TestTableSource';
GO

Space Used 2

Delete database.

USE [master];
GO

IF (DB_ID(N'PL_AdventureWorks2008R2') IS NOT NULL)
	BEGIN
		ALTER DATABASE [PL_AdventureWorks2008R2] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
		DROP DATABASE [PL_AdventureWorks2008R2];
	END
GO

Delete C:\Temp\Backups\PL_AdventureWorks2008R2_Data.bak

Conclusion:

  • Adding a noncluster index with key columns the grouped columns and included the aggregated columns is dividing the execution time by 2.33.
  • The primary key¬†(Column ID) is not adding a value to the performance.
  • In this example the nonclustered index doubles the disk space used.

Keep it simple :-)