CTEs are not Staging Tables

By Greg 1 comment

If you have a complex query, a CTE (Common Table Expression) can be a great way to break down the query into individual parts. What is often overlooked by developers is that CTEs can cause significant performance problems when used incorrectly.

A common pattern I have seen is joining to a CTE multiple times. This is done in the belief that the CTE will run once and stage the data, and then we can further manipulate the data to get the desired output. Is this belief right? Let’s find out.

For this example, we’ll be using Microsoft’s Northwind database to get a list of employees and count the number of orders they have placed. Pretty simple query, but lets say we want to put this in a CTE because we want to group data a different way in the query:

WITH EmployeeTotalSales AS
(
	SELECT		Orders.EmployeeID
			,COUNT(Orders.OrderID) AS TotalSales
	FROM		Orders

	GROUP BY	Orders.EmployeeID
)
SELECT	*
FROM	EmployeeTotalSales AS e1

The question to ask then, is how many times will the Orders table be read? For this query, it’s just the once. We do a scan of Orders, run it through a stream aggregate, and output the result:

What happens though, if I join to the CTE more than once? Does the CTE keep a copy of the data hanging around? Let’s try it out:

--Get Total Sales for each employee for each customer
WITH EmployeeTotalSales AS
(
	SELECT		Orders.EmployeeID
				,COUNT(Orders.OrderID) AS TotalSales
	FROM		Orders

	GROUP BY	Orders.EmployeeID
)
SELECT	*
FROM	EmployeeTotalSales AS e1
INNER JOIN EmployeeTotalSales AS e2 ON e1.EmployeeID = e2.EmployeeID
INNER JOIN EmployeeTotalSales AS e3 ON e1.EmployeeID = e3.EmployeeID
INNER JOIN EmployeeTotalSales AS e4 ON e1.EmployeeID = e4.EmployeeID
INNER JOIN EmployeeTotalSales AS e5 ON e1.EmployeeID = e5.EmployeeID
INNER JOIN EmployeeTotalSales AS e6 ON e1.EmployeeID = e6.EmployeeID
INNER JOIN EmployeeTotalSales AS e7 ON e1.EmployeeID = e7.EmployeeID
INNER JOIN EmployeeTotalSales AS e8 ON e1.EmployeeID = e8.EmployeeID
INNER JOIN EmployeeTotalSales AS e9 ON e1.EmployeeID = e9.EmployeeID

Nope, it runs the exact same query every time. Every time you join to a CTE, SQL Server runs that CTE again. SQL Server does this because you could run a CTE multiple times and get different results. You might have a where clause in the CTE, or the join predicate might be different each time you access it.

What should you do when you need to reference the same dataset many times? As with all things in IT: it depends. In the above example, it probably doesn’t matter that you hit the orders table many times. It’s doing a scan on less than 1000 rows and then doing an aggregate over the top of it. If you did need to join to the data that many times (I don’t know why you would need to) then scanning the orders table vs scanning a temp table (or even seeking if you put an index on it) wouldn’t make a huge difference.

But CTEs are often far more complex. They often involve many tables, sometimes CTEs are built off CTEs, joining to what has come before. Sometimes the processing can take seconds or minutes or even longer to produce the dataset just once. If that’s the case then don’t make SQL to go and query the data again. Put it in a temp table. Tell SQL Server to calculate the data once and stage it in a staging table, and then reference that data in your queries. It will often make your queries much faster, it will often allow SQL to get a better query plan (compilation timeout seems to be common when many CTEs are nested), and you can even index the staged data to further improve performance when you join to it.

1 Comment

Siddharth

Aug 8, 2020, 2:30 am

Good Post. Help to understand that CTE is not a staging table

Leave a Reply