Rollup and Cube Part 2
Previously in Part 1, we looked at the ROLLUP keyword that can be applied to SQL Select queries. We are now going to look at the similar, but different CUBE keyword.
As we saw last time, the ROLLUP keyword would sum each part of the GROUP BY and give you a running total. The CUBE keyword is similar, in that it gives running totals, but it gives you a running total across all of the different combinations of the GROUP BY clause. Let’s take a look
SELECT County.CountyName
,County.State
,Population.Year
,SUM(Population.Population)
FROM Population
INNER JOIN County
ON County.Id = Population.CountyId
WHERE County.State IN ( 'HI', 'RI' )
AND Population.Year > 2016
GROUP BY CUBE(Population.Year, County.State, County.CountyName)
Almost the same query as last time, we’re just using the CUBE instead of ROLLUP in the GROUP BY clause. Here are the results:
For those with a keen eye you’ll see that I’ve started at row 28 in that screenshot. When we run the GROUP BY without ROLLUP or CUBE we get just 16 rows. With ROLLUP that grows to 23, but with CUBE it explodes out to 57. Why?
It’s because SQL will partition the aggregates by all combinations of the GROUP BY clause. So there are rows like row 29 above, where we see the raw number of people in the County in a State in a given year, we see rows like row 41 which is the same county, this time grouped just on County and Year, rows like row 55 grouping the data on State and Year, rows like row 57 grouping the data on just State, row 51 grouping the data on just the year and row 33 showing all the data summed up.
How is this useful? I think there is great power in this when it comes to users wanting to explore the data. Analysis Services has this great concept that you can drill in and out of data, and it essentially just slices the data by one more or one less thing. I suspect this is why the keyword is CUBE – SQL is producing a data set that is perfect for this sort of query. We’ll discuss that further in Part 3 of this series.
To wrap up this post though, let’s have a quick look at the downside: Performance. While the ROLLUP keyword just added a single extra aggregate, the CUBE keyword, even with a simple dataset like this, explodes out:
You can see that for every part of the GROUP BY clause, SQL spools out another thread to sort the data and then compute the aggregates, essentially doing a ROLLUP on each combination and order of the GROUP BY clause. Looking at the stats, if you just GROUP the data then SQL Server does 26 logical reads, which is the same if you include a single rollup. However if you CUBE the data then that jumps to 65 logical reads. Now on a dataset this size, who cares, but you can see how this could quickly grow out of control.