Distinct List in String_Agg
SQL introduced the new STRING_AGG feature in SQL 2017, and it works just like it suggests it would: it’s an aggregate function that takes all of the string values and joins them together with a separator. To see how it works, I’m using the StackOverflow users table, and let’s say we want to create a list of Display Names and we’re going to group it based on Location:
SELECT STRING_AGG(StackOverflowUsers.DisplayName, ', ')
,StackOverflowUsers.Location
FROM StackOverflowUsers
WHERE StackOverflowUsers.DisplayName LIKE 'Greg D%'
GROUP BY StackOverflowUsers.Location
I’ve limited it to just Display Names starting with ‘Greg D%’ just to cut down our dataset to a manageable 150 records. The result is as expected, all of the display names appear in a string:
Nice and simple, the STRING_AGG function has taken all of the Greg’s and comma separated them by Location. But what if we needed a distinct list of Greg’s in each location. So looking at the top 2 records, what if we only want 1 ‘Greg D’ in location ‘NULL’ and 1 ‘Greg D’ in location ‘Seattle, WA, USA’? We can’t just take a top 1, because we still want both ‘Greg Dunn’ and ‘Greg Dean’ in the location United States. How do you get a distinct list in STRING_AGG?
I want to propose 2 ways, the first is the way you probably should do it, the other…well, you’ll see.
The Nicer Way
The better way is to get the list of strings you want to be distinct first, then run the STRING_AGG over the top of it. We know we only want to see each name for each location once, so we could do a select distinct or a group by to achieve that:
SELECT StackOverflowUsers.DisplayName
,StackOverflowUsers.Location
FROM StackOverflowUsers
WHERE StackOverflowUsers.DisplayName LIKE 'Greg D%'
GROUP BY StackOverflowUsers.DisplayName
,StackOverflowUsers.Location
Once we have our unique list of users and locations, we can then group them all up using STRING_AGG. You could put this in a sub query, a CTE, or a join. You could also dump the output into a temp table or table variable (always prefer a temp table). For now, I’m just going to make this the table we’re selecting from:
SELECT STRING_AGG(innerQuery.DisplayName, ',')
,innerQuery.Location
FROM
(
SELECT StackOverflowUsers.DisplayName
,StackOverflowUsers.Location
FROM StackOverflowUsers
WHERE StackOverflowUsers.DisplayName LIKE 'Greg D%'
GROUP BY StackOverflowUsers.DisplayName
,StackOverflowUsers.Location
) AS innerQuery
GROUP BY innerQuery.Location
Straight away it’s easier to read, and if we have a look at the query plan, SQL is much happier with life:
So if you need a distinct list of comma separated values this is the way to go.
Ugly Way
Ok, time for some bad ideas. If we can’t do the above, we could join all the names together, then we can split them based on the delimiter, take a distinct list, and then re-join them all together. Sound confusing? Yep, it is. Let’s have a look at the code to see if that makes any more sense:
SELECT
(
SELECT STRING_AGG(innerAgg.Result, ',')
FROM
(
SELECT DISTINCT
value
FROM STRING_SPLIT(STRING_AGG(StackOverflowUsers.DisplayName, ','), ',')
) AS innerAgg(Result)
)
,StackOverflowUsers.Location
FROM StackOverflowUsers
WHERE StackOverflowUsers.DisplayName LIKE 'Greg D%'
GROUP BY StackOverflowUsers.Location
So the magic is in the first column we’re selecting. As you can see, we’re selecting a STRING_AGG which is aggregating all of the strings from innerAgg.Result. What’s innerAgg.Result you ask? We’ll, it’s the output of the next inner query. That query is selecting the distinct values from splitting a string. What are we splitting? Another good question, top of the class! We’re splitting the String Aggregate of DisplayName. So SQL will put all of them into a string, split them all out, select the distinct values, then stick them all back in together again. Surprisingly, SQL does this fairly well. It already has all the data from the Users table, so it just runs over the top and does the leg work.
There’s plenty of problems with doing it this way, one of which is that it doesn’t cater at all for someone who has a comma in their name. So why on earth would I pick this option? Well…there was this one time where we had some tables being joined together, all hidden in a view, and the only thing I was allowed to change was the 1 column in the select list. Did it work? Yep. Was I happy with it? Secretly I was. Would I want to do it again? Nope. Stick with the first option.
Let me know in the comments if you have another way of aggregating a distinct list of strings or if you have a worse way than this?
Getting Distinct Values before STRING_AGG – Curated SQL
Aug 8, 2021, 10:10 pm[…] Greg Dodd shows how to remove duplicate values from a list before passing them to the STRING_AGG() f…: […]
Sajan Mani
May 5, 2022, 11:08 amAwesome query, helped me big time with a query I had to write.
Thank you!
Karin van Vaalen
Aug 8, 2022, 8:06 pmThanks a lot! Exactly what I was looking for.
Markie
Aug 8, 2022, 6:37 amAwesome!
Dima
Jun 6, 2023, 10:44 pmThe ugly way is not really ugly. Especially if you need STRING_AGG in more complex query than just Select Id, Name…
You can use other separator instead of comma (like | (Vertical bar)) vhich is not often present in any data.
The first apprach seems good in this simple example, but is not good in more complex query (like Select col1, col2, … col10, Sum(col11) )
You’ll need to make key column and join (select keyCol, AggDistinct) to your main request.
John Smith
Jun 6, 2024, 6:14 amThis was very helpful, specifically with getting a distinct list.
As it happens, Nicer didn’t work for me, and Ugly was just beautiful. Can you rename the sections accordingly please … 🙂
Thanks again.
kwikgal
Jul 7, 2024, 9:30 amThat’s just what I was looking for. As Dima says, the second method can be useful in more complex situations. I ended up using the not-so-ugly way.