Distinct List in String_Agg

By Greg 5 comments

SQL introduced the new STRING_AGG feature in SQL 2017, and it works just like it suggests it would: it’s an aggregate function that takes all of the string values and joins them together with a separator. To see how it works, I’m using the StackOverflow users table, and let’s say we want to create a list of Display Names and we’re going to group it based on Location:

SELECT		STRING_AGG(StackOverflowUsers.DisplayName, ', ')
		,StackOverflowUsers.Location
FROM		StackOverflowUsers
WHERE		StackOverflowUsers.DisplayName LIKE 'Greg D%'
GROUP BY	StackOverflowUsers.Location

I’ve limited it to just Display Names starting with ‘Greg D%’ just to cut down our dataset to a manageable 150 records. The result is as expected, all of the display names appear in a string:

Nice and simple, the STRING_AGG function has taken all of the Greg’s and comma separated them by Location. But what if we needed a distinct list of Greg’s in each location. So looking at the top 2 records, what if we only want 1 ‘Greg D’ in location ‘NULL’ and 1 ‘Greg D’ in location ‘Seattle, WA, USA’? We can’t just take a top 1, because we still want both ‘Greg Dunn’ and ‘Greg Dean’ in the location United States. How do you get a distinct list in STRING_AGG?

I want to propose 2 ways, the first is the way you probably should do it, the other…well, you’ll see.

The Nicer Way

The better way is to get the list of strings you want to be distinct first, then run the STRING_AGG over the top of it. We know we only want to see each name for each location once, so we could do a select distinct or a group by to achieve that:

SELECT		StackOverflowUsers.DisplayName
		,StackOverflowUsers.Location
FROM		StackOverflowUsers
WHERE		StackOverflowUsers.DisplayName LIKE 'Greg D%'
GROUP BY	StackOverflowUsers.DisplayName
		,StackOverflowUsers.Location

Once we have our unique list of users and locations, we can then group them all up using STRING_AGG. You could put this in a sub query, a CTE, or a join. You could also dump the output into a temp table or table variable (always prefer a temp table). For now, I’m just going to make this the table we’re selecting from:

SELECT		STRING_AGG(innerQuery.DisplayName, ',')
		,innerQuery.Location
FROM
		(
			SELECT	StackOverflowUsers.DisplayName
				,StackOverflowUsers.Location
			FROM	StackOverflowUsers
			WHERE	StackOverflowUsers.DisplayName LIKE 'Greg D%'
			GROUP BY StackOverflowUsers.DisplayName
				,StackOverflowUsers.Location
		) AS innerQuery
GROUP BY	innerQuery.Location

Straight away it’s easier to read, and if we have a look at the query plan, SQL is much happier with life:

It’s so much cleaner

So if you need a distinct list of comma separated values this is the way to go.

Ugly Way

Ok, time for some bad ideas. If we can’t do the above, we could join all the names together, then we can split them based on the delimiter, take a distinct list, and then re-join them all together. Sound confusing? Yep, it is. Let’s have a look at the code to see if that makes any more sense:


SELECT
	(
		SELECT	STRING_AGG(innerAgg.Result, ',')
		FROM
		(
			SELECT	DISTINCT
				value
			FROM	STRING_SPLIT(STRING_AGG(StackOverflowUsers.DisplayName, ','), ',')
		) AS innerAgg(Result)
	)
	,StackOverflowUsers.Location
FROM	StackOverflowUsers
WHERE	StackOverflowUsers.DisplayName LIKE 'Greg D%'
GROUP BY StackOverflowUsers.Location

So the magic is in the first column we’re selecting. As you can see, we’re selecting a STRING_AGG which is aggregating all of the strings from innerAgg.Result. What’s innerAgg.Result you ask? We’ll, it’s the output of the next inner query. That query is selecting the distinct values from splitting a string. What are we splitting? Another good question, top of the class! We’re splitting the String Aggregate of DisplayName. So SQL will put all of them into a string, split them all out, select the distinct values, then stick them all back in together again. Surprisingly, SQL does this fairly well. It already has all the data from the Users table, so it just runs over the top and does the leg work.

Ok, it’s ugly

There’s plenty of problems with doing it this way, one of which is that it doesn’t cater at all for someone who has a comma in their name. So why on earth would I pick this option? Well…there was this one time where we had some tables being joined together, all hidden in a view, and the only thing I was allowed to change was the 1 column in the select list. Did it work? Yep. Was I happy with it? Secretly I was. Would I want to do it again? Nope. Stick with the first option.

Let me know in the comments if you have another way of aggregating a distinct list of strings or if you have a worse way than this?

5 Comments

Getting Distinct Values before STRING_AGG – Curated SQL

Aug 8, 2021, 10:10 pm

[…] Greg Dodd shows how to remove duplicate values from a list before passing them to the STRING_AGG() f…: […]

Sajan Mani

May 5, 2022, 11:08 am

Awesome query, helped me big time with a query I had to write.

Thank you!

Karin van Vaalen

Aug 8, 2022, 8:06 pm

Thanks a lot! Exactly what I was looking for.

Markie

Aug 8, 2022, 6:37 am

Awesome!

Dima

Jun 6, 2023, 10:44 pm

The ugly way is not really ugly. Especially if you need STRING_AGG in more complex query than just Select Id, Name…
You can use other separator instead of comma (like | (Vertical bar)) vhich is not often present in any data.
The first apprach seems good in this simple example, but is not good in more complex query (like Select col1, col2, … col10, Sum(col11) )
You’ll need to make key column and join (select keyCol, AggDistinct) to your main request.

Leave a Reply