Only Left Joining based on Subsequent Inner Join

By Greg 2 comments

This is a bit of a tricky one to explain why you would use it, so let’s just go straight to an example. I’ve created the Schools database using this script from Microsoft, and then I’ve made some changes to it to normalise the Person table:

CREATE TABLE Instructor
(
	PersonId INT NOT NULL PRIMARY KEY,
	HireDate DATETIME
)

CREATE TABLE Student
(
	PersonId INT NOT NULL PRIMARY KEY,
	EnrollmentDate DATETIME
)


INSERT INTO Instructor (PersonId, HireDate)
SELECT Person.PersonID, Person.HireDate
FROM Person
WHERE Person.Discriminator = 'Instructor'


INSERT INTO Student (PersonId, EnrollmentDate)
SELECT Person.PersonID, Person.EnrollmentDate
FROM Person
WHERE Person.Discriminator = 'Student'

ALTER TABLE Person
DROP COLUMN EnrollmentDate, HireDate

Nothing special, just moving the EnrollmentDate and HireDate into their own tables.

For this made up example, lets say I want to select all of the Courses on offer, and display the name of the instructor IF the instructor was hired before 1st January 2000. My returned dataset should show the Course Title and the Instructors name IF the instructor was hired before 2000.

There’s lots of ways to do this, but let’s work through it together. First up, we need to join all the tables together. We know that we need the Course Title and the Person’s name, and we’ll need to hit the instructors table for the hire date, and that we should use some left joins because we always want to get the Course Title.

SELECT		Course.Title, Person.FirstName, Person.LastName, Instructor.HireDate
FROM		Course
LEFT JOIN	CourseInstructor
	   ON	CourseInstructor.CourseID = Course.CourseID

LEFT JOIN	Person
	   ON	Person.PersonID = CourseInstructor.PersonID

LEFT JOIN	Instructor
	   ON	Instructor.PersonId = Person.PersonID

That gets us most of the way. We now have all of the data, but the question is, how do we stop the Person’s name coming back if the HireDate is after 1 Jan 2000?

We can’t simply put it in a where clause, because then the Course Title won’t come back. We could put it as part of the join condition to Instructor, but that will only limit the hire date coming back. We could use that to build a case statement around the FirstName and LastName fields, but what other options are there?

What we want to do is limit the join to Person to only return if Instructor returns, and then we could put the join condition on Instructor. Something like this, but it doesn’t work:

SELECT		Course.Title, Person.FirstName, Person.LastName, Instructor.HireDate
FROM		Course
LEFT JOIN	CourseInstructor
	   ON	CourseInstructor.CourseID = Course.CourseID

LEFT JOIN	Person
	   ON	Person.PersonID = CourseInstructor.PersonID
	   AND	Instructor.PersonId IS NOT NULL

LEFT JOIN	Instructor
	   ON	Instructor.PersonId = Person.PersonID
	   AND	Instructor.HireDate < '1 Jan 2000'

We can do something like this in a lot of different ways. What we need to do is perform an inner join between Person and Instructor, without affecting the left join between CourseInstructor and Person. It turns out we can tell SQL that’s what we want in a lot of different ways. A common way would be to just put the join between Person and Instructor in a subquery, something like:

SELECT		Course.Title, Person.FirstName, Person.LastName, Person.HireDate
FROM		Course
LEFT JOIN	CourseInstructor
	   ON	CourseInstructor.CourseID = Course.CourseID

LEFT JOIN	(
	SELECT	Person.*, Instructor.HireDate
	FROM	Person
	INNER JOIN	Instructor
		   ON	Instructor.PersonId = Person.PersonID
	WHERE Instructor.HireDate < '1 Jan 2000'
	) Person
ON	Person.PersonID = CourseInstructor.PersonID

And that would certainly work. But it hides away all of the columns that you might want to reference. Now if I want to get HireDate, I have to hit the alias for the sub query. In reality, this isn’t a big deal. But is there another way? Yes! Turns out, all we need to do is move the ON clause for Person to be after the join to Instructor. Let’s take a look:

SELECT		Course.Title, Person.FirstName, Person.LastName, Instructor.HireDate
FROM		Course
LEFT JOIN	CourseInstructor
	   ON	CourseInstructor.CourseID = Course.CourseID

LEFT JOIN	Person
	INNER JOIN	Instructor
		   ON	Instructor.PersonId = Person.PersonID
		   AND	Instructor.HireDate < '1 Jan 2000'
ON	Person.PersonID = CourseInstructor.PersonID

This will give you the same result as the subquery, and almost the exact same query plan (it has to add a compute scalar to your plan). SQL does the work of building and running the subquery for you, so you can keep your code clean. I’ve found this to be a reliable option that produces the same plan shape as a subquery, and generally performs just as well as a subquery. I prefer the readability of this, as the select columns tell you immediately where the data is coming from. HOWEVER, this isn’t how SQL is normally written. I have had code reviewers send it back as incorrect, I have had other developers and DBAs confused by what it does. Personally, I like it, but as it’s a bit different from what people are used to, you may want to stick to the normal subquery pattern.

2 Comments

Double-Join Syntax in T-SQL – Curated SQL

Apr 4, 2021, 10:05 pm

[…] Greg Dodd unleashes a monster: […]

Charley Hearn

Apr 4, 2021, 10:16 pm

Great post! I prefer this as well. It was introduced to me in a defunct blog post by Joe Chang as a bushy join. This is because it helps push operators down in the plan making it look more like a bush instead of a tree.

Leave a Reply