Knowing the difference between join types is a very important part of querying a database. The two main joins that I use each day are INNER and LEFT. INNER join means get me everything from table A that has a matching row in table B. LEFT means get me everything from table A and if there is data in table B, then grab that as well.
Where I see confusion is when you throw in a 3rd table
I’m going to use the Stack Overflow database. It has a table called Posts that has all of the questions and answers in it, and a reference back to Posts for the accepted answer. Let’s say we want to get the title of all posts and the date of the accepted answer, if they have one. We might do something like:
When I run that against the 2010 dump of Stack Overflow, I get 1,100,336 records – most of them have an accepted answer date, but a lot don’t.
Now imagine we only want to get the creation date of the accepted answer if the accepted answer has a User that has edited it. If the accepted answer doesn’t have an edit, then we want to ignore it. The default that I often see people go to is to do an inner join from accepted answer to users. It sounds like an inner join – we only want records from accepted answers where edited user is a user:
If we run that, we only get 240,674 records returned, and all of them have a user returned. What happened to the other records? Why did SQL return less?
The reason is that the inner join told SQL to only return rows that join to Users – an inner join only returns records if they match in both tables. So all of the records that don’t have an edited accepted answer are dropped because we have to have an edited user – it effectively turned our left join to get the acceptedAnswer into an inner join.
So how do you get all of the posts, and only accepted answers IF they have an edited user (and the name of the user)? There’s a number of options, we could do a sub query, or a CTE, but I want to suggest my two favourite ways
The Where Clause option is pretty straight forward – either both accepted answer and user need to be null (no accepted answer that has been edited) or both be not null (accepted answer that has been edited).
My favourite is actually the nested join option. The syntax is a bit confusing, but it basically tells SQL that it needs to do an inner join between users and accepted answer – either they both have to be there or neither – and left join the result to the questions. The confusing part of the syntax is that the ON statement for the left join comes after the inner join. I’ve found that although this is confusing to start with, the more I use it the simpler it becomes. It’s a nice clean option that usually optimises well.