Below is the transcript of a fascinating discussion I had on how the WHERE clause actually functions.
Bob
I’ve been out in industry for over 10 years now doing DB work most of that time. Lately I’ve been doing some work that is starting to challenge my long held assumptions about the internal workings of SQL Server, specifically what exactly is happening when I do joins and where clauses.
I was explaining to someone that when doing a join putting in a where clause limits the amount of records that are available to do the join on thereby making the query faster. My thought was if a table has 100MM records but only 50MM of those meet the WHERE criteria then the JOIN clause wouldn’t try matching the 50MM records that don’t meet the WHERE clause.
I did some work that afternoon and the empirical results didn’t match the expected outcome of a query I had written in terms of execution time. I don’t think my understanding of what goes on during a JOIN is correct. Can someone educate me?
Person 1
There is a definition of the order of the process of JOINs and the WHERE clause. Then there is what SQL actually does. These may be different.
But the definition is the FROM clause including any JOINs is done first. In general, if you have multiple JOINs, they are done left to right in the order of the ON clauses. So, if you have
FROM A
LEFT JOIN B
INNER JOIN C
ON B.ID = C.ID
ON A.ID + B.ID
then first B is joined to C and then that result is joined to A. Of course, in most cases people write the ON clause immediately after the JOIN, so you will see something like
FROM A
LEFT JOIN B
ON A.ID = B.ID
LEFT JOIN C
ON A.ID = C.ID
And, the rules of which join is done first can become complex if you are mixing both OUTER JOINs and CROSS JOINs.
But the WHERE is done after all of the JOINs have been processed.
But that's only the definition of what is theoretically done. When the query is actuality done, SQL can (and does) process the query in any manner it thinks is most efficient as long as the result is the same as if it had followed the above definition. In particular, if your query has only INNER JOINs and CROSS JOINs and a WHERE clause, then the order things are done will make no difference in the result. So in that case, SQL will do the joins and where in any order it thinks is fastest. So that could mean doing the joins first, or doing the where first, or doing part of the where first, followed by some of the joins followed by the rest of the where clause followed by the rest of the joins or whatever SQL thinks will be fastest given the available indexes and statistics.
Person 2
The important concept here is that in SQL you specify the result you want, not the algorithm to create that result. It's helpful to think about the FROM clause executing first, followed by the WHERE clause in order to reason about the results and the correctness of your query. But that's not how SQL Server actually runs the query.
The Query Optimizer looks as the query, the join criteria and the where clauses and considers several different algorithms (query plans) to get the specified result. It assigns a "cost" to each potential plan and then executes whichever plan has the lowest cost.
So here it might start with a where clause, or it might start by joining tables and apply the where clause criteria later.
Person 3
Here is how a SELECT works in SQL ... at least in theory. Real products will optimize things, but the code has to produce the same results.
As you can see, things happen "all at once" in SQL, not "from left to right" as they would in a sequential file/procedural language model. In those languages, these two statements produce different results:
READ (a, b, c)
FROM File_X;
READ (c, a, b)
FROM File_X;
While these two statements return the same data:
SELECT a, b, c
FROM Table_X;
SELECT c, a, b
FROM Table_X;
Think about what a confused mess this statement is in the SQL model.
SELECT f(c2) AS c1, f(c1) AS c2
FROM Foobar;
That is why such nonsense is illegal syntax.
Bob
I have a ton of books on T-SQL but none of them go into the real details of how SQL Server does it's thing. If I take all of your comments this is my take away. Let me know if you agree.
Person 1
Copyright © 2020, Mass Street Analytics, LLC. All Rights Reserved.