Lesson 61. Why You Cannot Join On Null Values

This is a fairly common and insidious error because you might not catch the problem if you are working with a large amount of data. Basically, NULL means unknown and unknown cannot equal unknown, so you cannot join records on null values. The best thing to do is not have NULLs in candidate key fields.

When I was working this problem and mentioned this was weird, another data engineer had this to say:

No, it is not weird. It actually follows from the definitions in the ANSI/ISO standards. The in fixed notation that we use now, was actually a result of an article published by Chris Date in one of the database newsstand magazines of the time (he and I had columns in DBMS and Database Programming and Design, and they were both owned by the same publisher at one point).

The actual syntax was not proposed by Date, by one of the ANSI X3H2 committee members. The rule is that infixed joins are evaluated strictly left to right, and the ON clause associates with the nearest join. If I have a really complicated set of inner joins, I prefer to use the original set-oriented notation (no infixed operators). People that use infixed inner join have a mindset locked into + versus "big Sigma" from mathematics.

Examples

A Walk Through Of The Problem And Solution

My expectation here would be to get back results such that I get all the genders and only matching records for toys. However, when I apply a filter on toys, the dataset has a null value for gender ID in the toys table. My left outer join now behaves as an inner join.

In [ ]:
USE demo

DECLARE @genders TABLE(gender_id INT, gender VARCHAR(20))
DECLARE @toys TABLE(toy_id INT, gender_id INT, toy_name VARCHAR(20))

INSERT INTO @genders(gender_id, gender)
SELECT 1, 'boy'
UNION ALL
SELECT 2, 'girl'
UNION ALL
SELECT 3, 'both'

INSERT INTO @toys(toy_id, gender_id, toy_name)
SELECT 1, 1, 'GI JOE'
UNION ALL
SELECT 2,2, 'My Little Pony'


SELECT g.*
FROM @genders g
LEFT OUTER JOIN @toys t
ON g.gender_id = t.gender_id
WHERE t.toy_name NOT IN(
'My Little Pony'
) 

Here is the correct approach to the problem.

In [ ]:
USE demo

DECLARE @genders TABLE(gender_id INT, gender VARCHAR(20))
DECLARE @toys TABLE(toy_id INT, gender_id INT, toy_name VARCHAR(20))

INSERT INTO @genders(gender_id, gender)
SELECT 1, 'boy'
UNION ALL
SELECT 2, 'girl'
UNION ALL
SELECT 3, 'both'

INSERT INTO @toys(toy_id, gender_id, toy_name)
SELECT 1, 1, 'GI JOE'
UNION ALL
SELECT 2,2, 'My Little Pony'


SELECT g.*,t.toy_name
FROM @genders g
LEFT OUTER JOIN @toys t
ON g.gender_id = t.gender_id and t.toy_name NOT IN
('My Little Pony') 

Here is an approach that filters first by encapsulating the filter logic in a CTE.

In [ ]:
USE demo

;with mycte as (
select * from  @toys WHERE toy_name NOT IN('My Little Pony')) 

SELECT g.*, t.toy_name
FROM @genders g
LEFT OUTER JOIN mycte t
ON g.gender_id = t.gender_id

Copyright © 2020, Mass Street Analytics, LLC. All Rights Reserved.