When using the HASHBYTES() function in SQL Server to hash multiple columns for matching, use CONCAT() and separate values with a pipe.
Usually, the values in columns are disparate enough that you really do not have to worry. For example, it was years before I actually discovered an edge case. However, I did discover at least one scenario where concatenating the columns was not enough to develop a unique record. This occurred in a table with few columns and small amounts of data.
A better approach is to just CONCAT() columns with a pipe between values. Since this character is rarely used, it lowers the probability of having different values hash the same.
A Walk Through Of The Problem And Solution
Here is the edge case scenario that got me into trouble.
USE demo
DECLARE @SampleStageTable AS TABLE(ID INT, VALUE1 NVARCHAR(10), VALUE2 NVARCHAR(10))
INSERT INTO @SampleStageTable(ID, VALUE1, VALUE2)
SELECT 1, '012','345'
UNION
SELECT 2, '01','2345'
SELECT * FROM @SampleStageTable
As you can see, record 1 and record 2 are two entirely different records. But look what happens when we hash the row!
USE demo
DECLARE @SampleStageTable AS TABLE(ID INT, VALUE1 NVARCHAR(10), VALUE2 NVARCHAR(10))
INSERT INTO @SampleStageTable(ID, VALUE1, VALUE2)
SELECT 1, '012','345'
UNION
SELECT 2, '01','2345'
SELECT ID, VALUE1, VALUE2, HASHBYTES('MD5', CONCAT(VALUE1, VALUE2)) AS ROWHASH
FROM @SampleStageTable
UH OH! Two different records hashing the same? That's not cool! Let's fix this. This is why we use pipe!
As you can see, the addition of the pipe makes it so the two records can be clearly delineated.
USE demo
DECLARE @SampleStageTable AS TABLE(ID INT, VALUE1 NVARCHAR(10), VALUE2 NVARCHAR(10))
INSERT INTO @SampleStageTable(ID, VALUE1, VALUE2)
SELECT 1, '012','345'
UNION
SELECT 2, '01','2345'
SELECT ID, VALUE1, VALUE2, HASHBYTES('MD5', CONCAT(VALUE1,'|', VALUE2)) AS ROWHASH
FROM @SampleStageTable
Copyright © 2020, Mass Street Analytics, LLC. All Rights Reserved.