SQL Server is strongly typed and by that, I mean you have A LOT of options when it comes to how you want to store data.
Data types define how information is stored on disk. This is a complex topic and I cannot do it justice in a tutorial. The most important part of understanding data types is that different data types take up different amounts of space in terms of 1s and 0s on disk for the same human readable data point. Put another way, the amount of space that the phrase “Eat my shorts man” takes up on disk will change depending on what data type you select for the column the phrase is stored in.
We are not going to talk about all the data types, just the ones that I think are the most important for you to understand. In general, your overall data type strategy should be defined by these two concepts:
Money
This is the first data type we’re going to talk about because it is the data type that defined my “store data for what it is” ethos. If you are dealing with monetary values, store those values as money.
Some people like to use one of the other decimal data types. Yeah. Don’t do that.
The money data type stores four decimal places which is in line with banking standards. You do not need more precision than that.
CHAR VS. VARCHAR
The difference between these two is CHAR stores a fixed character length. That means if the data you are storing is less than the defined character length, you will be wasting disk space. I use CHAR to store values like flags or codes which I know a priori do not deviate from length.
VARCHAR is more flexible. It flexes depending on the size of what is being stored. The final storage space will be the whatever bytes the value takes up plus two more bytes.
And on that topic, I have a secret to tell you. Both of these data types are defined by a parameter n. It is very easy to think that n represents the number of characters that can be stored. That is FALSE! The value n denotes BYTES! However, when I see a flag of some defined width say two characters, I’ll totally be like CHAR(2) and call it a day. And I’ve been sliding uphill doing that for twenty years.
VARCHAR VS. NVARCHAR
I will make this simple. Make your life easy and store string values in NVARCHAR. It will save you so many headaches. It has to do with ISO standards on encoding. At one point, I knew the deep technical reasons for the difference in the data types, but at the end of the day, 99% of my problems with string data go away when I use NVARCHAR.
When I need to store large amounts of text, I prefer to use NVARCHAR(MAX) over the text data type.
Date And Time
SQL Server has several data types for storing dates. I have a tendency to use datetime2 but will use others to save disk space when the scenario allows for it.
Numeric VS. Float
On the surface, these look like the same thing because they both store decimal values. They are not.
Numeric is the most precise way to store decimal information. Float is only an approximate number data type.
When I need to store decimal values that are not money, I do it with numeric. However, there are some edge cases where I have been forced to use float. These edge cases are really arcane. Almost every scenario has been unique and occur rarely. So rarely that I cannot even bring a specific scenario to mind.
Binary
This is the data type that I use when I want to store values that result from the HASHBYTES function. The output of HASHBYTES generates a character string of fixed width. The width depends on the hashing algo you use. HASHBYTES with binary is a very efficient way to store data.
Integer Data Types
SQL Server has numerous data types for storing integer values. It all depends on how big of an integer you need to store which means understanding the possible ranges of values.
At all times you should be trying to minimize storage space. There have been several instances where I have been comfortable using tinyint.
All of my data warehouse tables have BIGINT as the data type for primary keys. Warehouse tables can get really big and, one day, I actually blew up a primary key that was an int. The upper limit of int is 2,147,483,647. And I blew past that. Switched to BIGINT and have not had a problem since.
Unique Identifier
While I normally use auto incrementing BIGINTs as the primary keys in data warehouses, I will use uniqueidentifier for the primary key of staging tables.
Importing data into staging tables can be a messy process. Having a primary key that was just an auto incrementing value became problematic when the number of staging tables I was dealing with started to get large.
It was possible that several tables might be carrying the same values for primary keys at the same time which made troubleshooting loads difficult.
Switching to a data type that was a globally unique identifier made a lot of headaches go away. Now I could tell for certain which records originated from which tables. I made the switch to using a GUID a few years ago and never looked back.
Copyright © 2020, Mass Street Analytics, LLC. All Rights Reserved.