Thursday, August 16, 2007

Surrogate vs Natural Primary Keys – Data Modeling Mistake 2 of 10

In case you’re new to the series I’ve compiled a list of ten data modeling mistakes that I see over and over that I’m tackling one by one. I’ll be speaking about these topics at the upcoming IASA conference in October, so I’m hoping to generate some discussion to at least confirm I have well founded arguments.

The last post in this series Referential Integrity was probably less controversial than this one. After all, who can argue against enforcing referential integrity? But as obvious as surrogate keys may be to some, there is a good deal of diversity of opinion as evidenced by the fact that people continue to not use them.

I intend to address this topic by way of a fairly ubiquitous example that should draw out all of the arguments. I’ll investigate the options for primary keys in a Person table. I’ll provide four possible options and explain why each of them is a bad choice. I’ll then give four arguments against surrogate keys, which I will then shoot down. So without further ado:

Contender 1: Name and Location

Of course you can’t use name as a primary key for a person, because two people can have the same name and primary keys must be unique. But all too often I’ve seen databases with multiple, sometime numerous, natural (or business-related) primary keys. These databases combine a field like name with enough other fields such that the likelihood of uniqueness is approaching certainty.

In the case of person this would be equivalent to First and Last Name (wouldn’t want to violate first normal form by combining those into one field, but that’s a whole other topic), zip code, and we ought to throw address line 1 in to be safe. This is known as either a compound, composite, or multicolumn index.

Now our chances of uniqueness are close enough to certain to not warrant discussion, so let’s jump right to space and performance. There are three major problems with this approach.

Con 1: Primary key size. The primary key index for person becomes enormous. The database must now catalog four large (probably all varchar) fields. This increases the size of the index which increases overhead for insert, delete and update operations and can even decreases read speed because of the increased disk I/O.

Con 2: Foreign key size. If you have a child table like PhoneNumber then, as the diagram above shows, the foreign key becomes four columns. Those four columns take up a lot of space again. And now a common query like “Get all phone numbers for a person” involves a full table scan, or, if you throw an index on them you end up with another huge index. In fact, you most likely end up propagating huge indexes and vast amounts of data all over the place like some evil data-cancer.

Con 3: Asthetics. It just isn’t pretty. Having four column foreign keys all over the place increases the amount of code you need to write in stored procedures, middle tier, and presentation tier. Even intellisense won’t help you with this one.

Contender 2: Social Security Number

The most obvious choice for a natural key for a person object is social security number, right? Obviously it depends on what type of data person is, but regardless you’ll probably face the following four problems with this primary key candidate:

Con 4: Optionality. The social security administration specifies that U.S. citizens are not required to provide social security numbers in many circumstances. While employment is one of these circumstances, consumers of non-governmental services are definitely not. You can deny service if your consumer won’t provide the number, but is your CEO prepared to turn away business based on a data modeling decision you make?

Con 5: Applicability. Only U.S. citizens have a social security number. Your system might only cater to U.S. citizens now, but will it always?

Con 6: Uniqueness. The social security administration “is adamant” that the numbers are not recycled, even after someone dies. But eventually the numbers will run out. If you visit the slate article cited above, it calculates this date as in the next century. But the math fails to include the fact that location information is encoded in the number which significantly limits the permutations. I don’t know what the real number is, but the point is: you’re gambling with how long until a conflict occurs. And even if the time argument fails to sway you, just think of who is assigning the numbers. How much do you trust a government office to not make an occasional mistake?

Con 7: Privacy. Does your application use primary keys in the user interface tier to uniquely identify records? Does it pass primary keys between pages or use them to identify rows in a combo box? You certainly wouldn’t store such a thing in a cookie or pass it across the wire unencrypted right? Social security information is sensitive information and privacy zealots care very much how you handle their data. Primary keys are necessarily are closer to end users and harder to hide than regular fields. It just isn’t the type of data to take a chance on.

Contender 3: E-mail

So e-mail is a pretty likely choice right? It’s a relatively safe assumption that no two people share an e-mail (maybe). And anyone with a computer has one right? So there should be no uniqueness, privacy or optionality/applicability problems. But how about this:

Con 8: Accidental Denormalization. Do you have more than one e-mail address? Doesn’t everyone? Imagine what a pain it would be if Evite only allowed you one e-mail address per person (ok, well if you didn’t know it does allow you to consolidate accounts for those of us with multiple e-mail addresses). Even if your system only stores one e-mail address per person now, just think what a pain it would be to change the database to allow N e-mail addresses per person.

No… Wait. Really. Think about it...

Yea … yuck.

Contender 4: Username

If your users log in with a username, that’s a likely candidate for a primary key right? But what if they want to update their username (perhaps it was based on a last name that changed). This leads us to:

Con 9: Cascading Updates. If you have a natural key that might change you’ll need to implement some type of cascading updates (whether your DBMS supports it or you write code by hand). In other words, change the username in the person table and you have to change the username foreign key in all child records of the invoices, comments, sales, certifications, defects, or whatever other tables you track. It may not happen often, but when it does it sure will wreak havoc on your indexes. Imagine rebuilding even 10% of your indexes at once because of one operation. It’s just unnecessary.

Con 10: Varchar join speed. I left this to last because it applies to all of contenders thus far and is by far the most compelling argument against natural keys. Nine out of ten natural keys are varchar fields. Even an employee number as generated by another system may have a significant zero. It’s a fact: joining across tables with varchars is always slower than joining across tables with integers. How much? According to Peter Zaitsev who runs a MySql performance blog it’s 20% to 600% slower. And that’s for one join. How many joins do you think comprise an average user interaction? Five? Ten? Twenty? It could very likely make a significant difference to your end user.

And The Winner Is

So surrogate keys win right? Well, let’s review and see if any of the con’s of natural key’s apply to surrogate keys:

  • Con 1: Primary key size – Surrogate keys generally don't have problems with index size since they're usually a single column of type int. That's about as small as it gets.
  • Con 2: Foreign key size - They don't have foreign key or foreign index size problems either for the same reason as Con 1.
  • Con 3: Asthetics - Well, it’s an eye of the beholder type thing, but they certainly don’t involve writing as much code as with compound natural keys.
  • Con 4 & 5: Optionality & Applicability – Surrogate keys have no problems with people or things not wanting to or not being able to provide the data.
  • Con 6: Uniqueness - They are 100% guaranteed to be unique. That’s a relief.
  • Con 7: Privacy - They have no privacy concerns should an unscrupulous person obtain them.
  • Con 8: Accidental Denormalization – You can’t accidentally denormalize non-business data.
  • Con 9: Cascading Updates - Surrogate keys don't change, so no worries about how to cascade them on update.
  • Con 10: Varchar join speed - They're generally int's, so they're generally as fast to join over as you can get.

For every natural key con I see a surrogate key pro. But not everyone agrees. Here are some arguments against them.

Disadvantage 1: Getting The Next Value

Some have argued that getting the next value for a surrogate keys is a pain. Perhaps that’s true in Oracle with its sequences, but generally it just takes a couple minutes research, or you can use ORM tools to hide the details for you.

Disadvantage 2: Users Don’t Understand Them

One argument I uncovered is if users were to perform ad-hoc queries on the database they wouldn’t be able to understand how to use surrogate keys.

Bunk. Bunk, bunk, bunk. End users shouldn’t be fiddling in databases any more than airline customers should be fiddling in airplane engines. And if they are savvy enough to try, then let them learn to perform joins like the pros do.

Disadvantage 3: Extra Joins

Suppose you have users table with a social security number natural primary key, and a phone number child table with social security as a foreign key.

If your user enters a social security number on a log in screen you could theoretically get their phone numbers without accessing the users table. In a surrogate key world you would have to look up the surrogate key in the person table before getting their phone numbers.

While this is true, I have found that with most CRUD applications there are few times when this scenario comes up. The vast majority of queries involve already known surrogate keys. So while this argument may be true in some situations, it just isn’t true enough of the time to be significant.

Disadvantage 4: Extra Indexes

I find this to be the most persuasive argument against natural keys. If your person object would normally have a natural key on social security number, then in surrogate-world you should have a unique index on social security number in addition to your primary key index on the surrogate key. In other words, you now have two indexes instead of one. In fact, if you have N indexes per table in natural key world, you’ll always have N + 1 indexes in surrogate key world.

While the additional indexes do indeed add indexes, which increase database size, and slow insert and update performance, you could offset some of that expense by converting your old natural key, social security number for example, to a clustered index.

Or you could just relax in the knowledge that there are pro’s and con’s to every architectural decision and for surrogate keys the pro’s outweigh the con’s.

Summary

So now if some well meaning DBA argues to use natural keys on your next project you should have ten arguments against them, which will double as ten arguments for surrogate keys, and you should be prepared with rebuttals for four arguments against surrogate keys. Whew, that was a lot. But I assure you, if you use surrogate keys today it will definitely make your life easier in the long run.

14 comments:

Anonymous said...

I put a lot of time into this response. Unfortunately it went late into the night and a few details are off. Hopefully it will still be of some use.

Seth

Haacked said...

Not to mention that in some circumstances, a person can change his or her SSN number. For example, if it's been used fraudulently.

Anonymous said...

"if you use surrogate keys today it will definitely make your life easier in the long run."

But not if you use sequence numbers generated by the database server.

Because in the long run all databases and their applications are expected to be distributed on the network.

And sequence numbers used as unique resource identifiers (surrogate primary keys) cannot practically be distributed in a network, locking your applications in the single database server model ... and its implementation.

One way to have best of both world is to use a standard hash function (SHA1 for instance) and let your network applications generate surrogate primary keys, not in the database server.

Lucas Goodwin said...

To the previous about SHA1. Isn't that what the Unique Identifier/GUIDs are for?

There is one area that ALL DBAs need to value highly (And few do) with software development (The primary consumer of all DBs) and that's change. Specifically ease of change.

My experience (We rarely have a single application not change dramatically atleast every couple of months) has been that the fewer primary keys in a table the easier to refactor the data-model as requirements change.

We've found that just using one column for a primary key per any given table meets the needs to maintain integrity while allowing change and migration to be far less painful in the long run. Even if the index is completely arbitrary and has nothing to do with the actual data.

Even the most complex of Data-migration/normalization programs only require a few thousand lines of code to create, but if the DB is enforcing relationships so tightly a Nun would be envious, say good bye to easy change.

I also can't stress enough how many times I've been bit in the butt by the assumption that a particular chunk of data will always be unique just to have a change request come along later to allow duplicate entries for that same chunk of data.

Joe said...

Getting the next value of a surrogate key really should not be considered a disadvantage in my opinion. It is simple enough to provide the SCOPE_IDENTITY() as an output parameter of the stored procedure performing the insert(s).

If getting the next key value is that much of a stumbling block for a developer who *must* set the key client-side, fine...then you just use GUIDs as the surrogate key.

Ian Batty said...

Surrogate/ Natural Keys is an argument that will never die. In fact, I would go as far to say that the 2 camps are as strident as a belief system. Or, to use another word: A religion.

Just as it is extremely rare* for a fervent believer in one religion to convert to another so it is almost impossible to convince a natural key Data Modeller to use surrogate keys or vice versa.
*Note: I state it is not impossible, just rare.
Note 2: I don't mean non-believers picking a non-standard religion either: I am talking hard core christian becoming muslim or similar

Personally I am a natural key guy. Saying that, and this is where your example loses credibility, practical experience has shown that for a person there is no natural key, so I create a surrogate. However, if I am creating an entity that has a true unique identifier, that doesn't change and meets all the other requirements then I will use it.

Now, many of those unique identifiers are in fact not 'natural' because they are artificially generated e.g. a legal Document ID but they are generated outside my system and as such I will not create an additional surrogate inside. This means that my PK may be a different data type in different entities (string vs number) but that is a price I am willing to pay

So, I consider myself to be a natural key modeller, but I will use a surrogate key where practical.

But then I don't believe in God either

Ian
maguffyn@yahoo.co.uk

website design nyc said...

nice post

Luke Kneller said...

Just use both for god's sake. There are places in databases where Composite keys are much much better. Others were using a surrogate is far superior.

Is the use of one over the other exclusive? I think you'll find it isn't.

Member or people details should be surrogate. ID's, auto generated. I agree. It would be stupid otherwise. But using a Surrogate on a forum relationship would be stupid.

New post would be a surrogate, person who read such a post would use a composite between post_id and member_id. A surrogate in that situation would be wasteful and stupid.

Anonymous said...

I got issued the same SSN as someone else and was required to change it (12 years later =)

Anonymous said...

Great post. I am currently in the process of trying to persuade many people that surrogates are better than naturals. They present the *exact* same arguments to me that you mention here.

Another nasty that I've come across when end users get their grubby paws on a table with a meaningful lookup FK (by this I mean a child table with a column referring to a parent lookup table with a natural PK) is that since they 'do not need' to look at the parent lookup table (since the key is meaningful) they will quite happily perform a "select distinct " from the child table to get all possible variants of that column.

Horrible horrible.

Jimmy V said...

Surrogate keys work well in this type of situation where a natural key is ambiguous, at best. However, when the natural key is concrete and static, using natural keys is the way to go. For example in the case of a vehicle, each vehicle has a VIN number, houses have addresses, or lot numbers assigned them via the county.

I guess keeping to one dogma or the other is just like a belief system where you can get rigidly stuck in one train of thought, unable to see benefits of the other and the shortcomings of yours.

What this comes down to is the data model that you are currently designing; we are all professionals and should be able to decide when to use which design scheme!

El Oso said...

Sorry sir, if you came up with a surrogate key the answer to your dilema, it´s only because you either failed to model your real world entity, or just wanted to make things easier for you by throwing away a couple of good design practices.
On your side, you should also take into consideration that surrogate keys migth even be a real world element that you are failing to identify.
In the long run, nothing but good design practices can make your application sustainable.
Respectfully,

Anonymous said...

Stumbled on this article, and glad I did, very interesting discussion!

I learned a lot, thanks! This post here is good too on natural keys:

http://www.programmerinterview.com/index.php/database-sql/natural-key-in-database/

Dejan Lekic said...

Roy Hann wrote a pretty good set of articles that argue *against* surrogate keys. Here is the first one: http://community.actian.com/forum/blogs/rhann/124-surrogate-keys-part-1-best-practice-wrong-headed-dogma.html

I used surrogate keys a lot in the past, but with time I started using natural keys more and more because simply I like being able to **understand** data in the database, not just seeing numbers...