Thursday, August 30, 2007

Create Data Disaster: Avoid Unique Indexes – (Mistake 3 of 10)

I really enjoyed Seth Schroeder’s critique of the last post in my ten part data modeling mistake series: Surrogate vs Natural Primary Keys. His argument regarding data migration in particular sheds light on a major shortcoming of using surrogate keys: they lead data modelers to a false sense of security regarding the uniqueness of data. Specifically if modelers ignore uniqueness constraints they allow duplicate data. And as Seth points out this has a nasty side effect of disallowing any clear way to compare data between systems. But there are other problems too.

So, in this post I’ll address the uniqueness problem introduced with surrogate keys by way of an example, I’ll provide two how-to’s, one implementing uniqueness in Visio and one in NHibernate, I’ll explain the difference between unique indexes and unique constraints, and finally I’ll provide reasons why unique indexes might be overlooked, specifically by providing a critique of ORM tools.

Surrogate Keys = Data Disaster?

So as mentioned above the biggest problem with surrogate keys is they lull junior data modelers or lazy developers into thinking they don’t need to worry about indexes. But they do; and it’s as vital as implementing referential integrity. And for the same reason: data integrity.

As an example, imagine you’re modeling a simple Country table. You could of course use CountryName as the primary key, but as you know from my post on surrogate keys, you would have problems with varchar join speed (assuming you disagree with Seth that it’s a premature optimization) and to a lesser extent cascading updates (since country names do occasionally change).

Introducing a surrogate key (CountryId) resolves these issues, but you also remove an inherent advantage that natural keys have: they require uniqueness in country names. In other words you can now have two New Zealand’s and the system wouldn’t stop you.

What’s the big deal? Country seems like a pretty benign table to have duplicates, right? Your users from New Zealand simply have an extra list item in their drop down to pick from and some pick one and some pick the other.

For Country one problem comes in reporting. Consider delivering a revenue by Country report. Your report probably lists New Zealand twice and a quick scan by an exec sees half of the actual revenue for that country that they should. And as a result numerous innocent sheep are slaughtered … uh, or something.

Another major problem could come in syncing data with other systems. How do those systems know which record to use?

As you can imagine the problem is even worse with major entities like Customer, Order, Product, or something more scary like Airline Flights. And the longer the system stays in production, the more production data the system collects, the more duplicates rack up, and the more time and money that will be required to clean up the data when the problem is finally identified. In short the bigger the data disaster.

How To #1: Visio

So the solution is to add at least one unique constraint (or index) to every single table. In other words if you have a table without a uniqueness constraint chances are very good you’ve done something wrong.

The good news is that it’s pretty easy to implement once you agree it’s necessary. If you’re modeling with Microsoft Visio this is a six step process:

  1. Select the table.
  2. Select the “Indexes” category.
  3. Click New.
  4. No need to enter a name, just click OK.
  5. Select either “Unique constraint only” or “Unique index only” (more on this decision later).
  6. Double click the column(s) to add.

Then when you generate or update your database Visio puts in DBMS specific uniqueness constraints. And voila, problem solved.

Unique Constraints vs Unique Indexes

The question will come up when using Visio or perhaps using various DBMS’s including SQL Server whether to use a unique constraint or unique index. The short answer is that most people use unique constraints, but ultimately they’re the same thing so it doesn’t matter.

In case you’re interested in the details though here’s a quick rundown of the differences:

Unique Constraint

  • A logical construct.
  • Defined in the ANSI SQL standard.
  • Intent: data integrity.
  • Usually part of a table definition.

Unique Index

  • A physical DBMS implementation.
  • Not specified in ANSI SQL.
  • Intent: performance.
  • Usually external to a table definition.

But since most DBMS’s implement unique constraints as unique indexes, it doesn’t really matter which you choose.

How To #2: NHibernate

Since I have the pleasure of learning the NHibernate ORM tool on my current project, I thought I’d also describe the same technique with a different tool. Basically you can either set the Unique attribute to true to obtain uniqueness in one column, or set the unique-key attribute to obtain uniqueness among multiple columns. If you use NHibernate mapping attributes you write:

[Property(NotNull = true, Length = 100, Unique = true)]

public virtual string CountryName {

    get { return _strCountryName; }

    set { _strCountryName = value; }

}

Which generates the following hbm:

<class name="Country">
    <id name="CountryId"><generator class="sequence" /></id>
    <property name="CountryName" length="100" not-null="true" unique="true" />
</class>

Which NHibernate turns into the following DDL:

create table Country (

   CountryId NUMBER(10,0) not null,

   CountryName NVARCHAR2(100) not null <b>unique</b>,

   primary key (CountryId)

)

So quick quiz: was that a unique index or unique constraint it generated? If you answered who cares you’re right. However if you answered a unique constraint you’re also right.

The Problem with ORM

Obviously ignorance of the problem and shortsightedness are two causes for systems going into production without unique indexes, but I’d like to point out a third. While Object Relational Mapping (ORM) tools like NHibernate are extremely convenient for generating database schemas, modeling database tables with classes and generating DDL can lead developers to a false sense of purpose.

This can occur because ORM tools focus entirely on the world of objects and classes. In this world data’s persistence is irrelevant. It exists for the purposes of a single operation, and consequently long term data persistence issues like data integrity are deemphasized. In fact, it would be easy to lose perspective of the fact that there is a database at all.

Don’t get me wrong, the benefits you get like mandatory surrogate keys, DBMS neutrality, lazy loading, and minimal data access code are wonderful. Just don’t forget that tags like NHibernate’s unique and unique-key exist. And are very necessary.

Conclusion

To sum it up don’t allow the convenience of surrogate keys to lull you into a false sense of security regarding the importance of keys. It’s a trap. And it will be a disastrous one if you aren’t careful.

6 comments:

Jeff Smith said...

Good post, thank you for addressing this issue.

I have written a recent post that covers composite primary keys, here:

http://weblogs.sqlteam.com/jeffs/archive/2007/08/23/composite_primary_keys.aspx

It addresses lots of the same issues; mainly, just tacking on IDENTITY columns and doing nothing else to ensure that your data has integrity.

Jeff said...

sorry, here it is as an actual link:

http://weblogs.sqlteam.com/jeffs/archive/2007/08/23/composite_primary_keys.aspx

Bruce said...

Good article, but I had to stare at the title for a bit to figure out what you were getting at. Unique indices are good? Bad? Umm, what?

Matheus said...

great posts!

I´d like to know about the next episodes! Are they coming soon? Or is there a place we can check them out?

Thanks in advance!

KEK said...

So where are mistakes 4 through 10? I can only find the first three...

Nick said...

Mistake #4 - there are only three mistakes?