Wednesday, July 18, 2007

Referential Integrity - Data Modeling Mistake 1 of 10

In my mind data models are like the foundations of a house. Whether you use ORM or a more traditional modeling tool, they form the base of the entire rest of your project. Consequently, every decision you make (or don’t make) regarding your data model during the design phase(s) of your project will significantly affect the duration of your project and the maintainability and performance of your application.

You could de-emphasize up-front planning, but every correction you make to the data model once code has been written on top of it will introduce significant delays to the project as developers refactor data access, business logic, and user interface tiers. That’s why mistakes made during design are expensive, and it would behoove any architect (or project manager) to be well aware of the repercussions of data model decisions and minimize mistakes before construction begins.

After years of working with or maintaining applications based on poorly designed data models, and after years of modeling my own databases from scratch I’ve seen and made a lot of mistakes. So, I’ve compiled ten of the most common ones and the arguments for and against them.

I’ll be speaking on this topic in the upcoming IASA conference in October, and so I wanted to vet these ideas with the community. I know there are strong feelings on these topics, so please help me out by commenting if you feel I’ve missed something or am off base.

I’ll start with Mistake #1: Not Using Referential Integrity in this post. I'll give four common reasons for avoiding referential integrity and then rebuff them. I'll then cover the more controversial Mistake #2 Not Using Surrogate Keys in my next post.

Mistake #1 – Not using referential integrity

I’ve heard a lot of excuses for not using referential integrity, but I’ve never been swayed by one of them. If you have a record with a foreign key field you should be 100% certain that it will always refer to the primary key of an existing record in one and only one foreign table. The last thing you want to do is write large amounts of conditional logic because you aren’t 100% certain that you aren’t dealing with orphaned data. Nonetheless, here are some almost compelling arguments I’ve heard for not using it:

Reason #1: Project Too Small

If your project or database is only a few tables and a couple lines of code then you don’t need referential integrity right? Wrong, numerous projects start small, get big, and have major problems because of it. It doesn’t take much extra time to put in constraints. Avoid the urge to be lazy.

Reason #2: Accidental Oversight

Numerous applications I’ve seen forget a relationship or two. This is borne of writing and executing database creation statements by hand and is the reason that data modeling tools exist. When you visualize your database in a model it’s hard to miss a relationship. So use a modeling tool and keep it in sync with your database, you won’t regret it.

Incidentally I like Microsoft Visio for data modeling because you can change your schema during development and Visio won’t delete your data. This enables you to keep your data model in sync with the database for the entire lifetime of the database. There are other benefits too, if you’re interested see my article on data modeling in Microsoft Visio.

Reason #3: Maximize Insert Speed

It’s a fact: indexes and constraints slow down insert and update operations. If your application is heavy on writing and light on reading, then you could argue referential integrity isn’t for you. This argument is often combined with the “Only one application ever uses my database” argument.

There are two problems with this. One problem comes when either a well meaning DBA modifies data by hand and messes up the state of the database, or more realistically when there’s a bug in the application that accidentally orphans data. Orphaned data may not affect your application, but a well designed solution should plan for the future. When that data warehouse project finally gets around to importing data from your database, what do they do with the orphaned data? Ignore it? Try to integrate it? Who knows? If you’ve been in this position, you’ll know what I mean when I say the responsible architect’s name (or their app) will be synonymous with a curse word.

The second problem is that even if a database without referential integrity don’t end up with orphaned data, a second application that might want to integrate can still never be 100% certain that foreign keys refer to existing records. It comes down to designing for the future.

The answer to speed is to build your database with referential integrity, drop or disable your constraints and indexes before a bulk load, and re-enable them after the bulk load. It will increase the duration of your bulk load operation over not using constraints at all, but it will be much faster than leaving them enabled and checking them for each insert. So use referential integrity: the pros outweigh the cons.

Reason #4: Mutually exclusive relationships

Too often I’ve seen databases with a foreign key that relates to one of five tables based on the value of a char(1) field. The space conserving mindset that comes up with this implementation is admirable, but it produces far too many negative side effects.

What happens when the char(1) field gets out of sync with the foreign key field? What happens when someone deletes the foreign record or changes its primary key? More orphaned data happens.

The solution is to use five fields that each refer to a single table. You may have more nullable fields that take up more space in the database, but it’s worth it in the long run.

Conclusion

Well, hopefully I’ve convinced you to avoid the urge to be a lazy data modeler, design for the future, use a data modeling tool, and drop constraints during bulk load operations. In short, always use referential integrity. But if not, hopefully you’ll at least understand when people curse your name several years from now. :)

14 comments:

McGurk said...

I agree with you 100%. Only fuckups don't use RE in their databases.

I'm currently working on a web frontend for a software product with a massive backend. We're talking about two central databases with over 80 tables, and as the installation grows additional databases are added, each of those having about 30 tables, some of which contain FK's from the primary two databases. None of these databases have RE enforced. And none of the CrUD queries of the main program are run in transactions, so orphans are an every day occurrence.

Whereas Reason #1 may seem to be correct in respects to avoiding premature optimization, this project started out exactly this way. And now we have clients with 30-40 odd databases in one system, each one with FK's from the main database, none of which having any RE at all. Its a complete clusterfuck.

Reason 4 is also something that should NEVER EVER be done. This same project has a field in a table which, when int column X is a certain value, is a FK for one of the security tables, and when column X is a different value, is a long value used for sorting and other functions. How fucked up is that? People who do 4 should be flayed alive and staked out over a fire ant hill.

I think you can sum up this whole post with one simple statement: Don't let idiots design your database.

Brett said...

Nice article and I agree 100% with your points.

RE is essential, all the more so when on an Agile project. I would highly stress making any database system as RIGID as possible when it comes to columns and tables.

In the future you can always relax these constraints when necessary. It's more akin to incremental programming in my mind.

MattK said...

Reason #5: "Because SQL is hard! And ORMs keep all my RI in the application!"

Seriously(/sadly), I have heard this one all too often recently.

Anonymous said...

On #4...what if there are 200 tables? Then what?

Bradrick Manor said...

"On #4...what if there are 200 tables? Then what?"
This is a very vague question. I have very rarely seen this type of situation and I was able to come up with another solution that still allowed me to use a foreign key. Bottom line though, if you have to, create 200 columns. This will be a lot less work then trying to diagnose a correct an issue later.

Do you actually have this problem, or are just being argumentative?

Carlos Alberto Costa said...

On item 4, what should I do if my database has composite keys and those fields are part or the primary key or if the table is an association table in an n to n association?

Lee Richardson said...

McGurk: If I had a nickle for every time you make me laugh... Sorry to hear about the work situation, I feel for you.

Lee Richardson said...

"On #4...what if there are 200 tables? Then what?"

Bradrick Manor: Thanks, I couldn't have said it better myself.

Lee Richardson said...

Carlos Alberto Costa: I'm not sure I understand. Could you provide an example?

Incidentally, if by composite key you mean more than one natural key making up the primary key then I'll tell you all about how I disagree with that in my next post on surrogate keys. I suspect this view will be a little more controversial than the RE one.

Alex said...

If there are 200 tables which need to be referenced by a pure key field in the same table, that means you have 200 entities with common functionality. In this case you should probably put some multi table inheritance in place so you can make a decent OO representation of these 200 entities.

This way the 200 tables all reference one parent table which contains a unique id which can be used as the foreign key.

Andras Ludanyi said...

I agree with most of the things pointed out here, but I have to add that the "database is the foundation" approach is a mistake also and a big one. The business process (what to do and then how to do) should be the "foundation" of the application and the database should serve the application just as any other part do. Database centric approaches are one of the very most causes of bad design in the first place.

Anonymous said...

Do you really want to add a column for every new type of commentable, and have hardcoded switches all over your code?

The right way to to implement a polymorphous relationship is a string field `commentable_type` that contains the name of the table the foreign key `commentable_id` points to. That's flexible, DRY and does without magic numbers, while conserving referential integrity.

fisherv said...

Quote from the previous -
"The right way to to implement a polymorphous relationship is a string field `commentable_type` that contains the name of the table the foreign key `commentable_id` points to. That's flexible, DRY and does without magic numbers, while conserving referential integrity."

This DOES NOT conserve referential integrity. That is the whole point. You CANNOT define a foreign key constraint on a column that maps to ids in multiple tables.

priya said...

please tell me how should I develop a mini database project on population management.. Im little confused and scared since Im doing it alone.. give some idea.