The Normalised vs. De-normalised data debate

25 Apr 2012 - 19:52 UTC.

Recently I've been reading a lot of "experts" on sites like Stack Exchange claiming that de-normalised data structures out-perform normalised ones. The simple fact is that you should know the business problem and the data before making any recommendation, and you should test the impact of your design against realistically large volumes of test data. A system with a few hundred rows will behave very differently from a system with a few hundred million rows. At the end of the day you need a system that can run in a business-as-usual environment without constant maintenance just to stop it falling over.

One argument given is that normalised data is smaller on disk, but that disk space is cheap. Both are true, but system memory and CPU cycles aren't cheap, and the window for fitting a tape backup into an overnight process tends to be very limited! Big data isn't about the size it takes to store; it's about the value it offers your business.

A correctly designed normalised system is not only smaller on storage and faster for updates, it can also be quicker for seeks, because you only have to look up a unique value in a reference table and then use its key against the foreign table. The benefit is even greater when the lookup would otherwise be a string operation, because string comparisons are slower than numeric ones. Normalised data is also easier to build a memory-efficient cache around, which uses less system memory and so makes fewer calls to the source system. It may take a bit more thought than a flat file, but get it right and you will have a system with far better scalability, and usually system operators who are happier that you aren't hogging resources.
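As a sketch of that seek pattern, here is a minimal example using SQLite via Python, with hypothetical `status` and `orders` tables: the string comparison happens once, against the tiny reference table, and the large table is then searched on the integer key.

```python
import sqlite3

# Hypothetical normalised schema: a small "status" reference table, and an
# "orders" table that stores only the integer key, never the repeated string.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE status (
        status_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL UNIQUE
    );
    CREATE TABLE orders (
        order_id  INTEGER PRIMARY KEY,
        status_id INTEGER NOT NULL REFERENCES status(status_id)
    );
    CREATE INDEX ix_orders_status ON orders (status_id);
    INSERT INTO status (name) VALUES ('pending'), ('shipped'), ('cancelled');
""")
con.executemany("INSERT INTO orders (status_id) VALUES (?)",
                [((i % 3) + 1,) for i in range(1000)])

# 'shipped' is compared once against the 3-row reference table; the
# 1000-row orders table is then searched on the numeric key via its index.
shipped = con.execute("""
    SELECT COUNT(*)
    FROM orders AS o
    JOIN status AS s ON s.status_id = o.status_id
    WHERE s.name = 'shipped'
""").fetchone()[0]
print(shipped)  # 333 of the 1000 rows carry status_id 2 ('shipped')
```

The same shape applies in any engine: the wide string column lives once in the reference table, and every other table carries only a narrow numeric key.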

I've worked on pulling data out of a ~2 TB system and normalising it down to 35% of that size. The 2 TB system was unwieldy and most of the work done on it was string-based. There were queries that took ~7 hours to process one day's worth of auditing; once I got to grips with them they were running a month's worth of data inside 8 minutes. That came from decent normalisation combined with efficient indexes and tidier SQL. The other advantage is that while the new queries are running, other things can still operate on the server, which wasn't the case before!

My point is that no-one should blindly recommend de-normalised data; it's laziness about understanding the real business problem, and it causes more issues when trying to ensure data quality, scalability and performance. The other point is that if you're going to volunteer something as a self-professed "expert", give valid insight into why it worked for you and why you think you're right. If all you have is word of mouth, you shouldn't be volunteering an answer, merely offering a point of reference for others to do their own research.

People should be encouraged to understand the system they are using, and to appreciate that the combination of good design, good indexes and even indexed or materialised views can deliver exceptional performance. The design needs to be based on needs.
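To make the index point concrete, here is a minimal sketch (again SQLite via Python, with a hypothetical `audit` table): the same range filter goes from a full table scan to an index search once the filter column is indexed. The exact plan wording varies by engine and version.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE audit (
        audit_id  INTEGER PRIMARY KEY,
        logged_at TEXT NOT NULL,
        detail    TEXT
    )
""")
query = "SELECT * FROM audit WHERE logged_at >= '2012-04-01'"

# Without an index on the filter column, the planner must scan every row.
before = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()[0][-1]

# With an index, the planner can seek straight to the qualifying range.
con.execute("CREATE INDEX ix_audit_logged_at ON audit (logged_at)")
after = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()[0][-1]

print(before)  # e.g. "SCAN audit"  (full table scan)
print(after)   # e.g. "SEARCH audit USING INDEX ix_audit_logged_at (logged_at>?)"
```

Checking the plan like this, against production-sized data, is exactly the kind of testing the design decision should rest on.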
