Ever since the relational database became king there has been a mantra in IT and information design: de-normalisation is critical to the effective use of information in both transactional and, particularly, analytical systems. The reason for de-normalisation is read performance in relational models. De-normalisation is always an increase in complexity over the business information model, and it's done for performance reasons alone.
But do we need that anymore? For three reasons I think the answer is, if not already no, then rapidly becoming no. The first is to do with the evolution of information itself and the addition of caching technologies. De-normalisation's performance creed is becoming less and less viable in a world where it's actually the middle tier that drives read performance, via caching and the OO or hierarchical structures that these caches normally take. This is also important because the usage of information changes, and thus the previous optimisation becomes a limitation when a new set of requirements comes along. Email addresses were often added, for performance reasons, as child records rather than using a proper "POLE" model. This was great... until email became a primary channel. So as new information types are added, the focus on short-term performance optimisations causes issues down the road, directly because of de-normalisation.
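To make the email example concrete, here is a minimal sketch of the two shapes. It assumes "POLE" refers to the People/Objects/Locations/Events style of model, with an email address treated as a first-class location; all class and function names are hypothetical and not taken from any particular package.

```python
from dataclasses import dataclass


# Shortcut shape: the email is just an attribute buried on the party record.
@dataclass
class PartyWithInlineEmail:
    name: str
    email: str  # one address, with no identity of its own


# POLE-style shape (assumed reading): the address is an entity in its own
# right, and parties are linked to it through an explicit relationship.
@dataclass(frozen=True)
class ElectronicAddress:
    kind: str   # e.g. "email"
    value: str


@dataclass
class Party:
    name: str


@dataclass
class PartyLocationLink:
    party: Party
    location: ElectronicAddress
    role: str   # e.g. "billing", "marketing"


def parties_reachable_at(links, address):
    """Answer a channel-first question: who can be reached at this address?"""
    return sorted(link.party.name for link in links if link.location == address)


shared = ElectronicAddress("email", "family@example.com")
links = [
    PartyLocationLink(Party("Ann"), shared, "billing"),
    PartyLocationLink(Party("Bob"), shared, "marketing"),
    PartyLocationLink(Party("Ann"), ElectronicAddress("email", "ann@example.com"), "marketing"),
]

print(parties_reachable_at(links, shared))  # ['Ann', 'Bob']
```

With the inline shape, the same question means scanning every party record and comparing strings, and a second address or a role per address needs a schema change. With the linked shape, email becoming a primary channel is just more links.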
The second reason is Big Data taking over in the analytical space. Relational models are getting bigger, but so are approaches such as Hadoop, which encourage you to split the work up to enable independent processing. I'd argue that this suits a 'normalised', or as I like to think of it "understandable", approach for two reasons. Firstly, the big challenge is often how to break down the problem, the analytics, into individual elements, and that is easier to do when you have a simple-to-understand model. Secondly, groupings done for relational performance don't make sense if you are not using a relational approach to Big Data.
The final reason is to do with flexibility. De-normalisation optimises information for a specific purpose, which was great if you knew exactly what transactions or analytics questions would be answered, but is proving less and less viable in a world where we are seeing ever more complex and dynamic ways of interacting with that information. So having a database schema that is optimised for a specific purpose makes no sense in a world where the questions being asked within analytics change constantly. This is different to information evolution, which is about new information being added; this is about the changing consumption of the same information. The two elements are most certainly linked, but I think it's worth viewing them separately. The first says that de-normalisation is a bad strategy in a world where new information sources come in all the time; the latter says it's a bad strategy if you want to use your current information in multiple ways.
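As a rough illustration of the flexibility point, the sketch below keeps facts at the grain the business understands and derives purpose-built groupings on demand. The `orders` data and the `total_by` helper are invented for the example; it is a toy, not a claim about how any real store behaves.

```python
from collections import defaultdict

# Facts kept at the business grain: one row per order.
orders = [
    {"customer": "Ann", "region": "North", "product": "widget", "amount": 10},
    {"customer": "Bob", "region": "North", "product": "gadget", "amount": 25},
    {"customer": "Ann", "region": "South", "product": "widget", "amount": 5},
]


def total_by(rows, key):
    """Sum order amounts grouped by any attribute of the order."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)


# The structure someone might have stored because "region" was the only
# question asked at design time.
sales_by_region = total_by(orders, "region")
print(sales_by_region)  # {'North': 35, 'South': 5}

# A new question arrives. It can be answered from the original facts, but
# not from sales_by_region, which has already thrown the product away.
sales_by_product = total_by(orders, "product")
print(sales_by_product)  # {'widget': 15, 'gadget': 25}
```

If only the regional totals had been stored, the product question would need a re-load from source; keeping the understandable model means each new question is just another derivation.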
In a world where Moore's Law, Big Data, Hadoop, columnar databases etc. are all in play, isn't it time to start from an assumption that you don't de-normalise, and instead model information from a business perspective and then realise that business model as closely as possible within IT? Doing this will save you money as new sources become available, as new uses for information are discovered or required, and because in many cases a relational model is no longer appropriate.
Let's have information stored in the way that makes sense to the business so it can evolve as the business needs, rather than constraining the business for the want of a few SSDs and CPUs.
3 comments:
Can't say I agree. I think real business value is delivered via a multi-pronged approach. It is feasible to:
1) store your data in a relational store and make use of all the advantages that come with that
2) have a complementary read-only (denormalised) model and make use of all the advantages that come with that
3) use an event stream to populate both the relational and denormalised stores
And it is the event stream that adds value in a way that neither relational nor denormalised stores can. It is undoubtedly complex, but if you're looking for rich data that allows you to redefine your view, an event stream is hard to beat.
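The three-part arrangement in this comment can be sketched roughly as follows. The event names, fields and projection functions are all hypothetical, and in-memory dicts and lists stand in for the real relational and denormalised stores.

```python
# An append-only stream of events; this is the source of truth in the sketch.
events = [
    {"type": "CustomerRegistered", "customer_id": 1, "name": "Ann"},
    {"type": "EmailAdded", "customer_id": 1, "email": "ann@example.com"},
    {"type": "EmailAdded", "customer_id": 1, "email": "ann@work.example"},
]


def project_relational(stream):
    """Build normalised 'tables': customers keyed by id, emails as rows."""
    customers, emails = {}, []
    for event in stream:
        if event["type"] == "CustomerRegistered":
            customers[event["customer_id"]] = event["name"]
        elif event["type"] == "EmailAdded":
            emails.append((event["customer_id"], event["email"]))
    return customers, emails


def project_read_model(stream):
    """Build a denormalised, read-only view: one document per customer."""
    view = {}
    for event in stream:
        if event["type"] == "CustomerRegistered":
            view[event["customer_id"]] = {"name": event["name"], "emails": []}
        elif event["type"] == "EmailAdded":
            view[event["customer_id"]]["emails"].append(event["email"])
    return view


customers, emails = project_relational(events)
read_model = project_read_model(events)
print(customers)      # {1: 'Ann'}
print(read_model[1])  # {'name': 'Ann', 'emails': ['ann@example.com', 'ann@work.example']}
```

Because both stores are derived from the same stream, a new view is a new projection function replayed over the existing events, which is one way to read the "redefine your view" point.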
But is it best to design even a relational database in a de-normalised way? Take my email example. There is a package vendor whose initial performance decision causes significant issues now. Had they taken an architecture (POLE) view then they'd be able to adapt and evolve quicker to the changing market, and so would their clients.
I'm just not too sure that the performance requirements justify the down-the-track issues.
Hey Steve. I'm sure you see a lot of ugly arrangements - skeletons in the enterprise IT/IS closet.
I agree, folks should move from cheap workarounds to super-normalization.
For our part, we devolve all data, including code, into a graph which we relate at run-time to optimize normalization based on context. In that way data use can evolve with the business as you suggest.
Robin Bloor wrote a post on us about it - http://bit.ly/gnZ8N2.