Thursday, April 19, 2007

Five Nines in a Service Oriented World - Part 2(d) - Understand the time criticality of information

Sure, it's great to have the latest prices or the latest stock figure, but what do we mean by latest? What it was yesterday? An hour ago? A second ago? And sure, it's great when newer systems can give more accurate information than the overnight batch-processed data warehouse solutions of a few years ago. But just because you can now get the information in "real"-time doesn't mean that if you can't get it right now you have to keel over and die. If you are making a trade on the stock market it's essential to have the price as accurate as possible, but what if you can't get to a car maker's product catalogue? Is it okay to show the same price as yesterday?

Caching is your friend in this world, and this isn't just caching in the sense of "don't ask again for 10 minutes" à la HTTP; it's much more about "this information is valid for 1 hour, but feel free to ask me if I'm available". This is another form of failure tree: in this model the request is made for the "real"-time information; if that succeeds the result is cached, and if it fails a check is made to see whether the information is in the cache and still viable. The important piece here is understanding the impact of relying on that cached information: a cached stock-level, for instance, might mean a customer won't get their product in 3 days' time.
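To make that failure tree concrete, here is a minimal sketch in Python of the "ask the real-time source, fall back to a still-viable cache entry" pattern. The names (`get_price`, `fetch_realtime`, the 1-hour TTL) are my own illustrative assumptions, not a specific product's API:

```python
import time

CACHE_TTL = 3600  # assumption: cached information stays viable for 1 hour

_cache = {}  # key -> (value, timestamp of last successful fetch)

def get_price(key, fetch_realtime):
    """Try the real-time source first; on failure, fall back to the cache."""
    try:
        value = fetch_realtime(key)
        _cache[key] = (value, time.time())  # success: refresh the cache
        return value, "realtime"
    except Exception:
        entry = _cache.get(key)
        if entry is not None:
            value, stamp = entry
            if time.time() - stamp <= CACHE_TTL:
                return value, "cached"  # stale but still within tolerance
        raise  # no viable cached copy: now the failure really propagates
```

The second element of the return value tells the caller which branch of the failure tree was taken, so the application can apply the right tolerance (or customer messaging) for cached data rather than treating every source outage as a hard failure.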

The way I tend to think about these cases is in two basic scenarios. The first is easy, namely "it's in real time", which is what everyone is striving towards today; the second is what people are moving away from, namely "imagine if it was still nightly batch". You can include finer degrees of granularity if you need to, but I find considering it in those two ways helps you understand what information is really time critical and what information is just time convenient.

The goal then is to design the system so that either the time of the information is irrelevant (unlikely) or the system has a decent set of tolerances around what the different time-scales mean. I'd recommend having a set of bands which indicate the boundaries, and detailing the impacts at each of those levels. So for example:
  1. "Real"-Time - perfect operation
  2. 0-10 minute delayed - 95% probability of accuracy, don't inform customer unless there is an exception
  3. 10-120 minute delayed - 75% probability of accuracy problems, inform the customer they will get a later email confirming status if it changes
  4. 120+ minutes delayed - pot luck on whether it works; tell the customer to expect an email in the next 24 hours with their delivery date.
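The bands above can be sketched as a small policy function. This is a hypothetical illustration of mapping the age of the information to an operating band and its customer-facing action; the boundaries simply mirror the example list:

```python
def staleness_band(age_minutes):
    """Map how delayed the information is to a band and its impact."""
    if age_minutes <= 0:
        return "real-time", "perfect operation"
    if age_minutes <= 10:
        return "0-10 min", "don't inform the customer unless there is an exception"
    if age_minutes <= 120:
        return "10-120 min", "inform the customer a later email may revise status"
    return "120+ min", "tell the customer to expect a delivery-date email within 24 hours"
```

Having the policy in one place means the decision about sub-optimal operation is designed up front, rather than improvised in every error handler.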
Not having real-time information shouldn't mean failure in lots of applications, but too often people take the easy "something failed so I should fail" route out. This again is different from the failure cases mentioned previously and the concept of minimal operation: those deal with outright failures and have a mitigation, while this is about coping with sub-optimal operation.

