Thursday, March 22, 2007

Five Nines in a Service Oriented World - Part 1 the problem

SOA systems are more prone to failure than traditional IT systems. SOA systems which rely on "the Web", HTTP or other network-centric approaches are more prone to failure than those which rely on a conceptual model and then implement locally. I'd say it's staggering how Peter Deutsch's 7 fallacies of network computing are being ignored in this latest technology-driven approach (and I'm talking here about SOD IT rather than Business SOA), but unfortunately it's not staggering; it's exactly what is to be expected from an industry that loves to re-invent the wheel and where every 3-5 years it looks like yet another bunch of teenagers who "know best" are pushing the new "cool" approach. The teenagers are of course the vendors and the technology fan-boys.

But isn't the "Web" really reliable? Isn't SOA going to make systems more reliable? Errr no, it isn't, if we carry on designing systems in the same bad old ways and just thinking that designing for distribution is the same as designing for a single box. The Internet (as opposed to the Web) is reliable, but that is because it was designed to be; the Web wasn't.

The question raised at my meeting this week was "how many nines would a service need to have to support five nines at the system level?" Now the mathematically challenged out there (and I've seen them) will do the following...
We need an average of five nines... but because of the "weakest link" principle it's probably safest to have everything at five nines
And before anyone says "no-one would be that stupid", I've seen this sort of thinking on many occasions. So what is the real answer? Well, let's assume a simple system of 6 services.

To make this simple, the central service A is the only one that relies on any other services, and A's code and hardware are "perfect", so the only unreliability in A comes from its calls to the other 5 services. The question therefore is how reliable must they be for A to achieve five nines? There is a single capability that has to be enacted on each service, and that capability results in a change to the system's state (not just reads). The reliability of A is the combined probability that all of the interactions succeed, i.e. the product of the reliabilities of the individual interactions:

R(A) = R(1) x R(2) x ... x R(n)

So if we assume that all interactions are equally reliable (makes the maths easier) then to find the reliability of A we have:

R(A) = R^n, where R is the reliability of a single interaction and n is the number of interactions. Turning that around, each interaction needs a reliability of R = R(A)^(1/n).
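As a rough sketch of that calculation in Python: this assumes the interactions fail independently of one another, and the function and constant names are illustrative rather than anything from the original post.

```python
import math

FIVE_NINES = 0.99999  # 99.999% reliability


def composite_reliability(interaction_reliabilities):
    """Reliability of A as the product of the reliability of every
    interaction it depends on (assumes independent failures)."""
    return math.prod(interaction_reliabilities)


# Ten interactions, each at five nines (the "make everything five nines" view).
reliability_of_a = composite_reliability([FIVE_NINES] * 10)
print(f"A with ten five-nines interactions: {reliability_of_a:.7f}")
```

With ten interactions each at five nines, A itself comes out at roughly four nines, which is why "everything at five nines" does not deliver five nines at the system level.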

Note that here we aren't including performance as a measure, but it should be pretty clear that a networked request is slower than a local request and that things such as reliability over a network have a performance cost.

Scenario 1 - WS without RM
In this scenario we have standard WS-I without Reliable Messaging, so we have two points of unreliability for each call: the network and the service itself. With 5 services this means there are ten interactions. Feeding that into the formula, each interaction (and therefore each service) needs to support six nines of reliability in order to deliver five nines for A.

Scenario 2 - WS with RM
Now using RM (or SOAP over JMS) we can partly eliminate network failure from our overall reliability, so now we are thinking about five interactions, which gives us a 99.9998% availability requirement on the services. Still pretty stunning (and a great example of diminishing returns).

Scenario 3 - REST no GET
In this scenario using REST it's assumed that the URIs for the resources are already available and no searching is required to get to the resources; this means that returns from each service contain the link to the resource on the other service. This is exactly the same as the standard WS scenario, so again it's six nines required.

Scenario 4 - REST with GET
In this approach we are doing a GET first to check what the current valid actions are on the resource (dynamic interface). This means that we have twice the number of calls over the network, which means twice the number of interactions (twenty). This gives a pretty stunning reliability requirement of 99.99995% for the services.
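The four scenarios can be checked with a short Python sketch. It assumes equally reliable, independent interactions and uses the interaction counts given above; the names `required_reliability` and `SCENARIOS` are made up for the illustration.

```python
TARGET = 0.99999  # five nines for service A


def required_reliability(target, interactions):
    """Per-interaction reliability R such that R ** interactions == target."""
    return target ** (1.0 / interactions)


# Interaction counts as stated for each scenario in the post.
SCENARIOS = {
    "1 - WS without RM": 10,
    "2 - WS with RM": 5,
    "3 - REST no GET": 10,
    "4 - REST with GET": 20,
}

for name, interactions in SCENARIOS.items():
    needed = required_reliability(TARGET, interactions)
    print(f"Scenario {name}: {interactions} interactions -> {needed * 100:.5f}%")
```

The more interactions sit behind A, the closer to 100% each one has to be for A to stay at five nines.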

This of course is for a trivial example where we are talking about a single service which calls 5 services which don't call any other services. If we take some of the more "chatty" suggestions that are out there for the "right" way to implement services then we start on down a road where the numbers just get... well, plain silly. If we assume that each of the 5 services itself calls 5 services each time it's called, then the impact goes up in Scenario 4 (6 x 20 interactions) to needing over seven nines of reliability in each of the services. To put that into perspective, that is under 3 seconds of downtime a year. That just isn't sensible, however, so the real question with SOA is
How do I change my thinking to enable services to fail but systems to succeed?
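For the "chatty" case, a small Python sketch makes the downtime figure concrete. It assumes that reliability can be read as the fraction of a 365-day year a service is up, and takes the 6 x 20 interaction count from the text; the helper names are illustrative.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def required_reliability(target, interactions):
    """Per-interaction reliability R such that R ** interactions == target."""
    return target ** (1.0 / interactions)


def nines(reliability):
    """Number of nines, e.g. 0.99999 is about 5."""
    return -math.log10(1.0 - reliability)


def downtime_seconds_per_year(reliability):
    """Allowed downtime per year if reliability is treated as availability."""
    return (1.0 - reliability) * SECONDS_PER_YEAR


needed = required_reliability(0.99999, 6 * 20)
print(f"nines needed per service: {nines(needed):.2f}")
print(f"downtime budget: {downtime_seconds_per_year(needed):.2f} seconds a year")
```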



11 comments:

Anonymous said...

As it happens I am writing a book on SOA patterns and I am currently thinking about the same problem. The way I see it, the situation is much worse than you describe, since the system's reliability (overall, of all the services, not just one) is the product of the reliability of each of the components. So even if each of the 6 services in your example has a reliability of five nines, the overall reliability is 0.99999^6, which is lower, and things deteriorate further the more services you have.

Jason English said...

Business Continuity is an interesting topic. As an SOA testing firm we like to make the distinction that "reliability" at a statistical uptime level, is only part of the challenge of Five Nines. As hardware and bandwidth become more commoditized, that becomes something you can buy fairly easily.

Now we are seeing companies thinking of 5 9's in a Functional sense: Can I trust the accuracy and integrity of the process? "Failure" isn't downtime of the connection or the app server, etc.; it is a failure of the system to provide the right business logic.

In any case, Arnon, thanks for the discussion and we invite you guys to check out our whitepaper on this topic:

http://www.itko.com/site/resources/59s.jsp

Jason - iTKO, Inc.

Steve Jones said...

Arnon,

It's probably worth having a re-read of the post. I say that service reliability is the product of the reliability of interactions, which means that, as you say, if everything is five nines then the actual reliability of A is much lower (in Scenarios 1 and 3 it drops to 4 nines, for instance).

Jason, I agree that accuracy is important, but it's not true to say that availability in terms of hardware and bandwidth is "easy" to buy to meet five nines of interaction reliability (it's part of the fallacies in fact). Next time I'll cover off more about reliability, but assuming that bandwidth and hardware won't fail is not a smart choice, and it's very expensive (look at Google) to make that sort of system reliable. I will read the paper though, cheers for that ref.

Anonymous said...

Steve,
Maybe I didn't explain myself well. What I said is that the system's reliability (where the system is the collection of several services) is the product of the individual services regardless of their interactions.

Unknown said...

Aren't you missing the probability that a service will be called by the system? Your model supposes not only that each service is equally reliable, but that N services will be called at 1/n probability.

Won't this perhaps mitigate the reliability problem somewhat, since (I'm assuming) your actual service usage will track somewhere to the 80/20 rule?

Steve Jones said...

arnon,

Every time you call a service there is a probability that it will not be available. I'm viewing the reliability as that probability, hence the product of interactions, because every time you call a service it could fail, especially when you factor in potential network failure. Now I can see the simple product view, but I tend to err on the side of caution when looking at reliability and think of that window of failure that the service has as being the basis for the calculation. Every time you call the service there is a 0.001% chance of failure... hence product of interactions.


mpeter, in this model I am indeed assuming everything is flat (it makes the maths easier), but I would say that for a network of services 5 service calls is pretty low, and that if you are planning for five nines then an 80/20 rule is something to be applied to contingency and not to calculation.

Anonymous said...

See also an interesting article "The Fractal Nature of Web Services" by Christoph Bussler from Cisco Systems in the March 2007 issue of IEEE Computer magazine. He describes some similar issues with what he describes as the "naive" view of SOA.

You may also wish to consider the impact of lots of dependencies between services from the coupling perspective. Whilst we may strive for "loose coupling" between individual services, the overall effect in the presence of many services is actually quite "tight".

These are just some of the reasons I've *not* been pushing my clients too heavily towards SOA over the past couple of years.

-Mike Glendinning.

Anonymous said...

I agree that this is a complex architectural problem, but it's not always a simple mathematical one. For example, if the availability of a set of services is highly correlated, say because they are a set of services resurfaced from a single legacy application, the availability of the composite will not be a simple product of the individual service availabilities, since there is a high likelihood that if a single service is down they will all be down.

It is also common for a set of services to be supported by the same infrastructure. For the supporting infrastructure you can use asynchronous messaging at times, which decouples the dependencies and changes the equation. Also, if you are using SOA for integration of distributed systems, then the equation would likely be similar for other integration strategies, if the integration has to be synchronous.

You can use fault-tolerant, load-balanced web services supported by clustering technology, which in fact raise the availability factor versus other integration strategies. We have solved the product of the components problem with other technologies; for example, we don't think of the reliability of a car in terms of the product of the components. A car has thousands of components yet we expect it to start more than 99% of the time.

This post is related:

http://blogs.ittoolbox.com/eai/business/archives/capacity-planning-for-the-soa-infrastructure-7699

Eric Roch

chuckls said...

Great analysis Steve, but it gets even worse when you factor in the components of a single service. The service will definitely depend on a web server and possibly an application server and a database server or ERP system and its components, plus the operating system(s) used. The reliability of each one of these systems is a factor in the total reliability product of a service.

Kirk DaCosta said...

In this discussion, as in many others on SOA, it's important to keep in mind that SOA is best employed as a coarse-grained strategy.

It is also important that, when one is designing for, measuring, or attempting to guarantee a certain quality of service, one determines what portions of a composite system are within one's scope of influence and control.

From the tone of this article, one gets the impression that SOA is being considered in the context of physical design for intra-application functionality, that all the endpoints of the SOA system are within the control of the architect and implementers, and that SOA implies Web Services over HTTP. Under those assumptions, I would suggest that SOA is a poor choice.

There may be a case to be made that SOA implementations are inherently less reliable than “five nines”, but the case also needs to be made that SOA can guarantee a higher level of reliability than otherwise possible, particularly in business to business applications and in cases where several applications have need for similar data or functionality.

It may well be ill-advised to offer a blanket service level guarantee of “five nines” to the stakeholders of an SOA system, but when SOA is appropriately employed, it should be fairly easy to make the case that SOA reduces the risk and complexity of a system that must *co*operate in today’s dynamic, interconnected world.
