But isn't the "Web" really reliable? Isn't SOA going to make systems more reliable? Errr no it isn't if we carry on designing systems in the same bad old ways and just thinking that designing for distribution is the same as designing for a single box. The Internet (as opposed to the Web) is reliable, but that is because it was designed to be; the Web wasn't.
The question raised at my meeting this week was "how many nines would a service need to have to support five nines at the system level?" Now the mathematically challenged out there (and I've seen them) will do the following...
We need an average of five nines... but because of the "weakest link" principle it's probably safest to have everything at five nines
And before anyone says "no-one would be that stupid", I've seen this sort of thinking on many occasions. So what is the real answer? Well, let's assume a simple system of 6 services
To make this simple the central service A is the only one that relies on any other services, and A's code and hardware are "perfect", therefore the only unreliability in A comes from its calls to the other 5 services. The question therefore is how reliable must they be for A to achieve five nines? There is a single capability that has to be enacted on each service and that capability results in a change to the system's state (not just reads). The reliability of A is the combined probability that all of the interactions succeed, in other words the product of their individual reliabilities.
So if we assume that all interactions are equally reliable (makes the maths easier) then the reliability of A is simply the reliability of one interaction raised to the power of the number of interactions.
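Written out as a formula, this looks roughly like the following. The symbols are my own shorthand rather than anything standard: R_A is the reliability of A, r_i the reliability of interaction i, and n the number of interactions. It also assumes the interactions fail independently of each other, which is a simplification.

```latex
\begin{align}
R_A &= \prod_{i=1}^{n} r_i \\
R_A &= r^{n} \quad \text{when } r_i = r \text{ for every } i \\
r &= R_A^{1/n}
\end{align}
```

The last line is the one that matters for the scenarios: fix the target R_A at five nines, count the interactions n, and solve for the reliability r each interaction has to deliver.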
Note that here we aren't including performance as a measure but it should be pretty clear that performance of a networked request is slower than a local request and that things such as reliability over a network have a performance cost.
Scenario 1 - WS without RM
In this scenario we have standard WS-I without Reliable Messaging, so we have two points of unreliability for each call: the network and the service itself. With 5 services this means there are ten interactions. Feeding that into the formula means that each interaction (and therefore each service) needs to support six nines of reliability in order to deliver five nines for A.
Scenario 2 - WS with RM
Now using RM (or SOAP over JMS) we can partly eliminate network failure from our overall reliability, so now we are thinking about just the five service interactions, which gives us a 99.9998% availability requirement on the services. Still pretty stunning (and a great example of diminishing returns).
Scenario 3 - REST no GET
In this scenario using REST it's assumed that the URIs for the resources are already available and no searching is required to get to the resources. This means that returns from each service contain the link to the resource on the other service. This is exactly the same as the standard WS scenario, so again it's six nines required.
Scenario 4 - REST with GET
In this approach we are doing a GET first to check what the current valid actions are on the resource (a dynamic interface). This means that we have twice the number of calls over the network, which means twice the number of interactions (twenty). This gives a pretty stunning reliability requirement of 99.99995% for the services.
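For anyone who wants to check the four scenarios, here is a minimal Python sketch of the sum. The function name `required_reliability` and the scenario labels are mine, the interaction counts are the ones quoted in the scenarios, and it assumes independent, equally reliable interactions, so treat it as back-of-the-envelope rather than a model of a real system.

```python
def required_reliability(target: float, interactions: int) -> float:
    """Per-interaction reliability r such that r ** interactions == target.

    Assumes every interaction is equally reliable and fails independently.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    if interactions < 1:
        raise ValueError("interactions must be at least 1")
    return target ** (1.0 / interactions)


TARGET = 0.99999  # five nines for service A

# Interaction counts as quoted in the four scenarios.
SCENARIOS = {
    "1 - WS without RM": 10,
    "2 - WS with RM": 5,
    "3 - REST no GET": 10,
    "4 - REST with GET": 20,
}

for name, n in SCENARIOS.items():
    r = required_reliability(TARGET, n)
    print(f"Scenario {name}: {n} interactions, {r * 100:.5f}% needed per interaction")
```

Changing `TARGET` or the counts in `SCENARIOS` shows how quickly the requirement climbs as the number of interactions grows.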
This of course is for a trivial example where we are talking about a single service which calls 5 services which don't call any other services. If we take some of the more "chatty" suggestions that are out there for the "right" way to implement services then we start on down a road where the numbers just get... well, plain silly. If we assume that each of the 5 services itself calls 5 services each time it's called, then the impact in Scenario 4 (6 x 20 interactions) goes up to needing over seven nines of reliability in each of the services. To put that into perspective, that is under 3 seconds of downtime a year. That just isn't sensible, however, so the real question with SOA is
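The "under 3 seconds" figure can be sanity-checked with a few lines of Python. This is only a sketch: the helper names are mine, it reads reliability as the fraction of the year a service is up, it ignores leap years, and it keeps the same independence assumption as before.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignoring leap years


def required_reliability(target: float, interactions: int) -> float:
    """Per-interaction reliability r such that r ** interactions == target."""
    return target ** (1.0 / interactions)


def downtime_seconds_per_year(reliability: float) -> float:
    """Downtime allowed per year if reliability is the fraction of time up."""
    return (1.0 - reliability) * SECONDS_PER_YEAR


interactions = 6 * 20  # the nested Scenario 4 count used in the text
r = required_reliability(0.99999, interactions)

print(f"Required per interaction: {r * 100:.7f}%")
print(f"Allowed downtime: {downtime_seconds_per_year(r):.2f} seconds a year")
```

For comparison, feeding 0.99999 straight into `downtime_seconds_per_year` gives the allowance for plain five nines, which is a little over five minutes a year.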
How do I change my thinking to enable services to fail but systems to succeed?