Tuesday, April 17, 2007

Five Nines in a Service Oriented World - Part 2(b) - Planning for failure

At some stage in a service's lifecycle one of the interactions it makes will fail, so start from the assumption that it will fail and start thinking about the impact and cost of that failure. There are four basic scenarios for the failure of an invocation:
  1. Catastrophic - Failure of the invocation renders the consumer invalid, in other words future calls to the consumer (not the service) are invalid as a result of this failure. It's a really bad design position to get into when this happens; there are some cases where it's possible (for example hardware failure), but mostly it's down to the stupidity of the people doing the design.
  2. Temporal Failure - Failure of the invocation renders the current processing of the consumer invalid. This is a very normal scenario, for instance when you run out of database connections or when a network request fails. This is the standard for most systems; it just propagates the failure all over the network.
  3. Degraded Service - Failure means the consumer can continue but is not operating at optimal capacity. This is a great place to be, as it means the service is coping with failure and not propagating it around the network.
  4. DCWC - Don't Care, Won't Care - This is where the invocation was an embellishment or optimisation that can be safely ignored. An example here would be the introduction of a global logging system where the backup is to a local file. From a consumer's perspective there is no change in QoS or operation; there are some operational management implications, but these have well-defined workarounds and are not related to the core business operation.
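As a rough sketch of how a consumer might make these four categories explicit, the snippet below tags each invocation failure with an impact level so the calling code can decide how to react. The FailureImpact and InvocationFailure names are purely illustrative assumptions, not part of any particular framework.

// Illustrative only: a way of tagging invocation failures with their impact
public enum FailureImpact {
    CATASTROPHIC,   // future calls to the consumer are invalid
    TEMPORAL,       // only the current piece of processing is invalid
    DEGRADED,       // the consumer carries on at reduced QoS
    DCWC            // don't care, won't care - safely ignored
}

class InvocationFailure extends RuntimeException {
    private final FailureImpact impact;

    InvocationFailure(String message, FailureImpact impact, Throwable cause) {
        super(message, cause);
        this.impact = impact;
    }

    FailureImpact getImpact() {
        return impact;
    }
}

class ConsumerFailureHandler {
    // How the consumer might react to each impact level
    void handle(InvocationFailure failure) {
        switch (failure.getImpact()) {
            case CATASTROPHIC:
                // future calls are invalid too, so take the consumer out of service
                throw new IllegalStateException("Consumer can no longer be used", failure);
            case TEMPORAL:
                throw failure;   // abort the current processing only
            case DEGRADED:
                break;           // carry on at reduced QoS, note it for operations
            case DCWC:
                break;           // embellishment only - safe to ignore
        }
    }
}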

The goal of a high-availability SOA is to get everything into the Degraded or DCWC categories. This means planning the workarounds right from the start. Part of the question is whether high availability is really business-justified; if it is, then it's time to start understanding how to fail.
Primary, Secondary and Beyond
The easiest way to provide a fallback is to have redundant primary nodes. This is a very common solution for hardware failure (clusters) and can also be used to solve network connection issues. What might be done here is to have multiple primary services hosted in different environments, meaning that if one environment fails the operation continues seamlessly. The trouble is that redundant primaries are often prohibitively expensive, as they can require complex back-end synchronisation of data, with all of the problems that this brings.
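As a minimal sketch of the redundant primary idea, assume a hypothetical OrderService deployed identically in several environments; the invoker below simply works through the list until one environment answers. The hard part - keeping the data behind each environment in sync - isn't shown, and that is exactly where the expense comes in.

import java.util.List;

// Illustrative sketch: identical primaries in different environments, tried in turn
class RedundantPrimaryInvoker {

    private final List<OrderService> primaries; // e.g. data centre A, data centre B

    RedundantPrimaryInvoker(List<OrderService> primaries) {
        this.primaries = primaries;
    }

    OrderConfirmation submit(Order order) {
        RuntimeException lastFailure = null;
        for (OrderService primary : primaries) {
            try {
                return primary.submit(order);   // any primary can serve the request
            } catch (RuntimeException e) {
                lastFailure = e;                // that environment failed, try the next
            }
        }
        // Only reached when every environment has failed
        throw new IllegalStateException("All primary environments failed", lastFailure);
    }
}

interface OrderService {
    OrderConfirmation submit(Order order);
}

class Order { }
class OrderConfirmation { }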

Next up is the idea of having secondary, or greater, services or routes. As a simple example, let's take an order service that is trying to submit the order to the actual transactional system:
  1. Submit request via WS over the internet - Failure on exception
  2. Submit request via WS over the VPN - Failure on exception
  3. Submit request via WS over JMS - Failure on exception when connecting
  4. Log to file - Failure on exception when writing
What we have here is variable QoS: the VPN route is slower, but it uses the same technology and would be expected to return the same responses. The JMS and log-to-file routes are both longer-term operational elements, so any expected responses cannot be assumed to arrive within the required time period. For order submission this might mean a lack of confirmation of the order and no ability to commit to a delivery time. That is looking at Primary/Secondary/etc. from a network connectivity perspective, but this isn't the only way to consider it. Other examples could be from a business perspective, in terms of supplier preferences:
  1. Primary - GeraldCash - Retry once on network failure, raise Critical on 2nd failure
  2. Secondary - FredPay - Try once, raise warning on failure
  3. Tertiary - Bank of Money - Retry once on network failure, raise Emergency on 2nd failure
  4. Final - Route user to call centre
Here we have a great financial deal with GeraldCash, and indeed we expect them to work all the time (hence the criticality is high); them not being available will reduce the margin on a transaction, and they will have to pay penalties. Next up is FredPay: it's cheap enough, but not the best from a reliability perspective. Finally, from a technical delivery perspective we have the Bank of Money, very reliable but the most expensive by a mile. If everything fails and we don't want to lose the transaction, we could try to route it to the call centre for some offline processing.
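Here is a rough sketch of what that supplier-preference failover chain could look like in code. The gateway interfaces, alert levels and call centre queue are all assumptions made up for illustration, not a real payments API; the point is that each fallback trades something away (margin, reliability or cost) rather than propagating the failure to the customer.

import java.util.logging.Level;
import java.util.logging.Logger;

// Illustrative sketch of the primary/secondary/tertiary/final routing described above
class PaymentRouter {

    private static final Logger LOG = Logger.getLogger(PaymentRouter.class.getName());

    private final PaymentGateway geraldCash;   // primary - best deal
    private final PaymentGateway fredPay;      // secondary - cheap but less reliable
    private final PaymentGateway bankOfMoney;  // tertiary - reliable but expensive
    private final CallCentreQueue callCentre;  // final - offline processing

    PaymentRouter(PaymentGateway geraldCash, PaymentGateway fredPay,
                  PaymentGateway bankOfMoney, CallCentreQueue callCentre) {
        this.geraldCash = geraldCash;
        this.fredPay = fredPay;
        this.bankOfMoney = bankOfMoney;
        this.callCentre = callCentre;
    }

    PaymentResult pay(Payment payment) {
        // Primary: retry once, raise a critical alert on the second failure
        try {
            return tryTwice(geraldCash, payment);
        } catch (RuntimeException e) {
            LOG.log(Level.SEVERE, "CRITICAL: GeraldCash unavailable", e);
        }
        // Secondary: try once, warning only
        try {
            return fredPay.charge(payment);
        } catch (RuntimeException e) {
            LOG.log(Level.WARNING, "FredPay unavailable", e);
        }
        // Tertiary: retry once, raise an emergency alert on the second failure
        try {
            return tryTwice(bankOfMoney, payment);
        } catch (RuntimeException e) {
            LOG.log(Level.SEVERE, "EMERGENCY: Bank of Money unavailable", e);
        }
        // Final: don't lose the transaction - hand it to the call centre
        callCentre.queueForOfflineProcessing(payment);
        return PaymentResult.deferred();
    }

    private PaymentResult tryTwice(PaymentGateway gateway, Payment payment) {
        try {
            return gateway.charge(payment);
        } catch (RuntimeException firstFailure) {
            return gateway.charge(payment); // a single retry; a second failure propagates
        }
    }
}

interface PaymentGateway { PaymentResult charge(Payment payment); }
interface CallCentreQueue { void queueForOfflineProcessing(Payment payment); }
class Payment { }
class PaymentResult {
    static PaymentResult deferred() { return new PaymentResult(); }
}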

So planning for failure means understanding what else could be done, and what other calls could be made, that would deliver the required functionality but within different bounds from the original request.

Planning for failure means it's critical to protect your service from those it is calling. Using a proxy pattern is an absolute minimum operational requirement, because it's in that proxy (potentially via some infrastructure elements) that this failover can be done. You do not want the core business logic worrying about the failover tree; it just needs to be able to cope with the changing capabilities that are available to it. This means you need to design the system to have a set of failure modes even if you can't see them being needed today; this doesn't mean you build the failure modes at this time, but that you put the basics of the framework in place to enable them. And is a proxy really that much of an overhead? Nope, I don't think so either.
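As a rough sketch, assuming a hypothetical OrderSubmission interface, the proxy below is the only place that knows about the failover tree; the business logic just receives an outcome telling it whether it got a full answer or is now running at reduced capability. All of the names are invented for illustration.

// Illustrative sketch: the proxy hides the failover tree from the business logic
interface OrderSubmission {
    SubmissionOutcome submit(Order order);
}

class OrderSubmissionProxy implements OrderSubmission {

    private final OrderSubmission primary;    // e.g. WS over the internet
    private final OrderSubmission secondary;  // e.g. WS over the VPN
    private final OrderSubmission lastResort; // e.g. JMS queue or local file

    OrderSubmissionProxy(OrderSubmission primary, OrderSubmission secondary,
                         OrderSubmission lastResort) {
        this.primary = primary;
        this.secondary = secondary;
        this.lastResort = lastResort;
    }

    @Override
    public SubmissionOutcome submit(Order order) {
        try {
            return primary.submit(order);           // full QoS
        } catch (RuntimeException primaryFailure) {
            try {
                return secondary.submit(order);     // same responses, slower
            } catch (RuntimeException secondaryFailure) {
                lastResort.submit(order);           // fire and forget
                // The order is captured but no confirmation is available yet,
                // so tell the caller it is operating at reduced capability.
                return SubmissionOutcome.degraded();
            }
        }
    }
}

class Order { }
class SubmissionOutcome {
    static SubmissionOutcome degraded() { return new SubmissionOutcome(); }
}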

Planning for failure should be a minimum for an SOA environment. This means planning for both functional and system failures and understanding the risks and mitigations that they present.

