- Catastrophic - Failure of the invocation renders the consumer invalid; in other words, future calls by the consumer (not just to the failed service) are invalid as a result of this failure. It's a really bad design position to be in when this happens, but there are some cases where it is possible (for example hardware failure); mostly it comes down to poor design decisions.
- Temporal Failure - Failure of the invocation renders the current processing of the consumer invalid. This is a very normal scenario, for instance when you run out of database connections or when a network request fails. It is the default for most systems: the failure simply propagates across the network.
- Degraded Service - Failure means the consumer can continue but is not operating at optimal capacity. This is a great place to be, because it means the service is coping with failure rather than propagating it around the network.
- DCWC - Don't Care, Won't Care - The invocation was an embellishment or optimisation that can be safely ignored. An example would be a global logging system whose backup is a local file. From the consumer's perspective there is no change in QoS or operation; there are some operational management implications, but these have well-defined workarounds and are not related to the core business operation.
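The four categories above can be sketched as an explicit classification that a consumer dispatches on. This is a minimal illustration; the enum values and the `handle` function are hypothetical names, not part of any standard API.

```python
from enum import Enum, auto

class FailureMode(Enum):
    """The four failure categories described above."""
    CATASTROPHIC = auto()   # consumer itself is now invalid
    TEMPORAL = auto()       # current processing fails, consumer survives
    DEGRADED = auto()       # consumer continues at reduced capacity
    DCWC = auto()           # don't care, won't care - safe to ignore

def handle(mode: FailureMode) -> str:
    """Illustrative dispatch on failure mode (actions are placeholders)."""
    if mode is FailureMode.CATASTROPHIC:
        return "shut down consumer"
    if mode is FailureMode.TEMPORAL:
        return "abort current request, propagate error"
    if mode is FailureMode.DEGRADED:
        return "continue with reduced capability"
    return "ignore and continue"
```

Making the category explicit like this forces each invocation's failure handling to be a design decision rather than an accident.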
The goal of a high-availability SOA is to get everything into the Degraded or DCWC categories. This means planning the failure handling right from the start. Part of the question is whether high availability is really business justified; if it is, then it's time to start understanding how to fail.
Primary, Secondary and Beyond
The easiest way to provide a fallback is to have redundant primary nodes. This is a very common solution for hardware failure (clusters) and can also be used to solve network connection issues. One approach is to host multiple primary services in different environments, so that if one environment fails the operation continues seamlessly. The trouble is that redundant primaries are often prohibitively expensive, as they can require complex back-end synchronisation of data with all of the problems that this brings.
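The redundant-primary idea can be sketched as follows: several equivalent endpoints in different environments, any of which can serve the call, so the failure of one environment is masked from the consumer. This is a minimal sketch; the endpoints are stand-in callables, not real service bindings.

```python
import random

def call_redundant(primaries, request):
    """primaries: a list of equivalent callables hosted in different
    environments. Try them in a random order so load spreads; the
    failure of one primary is masked by falling through to another."""
    endpoints = list(primaries)
    random.shuffle(endpoints)
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint(request)
        except Exception as e:
            errors.append(e)
    raise RuntimeError(f"all primaries failed: {errors}")
```

Note that this sketch sidesteps the expensive part the text warns about: keeping the back-end data behind those primaries synchronised.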
Next up is the idea of having secondary, or further, services or routes. As a simple example, let's take an order service that is trying to submit the order to the actual transactional system. The transport routes might fall back like this:
- Submit request via WS over internet - Failure on exception
- Submit request via WS over VPN - Failure on exception
- Submit request via WS over JMS - Failure on exception to connect
- Log to file - Failure on exception to connect
For the payment step itself, the providers might fall back like this:
- Primary - GeraldCash - Retry once on network failure, raise Critical on 2nd failure
- Secondary - FredPay - Try once, raise warning on failure
- Tertiary - Bank of Money - Retry once on network failure, raise Emergency on 2nd failure
- Final - Route user to call centre
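The fallback chains above can be sketched as an ordered list of routes, each tried (with its own retry count) before falling through to the next, ending at the "route user to call centre" backstop. This is a sketch only; the route names and `RouteFailed` exception are illustrative, not a real API.

```python
class RouteFailed(Exception):
    """Raised by a route (e.g. WS over internet, WS over VPN) on failure."""
    pass

def submit_with_fallback(order, routes):
    """routes: list of (name, submit_fn, retries) tuples, in priority
    order. Return on the first success; on failure, retry the route up
    to its retry count, then fall through to the next one."""
    for name, submit, retries in routes:
        for attempt in range(retries + 1):
            try:
                return submit(order)
            except RouteFailed:
                continue  # retry this route, then fall through
        # Real code would raise the Critical/Warning/Emergency alert here.
    # Final fallback from the example above: no route worked.
    return "route user to call centre"
```

In a production system each route would also raise its configured alert (Critical, Warning, Emergency) as it fails; that is omitted here to keep the chain itself visible.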
So planning for failure means understanding what else could be done and what other calls could be made that would deliver the required functionality, but within different bounds than the original request.
Planning for failure makes it critical to protect your service from those it is calling. Using a proxy pattern is an absolute minimum operational requirement, because it is in that proxy (potentially via some infrastructure elements) that this failover can be done. You do not want the core business logic worrying about the failover tree; it just needs to be able to cope with the changing capabilities available to it. This means you need to design the system to have a set of failure modes even if you can't see that being needed today; this doesn't mean you build the failure modes now, but that you put the basics of the framework in place to enable them. And is a proxy really that much of an overhead? Nope, I don't think so either.
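The proxy idea can be sketched like this: the business logic depends only on the service interface, while the proxy owns the failover tree behind it. The class and method names here are hypothetical, chosen only to mirror the order example above.

```python
class PaymentService:
    """The interface the core business logic depends on."""
    def submit(self, order):
        raise NotImplementedError

class FailoverProxy(PaymentService):
    """Proxy that hides the failover tree from the business logic.
    It looks like a single PaymentService but delegates to a priority
    list of backends, trying each in turn."""
    def __init__(self, backends):
        self._backends = backends

    def submit(self, order):
        last_error = None
        for backend in self._backends:
            try:
                return backend.submit(order)
            except Exception as e:
                last_error = e  # fall through to the next backend
        if last_error is None:
            raise RuntimeError("no backends configured")
        raise last_error
```

The business logic holds a `PaymentService` reference and never learns whether it is talking to GeraldCash, FredPay, or a fallback; swapping or extending the failover tree is then purely a wiring change in the proxy.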
Planning for failure should be a minimum requirement for an SOA environment. This means planning for both functional and system failures and understanding the risks and mitigations they present.