SLAs are a guarantee that something will be done, and there should be penalties in place when a violation occurs. In theory this should be a simple case of measurement, but all too often measurement is the thing that gets overlooked. People define the SLAs but forget the old adage on KPIs: if you allow someone to measure their own KPIs they will always be successful.
I've been looking for a simple demonstration of this problem for a while, but most examples are specific to a given business so don't work generically. Thanks to the folks in Redmond I've now got a great one. It may or may not be their fault, but it's a good example of measurement problems.
I run Windows XP under Parallels on a MacBook Pro. Now I just do basic office work so I've given it a C: drive with 15GB. This should be a decent amount for the basic office files (Outlook files are stored on a dedicated share). The trouble is I've run out of space....
So what files are taking up the space? Well a quick "WinDirStat" on the C drive (after doing the same exercise with Windows properties) came up with the stat that the files on the drive take up around 5.4GB. This leaves me with around 10GB unaccounted for.
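For anyone who wants to repeat the comparison without WinDirStat, here is a minimal sketch of the same check in Python. It is not what WinDirStat does internally. The function names `summed_file_bytes` and `unaccounted_bytes` are my own, and it assumes the account running it can read most of the drive.

```python
import os
import shutil


def summed_file_bytes(root):
    """Add up the size of every file under root that we are allowed to stat."""
    total = 0
    # onerror swallows directories we cannot list instead of aborting the walk
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symbolic links are not followed
                total += os.lstat(path).st_size
            except OSError:
                # unreadable file: skip it, which makes the total an undercount
                pass
    return total


def unaccounted_bytes(root):
    """Space the volume reports as used minus what the visible files add up to."""
    usage = shutil.disk_usage(root)  # named tuple: total, used, free
    return usage.used - summed_file_bytes(root)
```

Calling `unaccounted_bytes("C:\\")` would give the gap for my case. Files the walk cannot read are skipped, so a large gap is a reason to ask questions rather than proof of anything.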
Here is an example of a producer/consumer SLA. The producer (Windows XP) has committed to provide me with around 15GB of storage for my office files. It then reports that I have violated the client side of the SLA and that it will now perform like a dog. From my side, the information I have is that the producer's performance has severely degraded and I am unable to add more files despite being significantly below the agreed limit.
In this case it is the producer who measures the KPIs, and even though the independent measure suggests the producer is incorrect, it is not possible to challenge the producer's statement.
What this says is that when looking at KPIs and SLAs for services you need to think about independent measurement being part of the basic requirement. This implies that measurement is done at the service boundary by a 3rd party which must track these SLAs over time. Otherwise you'll just end up with Windows saying "disk space full please free up some space" and then telling you that you don't have enough files available to make a dent in the space.
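As a rough sketch of what that third party might record, assume it sits at the service boundary and logs two numbers each time: what the producer claims and what it measured itself. The names `Reading`, `BoundaryMonitor` and `tolerance` are made up for illustration, and the 5% tolerance is an arbitrary choice, not a recommendation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reading:
    timestamp: float          # when the reading was taken
    producer_reported: float  # the producer's own figure, e.g. GB used
    independent: float        # the third party's measurement of the same thing


@dataclass
class BoundaryMonitor:
    tolerance: float  # allowed relative disagreement, e.g. 0.05 for 5%
    history: List[Reading] = field(default_factory=list)

    def record(self, timestamp, producer_reported, independent):
        """Keep every reading so the SLA can be tracked over time."""
        self.history.append(Reading(timestamp, producer_reported, independent))

    def disputes(self):
        """Readings where the producer's figure and the independent one disagree."""
        return [
            r for r in self.history
            if abs(r.producer_reported - r.independent)
            > self.tolerance * max(abs(r.independent), 1e-9)
        ]


monitor = BoundaryMonitor(tolerance=0.05)
monitor.record(0.0, producer_reported=15.0, independent=5.4)  # my disk today
monitor.record(1.0, producer_reported=5.5, independent=5.4)   # a healthy reading
```

With those two readings, `monitor.disputes()` contains only the first one. The point of keeping the history is that the dispute rests on a record neither side controls.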
The final piece therefore is around arbitration. It is quite possible in the setup that I'm using that something is going screwy around Windows that isn't the fault of the producer or the consumer but is due to a 3rd party (e.g. the VM). Once you get to this stage you need to think about arbitration. The challenge here is that the consumer is still due compensation from the producer, but the producer may have a counter claim against the 3rd party. This is the final piece about SLAs in a professional SOA environment. It's important to think "back to back" in your SLAs, otherwise they won't actually mean anything. If a service commits to responding in 20ms but none of the things it relies upon will make any such guarantee, then it's just corporate optimism (at best) or fraud (at worst).
SLA management is a core part of the shift of SOA away from technology and into the business domain. With all the WS-* arguments going around I'm still stunned that there is no WS-Contract or WS-SLA, because it's that which would really separate WS-* from other technical choices.
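The "back to back" idea can be checked mechanically. This is a minimal sketch under a big assumption: the service calls its dependencies one after another, so their guaranteed times simply add up. The function name `is_backed` and its parameters are hypothetical, and `None` stands for a dependency that offers no guarantee at all.

```python
from typing import Optional, Sequence


def is_backed(promised_ms: float,
              own_work_ms: float,
              dependency_ms: Sequence[Optional[float]]) -> bool:
    """True if the promise is covered by the guarantees of what it relies on.

    Assumes dependencies are called sequentially, so their times are summed.
    """
    # One dependency with no guarantee makes the whole promise unbacked
    if any(d is None for d in dependency_ms):
        return False
    return own_work_ms + sum(dependency_ms) <= promised_ms


print(is_backed(20, 5, [5, 8]))     # every dependency commits, total is 18ms
print(is_backed(20, 5, [5, None]))  # one dependency makes no promise
print(is_backed(20, 5, [10, 10]))   # guarantees exist but add up to 25ms
```

Only the first call is backed. The second is the corporate optimism case, and the third is a promise that the service's own suppliers already say it cannot keep.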
SLAs are about
- Defining the terms
- Defining the penalties
- Measuring the operation
- Arbitrating the violation
- Spreading the risk down the chain
(Oh, and if anyone has any idea about the 10GB I'd love to know where it's gone!)