Thursday, March 26, 2009

Trains and QoS

One of the challenges of designing a service is understanding the Quality of Service (QoS), both in terms of what it is and, importantly, how it is measured. The point about QoS is that it should be measured from the perspective of the consumer rather than the producer. Historically people have ignored pieces like network connectivity and latency, but this is rapidly becoming nonsense as people look at cloud computing and SaaS as standard parts of an IT infrastructure.

To use an analogy: I live about 30 miles from the office in Soho; Mike Morris lives in south London, about 10 miles away. Now, opening the door to get into the office takes about 10 seconds, and historically this is the SLA that would be given: the QoS would be measured from the point of standing outside the door to being inside, a whole metre.

Now both of us are within walking distance of a train network at both ends, myself on the overground system, Mike on the underground. In other words, we both have network connectivity to the office.

Now here is the point: despite the fact that I live significantly farther away than Mike, it takes us about the same time door-to-door, because my train moves miles faster and stops much less than his underground train. This means that from a QoS perspective our normal QoS is pretty much the same. This is where most people stop, if they bother at all: measuring the end-to-end normal performance. However, to really understand QoS, especially in a more critical area, it's important to look at what happens when things go wrong.

For Mike there are a couple of alternative routes if the main one fails; these take a bit longer (let's say 20-30% longer), and in the worst-case scenario he could walk there in about 2 hours (a 120% increase). This means that from a disaster recovery perspective it's not actually something that is overly worrying from a QoS perspective; it's like switching from a dedicated commercial internet connection to ADSL or, at worst, to dial-up: you can still get there but the performance sucks.

For me, however, we are in a different world of hurt. There are no alternative train routes, so it would be either car (let's say a 150% increase) or, if that was buggered, for instance in the recent snow, then walking, which would take an impressive 11 hours; in other words it wouldn't work at all. Therefore I need an alternative solution, and this comes in the form of home working, where I have a duplication of the office environment (power, light, internet connectivity) and some additional software (VPN) to ensure I can continue working.

This is the point of QoS when looking at Cloud, SaaS and distributed SOA solutions. You've got to think about what happens if the main network routes fail: are there alternatives, can you put in an ISDN line for emergencies? What if the connection really is down and can't be re-established? Can you have a local cache that will support minimal working? Can you degrade so you don't need to use that service at all?
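As a rough illustration of those last two questions, here is a minimal sketch of a consumer-side client that tries the remote service first, falls back to a stale local cache if the network fails, and finally degrades to a default. All the names here (`CachedServiceClient`, `fetch_remote`) are hypothetical, not from any particular product:

```python
import time

class CachedServiceClient:
    """Sketch of a consumer that survives network failure:
    remote call -> stale cache -> graceful degradation."""

    def __init__(self, fetch_remote, cache_ttl=300):
        self._fetch_remote = fetch_remote  # callable; may raise on network failure
        self._cache = {}                   # key -> (timestamp, value)
        self._cache_ttl = cache_ttl        # how long stale answers stay usable

    def get(self, key, default=None):
        try:
            value = self._fetch_remote(key)          # the main "train line"
            self._cache[key] = (time.time(), value)  # refresh the local cache
            return value
        except Exception:
            cached = self._cache.get(key)
            if cached and time.time() - cached[0] < self._cache_ttl:
                return cached[1]                     # the alternative route: stale but working
            return default                           # degrade: minimal working, like working from home
```

The design choice being illustrated is that the fallback behaviour lives with the consumer, because the consumer is the only place where the end-to-end failure is actually visible.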

The point about QoS is that you need to look at the failure conditions of the whole network, not simply the last inch. Many of these elements might be out of your control, and therefore you need to push the QoS back onto the consumer to make sure they are informed about what is their responsibility and what you will guarantee. This is what the cloud and SaaS providers do, but if your job is to ensure the business can use those solutions then you need to be looking at that end-to-end element and concentrating on the connectivity and caching, rather than thinking "the internet is always up and it's always performant".
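Measuring that end-to-end element is straightforward if you do it from the consumer's side. A minimal sketch, assuming nothing more than a zero-argument callable that performs the whole service invocation:

```python
import statistics
import time

def measure_end_to_end(call, samples=5):
    """Time a service call from the consumer's side, so the figure
    includes network latency and not just the time spent inside the
    service (the "last metre" at the office door)."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        call()
        timings.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(timings),  # the normal, good-day QoS
        "worst_s": max(timings),                 # the figure that matters when things go wrong
    }
```

Reporting the worst case alongside the median is the point: two routes with the same good-day median, like the two commutes above, can have wildly different failure behaviour.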

SouthEastern trains provide me with a great QoS on a good day and complete rubbish on rather a lot of days; if I was using a SaaS solution where the network issues were similar, then my perception would be that it is the SaaS solution that is rubbish.

Perception is everything when rolling out new technologies and QoS is the rigour required to make sure the perception is positive.



Rob Eamon said...

Most excellent analogy to illustrate your point.

This might well apply to internal apps/services too. How often does an intermediary get the blame for issues at the end points? How often does the intermediary cause the perception of an end point to be poor?

Alas, the folks that typically see the end result of the total system QoS have no idea what the total system really is. Thus, much time is often wasted in chasing issues in the wrong components.

For example, if you were late getting to the office, how many would assume the train was the cause? Would they even consider that perhaps for some reason you couldn't get in the door? Probably not until after it was proven that the train was not the cause.

GBUK said...

From the point of view of degradation, I note that all our apps using services in my organisation will simply stop working if one of the services is down. We haven't looked at QoS like that! Is degradation something you see often in your experience?

Steve Jones said...

Rob, I agree on the issues. A standard response on one project I was on was "blame the middleware", and it was our job to find the real cause of the problem.

GBUK, service degradation isn't normal, but people like Amazon do it as standard, and in things like air traffic control and other safety-critical areas it's more common. The reality is that it should be normal these days, but people struggle to cope with the idea of "no, it really isn't there, but you should still work".

Anonymous said...

Steve, quality of service (QoS) should also factor in resource management (on-demand, in-flight reservation, service policing) and prioritization (critical or not, paid or not, value management, ...).

This has been pretty much solved at the network layer, but is only recently being adequately addressed at the application layer; at least that is what we hope with the release of a QoS extension we added to our cloud/enterprise software resource metering (and costing) runtime technology.