Wednesday, May 06, 2009

Do clouds require redundancy?

In a Twitter debate with Dan Creswell and Andy Hedges, one of the topics was what a cloud is and what it requires. In 140 characters Andy tweeted:
Cloud Computing: the commoditisation and associated automation in provisioning and management of execution contexts.
Which isn't bad for 140 characters. One of the questions, though, was what level of resilience and redundancy a cloud needs to provide and, associated with that, what type of failover it should deliver.

Dan has made a good case for cloud being about the provisioning API, and I agree that is at the heart of cloud computing: the automated provisioning of virtual machines underpins cloud. But what else? If it's simply about that provisioning and management then my iPhone has a cracking application that enables me to start (provision) and stop Parallels instances on my Mac. It doesn't enable me to provision or deploy new instances (although in theory it could) but I'd argue that my little MacBook Pro would struggle to be described as a cloud environment.

So what is important in cloud? The first piece I'd add to the 140 characters would be failover. If my Mac fails then the images are toast. A cloud should have a level of virtualised redundancy, which means that if a single piece of hardware fails my image should not. This means that running images must not only be portable but must also be actively failed over, either by continually running across multiple hardware instances or by some form of active backup.

If a "cloud" solution is still tied to "box fails, you fail" then for me that is a FAIL in terms of really moving us on into a virtualised world. If my redundancy is not virtualised then a major part of my SLA isn't virtualised, and I still have to treat the failover of individual machines as important.

The follow-on question is what level of redundancy is required. Can it all be in a single rack, where a back-plane failure could take it all down? Can it be in two different parts of the same DC, so that a major power outage and UPS failure takes it down? Should it be in multiple DCs in a tied pair or triplet, where a major natural disaster like an earthquake could toast it, or must it be geographically redundant?

I'd say that it will truly be cloud when it achieves the latter, but the bar needs to be a little bit lower, so I'll say there are two classes of clouds and multiple levels:

Automatically resilient clouds
These are clouds which provide true virtualisation of compute resources and effectively eliminate the need for hardware failure considerations. Virtual Machines can still fail for software reasons (hello Blue Screen of Death) but hardware and storage failure are managed to a given level.
  1. Platinum Clouds - Automatic geographical redundancy
  2. Gold Clouds - Automatic multi-DC redundancy
  3. Bronze Clouds - Automatic in-DC redundancy
  4. Stone Clouds - Automatic proximity redundancy (shared hardware infrastructure)
  5. Lead Clouds - None of the above
Managed Resilient Compute Grids
These are solutions, often termed clouds, where the application can fail for hardware reasons and where you need to manage against hardware failures. Software failures clearly still occur too, but the main point is that from a management perspective you have exactly the same redundancy concerns as with a physical server.
  1. Platinum Compute Grids - Ability to manage and provision your images across multiple Geos
  2. Gold Compute Grids - Ability to manage and provision your images across multiple DCs in a single geo
  3. Silver Compute Grids - Ability to manage and provision your images within a DC
  4. Lead Compute Grids - Everything goes to the same set of racks
The point here is about what virtualisation means and what it provides. If a cloud still links you to hardware failure then it's a great compute grid, but it still requires you at a fundamental level to worry about the hardware, whether that be CPU or disk. A cloud should virtualise that, meaning that only software failures are the issue; the hardware has been truly virtualised and scaled.
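
To make the two classes and their levels concrete, here is a rough Python sketch of how an offering might be sorted into the tiers listed above. It is only an illustration under stated assumptions: the `Placement` fields, the scope names and both functions are hypothetical, and it assumes that "proximity redundancy" means separate hosts sharing a rack and that "in DC" means separate racks in one data centre.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Placement:
    """Where one copy of an image runs (hypothetical fields, for illustration)."""
    geo: str
    dc: str
    rack: str
    host: str


def redundancy_scope(copies: Iterable[Placement]) -> str:
    """Return the widest failure domain that the copies are spread across."""
    copies = list(copies)
    if len({c.geo for c in copies}) > 1:
        return "geo"
    if len({(c.geo, c.dc) for c in copies}) > 1:
        return "multi-dc"
    if len({(c.geo, c.dc, c.rack) for c in copies}) > 1:
        return "in-dc"  # assumption: different racks in the same DC
    if len({(c.geo, c.dc, c.rack, c.host) for c in copies}) > 1:
        return "proximity"  # assumption: different hosts sharing a rack
    return "none"


# Tier names are taken from the two lists above.
CLOUD_TIERS = {
    "geo": "Platinum Cloud",
    "multi-dc": "Gold Cloud",
    "in-dc": "Bronze Cloud",
    "proximity": "Stone Cloud",
    "none": "Lead Cloud",
}
GRID_TIERS = {
    "geo": "Platinum Compute Grid",
    "multi-dc": "Gold Compute Grid",
    "in-dc": "Silver Compute Grid",
    "proximity": "Lead Compute Grid",
    "none": "Lead Compute Grid",
}


def classify(copies: Iterable[Placement], automatic_failover: bool) -> str:
    """Automatic failover makes it a cloud; manual failover makes it a compute grid."""
    tiers = CLOUD_TIERS if automatic_failover else GRID_TIERS
    return tiers[redundancy_scope(copies)]


if __name__ == "__main__":
    spread = [Placement("eu", "dc1", "r1", "h1"), Placement("us", "dc7", "r3", "h9")]
    print(classify(spread, automatic_failover=True))
    print(classify(spread, automatic_failover=False))
```

The only thing that separates the two classes in this sketch is the `automatic_failover` flag: the same physical spread is a cloud when the platform fails images over for you and a compute grid when you have to manage it yourself.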


PetrolHead said...

"It doesn't enable me to provision or deploy new instances (although in theory it could) but I'd argue that my little MacBook Pro would struggle to be described as a cloud environment."

Indeed but I think now we're into the land of SLA's like around what is acceptable resilience come along:

(1) Loss of power in an appropriate sized geographical unit shouldn't be a problem.

(2) Trucks crashing into the "cloud facility" shouldn't be a problem.

(3) Loss of network connectivity shouldn't stop me from provisioning (though maybe I can't reach some execution contexts for a while).


I'd accept that many enterprises would like a cloud that provides "reliable" execution contexts, might even work at a certain level of application-scale but we all know that above that level, this isn't sustainable, it costs too much and doesn't actually deliver on the promise (stuff is always failing just because there's so much stuff).

PetrolHead said...

Oh and do please excuse my appalling typing :)

"Indeed but I think now we're into the land of SLA's like around what is acceptable resilience come along:"

How about:

Indeed but I think we're now into SLA's around what is acceptable resilience.

Andy Hedges said...

Doesn't look like my comment from yesterday posted. Perhaps a DC went down :p

In summary, what you are describing is the SLAs, and a cloud doesn't need to have any particular SLA to be a cloud other than the characteristics I described. In the same way a database doesn't need a level of uptime to be a database, or an app server doesn't need a certain response time to be an app server. Sure, they are nice things to have, critical in some cases, but that doesn't alter the useful definition.

Most organisations using cloud-type architectures will want to specify differing SLAs from other orgs, and most organisations supplying cloud 'solutions' will guarantee differing SLAs. That is why I feel it isn't useful to have this in the definition.

Steve Jones said...

Dan, I agree it's about the SLAs, which is sort of what I'm aiming at with the levels of cloud. Smart folks can turn Lead into Gold themselves, of course, while the unwashed majority will struggle to even turn Gold into Gold.

Andy, I'm talking about two different elements: one is around the SLA, the other is around the concept. Clouds, I contend, have resilience above hardware; compute does not. Yes, people will want different levels (Gold, Silver, Lead, etc) and different things (cloud, compute, etc) but my contention is that cloud needs to have the resilience. Smart people can build a cloud from compute; smart people are rare.

Ernest Moretti said...

Hi!! The thing is new to me. But after reading the whole write-up I got the thing clearly. Thanks for giving so much light on this topic. Two classes of clouds and multiple levels concept is bit hazy. Can you give some more info on this?