Friday, January 02, 2009

SOA resilience and the power of virtual machines

One of the things that continues to amaze me when I look at companies' Disaster Recovery policies is how much they concentrate on the backup and how little they concentrate on the restore. Chatting to a CIO in the manufacturing industry a while back, he gave me a great stat on his business:
"For every minute that we are down it costs 200k euros. My DR plan takes two days, that is why I have three redundant data centres and a full set of passive backups."

Simply put, this chap couldn't let his systems go down, so he invested very heavily in making sure that they didn't.

More often, however, people just tick the "backup" box, send it off to tape and then don't worry about bringing a service back up and what that really takes.
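To put a number on that quote, here is a quick back-of-the-envelope sketch. It uses only the two figures the CIO gave (200k euros a minute, a two day DR plan) and assumes, purely for illustration, that the cost accrues at a flat rate for the whole outage. The function and constant names are mine.

```python
COST_PER_MINUTE_EUR = 200_000   # figure quoted by the CIO
DR_PLAN_DAYS = 2                # how long his DR plan takes


def downtime_cost(minutes, cost_per_minute=COST_PER_MINUTE_EUR):
    """Cost of an outage, assuming a flat cost per minute (an assumption)."""
    return minutes * cost_per_minute


dr_minutes = DR_PLAN_DAYS * 24 * 60
total = downtime_cost(dr_minutes)
print(f"{dr_minutes} minutes of downtime -> {total:,} euros")
# prints "2880 minutes of downtime -> 576,000,000 euros"
```

On those numbers, actually invoking the DR plan would cost over half a billion euros, which is why the redundant data centres look cheap by comparison.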

If you've lost a disk then it's a physical job followed by a data restore (you did, of course, test that). If the server is trashed then you have procurement, install and then recovery. If the Data Centre is trashed then you have a much bigger challenge.

This is where companies come in and charge quite a bit of money to have "warm" servers on standby. You pay for them at a certain rate (and more when you actually use them) and then have to do all the rebuild job if disaster strikes.

In a networked world such as SOA this is a big issue, as the failure of one service can have significant knock-on effects. You need to design around this, but there is still the question of the quickest way to get a degraded instance of the service back.

Why degraded? Well, let's say it's a high demand service, you've gone stateless, and you've got 20 Linux boxes running it horizontally when a muppet manages to dig up the network cable into your data centre. It's going to be 3 days to get it fixed to 100% but you need something degraded that works now.

This is where Virtual Machines really kick in. Alongside your normal data backup strategy I'd recommend taking a Virtual Machine backup of the server. In future the VM approach will probably be the normal one, but it has a great job to do right now as part of your DR solution. Take the VM backup at the same time as the data backup and then, if you need to, just fire it up on some commodity hardware that you have lying around. It's going to perform badly, so think about throttling, or fire up a virtual grid with the hardware in the office. This then gives you the space to do the full recovery and get everything up and performing at 100%. Having the VM backup means putting your patch in place as quickly as possible.
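As for the throttling, one common way to do it is a token bucket in front of the degraded instance, so the commodity box only takes the load it can cope with and sheds the rest. This is a minimal sketch, not a recommendation of any particular product: the class name, the rate and the burst size are all made up for illustration, and you would tune them to whatever the stand-in hardware can actually handle.

```python
import time


class TokenBucket:
    """Simple token-bucket throttle for a degraded service instance."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = float(rate_per_sec)   # tokens refilled per second
        self.capacity = float(capacity)   # maximum burst size
        self.clock = clock                # injectable so it can be tested
        self.tokens = float(capacity)     # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if a request may proceed, False if it should be shed."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical numbers: the stand-in box copes with about 5 requests a second.
throttle = TokenBucket(rate_per_sec=5, capacity=10)
```

Each incoming request calls throttle.allow() and gets rejected (or queued) when it returns False. Consumers see a slow or busy service rather than a dead one, which is the whole point of the degraded instance.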

When you do this however I'd also recommend that you run, at least once a month, a full unit/system test on the VM backup to make sure that it does actually work properly.
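The monthly check is easy to let slide, so it is worth automating the nagging. Below is a rough sketch under my own assumptions: the 31 day limit stands in for "at least once a month", and the checks are placeholders for whatever your real unit/system tests against the booted VM backup would be.

```python
from datetime import date, timedelta


def restore_test_overdue(last_tested, today, max_age_days=31):
    """True if the VM backup has not been test-restored within max_age_days."""
    return (today - last_tested) > timedelta(days=max_age_days)


def run_smoke_checks(checks):
    """Run each named check; return the names of those that failed or raised."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that blows up counts as a failure
        if not ok:
            failed.append(name)
    return failed


# Placeholder checks; real ones would call the service running on the VM.
example_checks = {
    "service_responds": lambda: True,
    "sample_query_returns_data": lambda: True,
}
failures = run_smoke_checks(example_checks)
```

If restore_test_overdue returns True, or failures is not empty, somebody needs to be told before the disaster rather than during it.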

Disaster Recovery is about planning for the recovery, not planning for backup.
