Sunday, January 16, 2011

Using REST and the cloud to meet unexpected and unusual demand

I’m writing this because it’s something I recommended to a client about three years ago, and I know they haven’t adopted it, because they’ve suffered a number of outages since then. The scenario is simple: you’ve designed your website to cope with a certain level of demand and you’ve given yourself about 50% leeway to cope with any spikes. Arguably this means roughly a third of your hardware and license spend is there purely as headroom, but realistically it’s a decent way of doing capacity planning without getting too complex.

Now comes the problem though. Every so often this business gets unexpected spikes. These spikes aren’t the result of increased volume through the standard transactions but of a peak on specific parts of the site, often new parts related to (for instance) sales or problem resolution. The challenge is that these spikes are anything from 300% to 1000% over the expected peak, and the site just can’t handle it.

So what is the solution? The answer is to use the power of HTTP, and in particular the power of the redirect. I’m saying this is REST, although it’s something I’d done before I knew about REST, but I’m not one to let a bit of reality get in the way of marketing ;) When I’d done it previously it was prior to cloud, but the architecture was basically the same.

First you split your infrastructure architecture into two parts:

  1. The redirecting part (hosted in the cloud, or at least on a separately scalable part of your infrastructure)
  2. The bit that does the work


The redirecting part just sends an HTTP redirect (code 307, so it isn’t cached) to the real site; so let’s say http://example.com redirects to http://example.com/home. It’s important to note here that this is the only page we redirect. It’s not the case that every page gets this treatment, just the main page, because when there is a mega-spike it tends to come in via the homepage.
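To make the mechanics concrete, here is a minimal sketch of the redirecting part in Python; the target URL and port are placeholders I’ve chosen for illustration, not details from any real setup.

# A minimal sketch of the redirecting part, using only the Python standard
# library. The target URL and port are placeholders for illustration.
from http.server import BaseHTTPRequestHandler, HTTPServer

HOME_TARGET = "http://example.com/home"   # where the bare domain normally points

class HomeRedirect(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/":
            # 307 Temporary Redirect: not cached by default, so the target
            # can be swapped the moment a spike hits.
            self.send_response(307)
            self.send_header("Location", HOME_TARGET)
        else:
            # Only the home page is redirected; deeper URIs are served by
            # the normal site and never touch this piece.
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), HomeRedirect).serve_forever()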

Now I’m normally one to warn about chatty interfaces, but the wonder of a redirect is that the user sees a URL flicker in their browser and then the normal page loads. There is certainly the overhead of a single extra call, but from experience this isn’t a big deal on modern sites where a page is made up of multiple fragments; the additional redirect doesn’t add a significant amount, and it’s only on the initial page load, which now takes two network hits rather than one. It’s an increase in latency for that homepage, but not much of an increase in overall load time.

Now let’s wander into the world of cloud. What does this get us, and why is it worth adding this overhead?

Well, when you have an extraordinary event you should really think about creating new pages for it rather than just tacking pages onto your normal site. If you are in a scenario where 70-98% of your visitors are looking at one specific piece of content then you are much better off thinking in terms of a microsite rather than adding it to your normal site.

All of the old URIs beyond the main page should still go to their old places, but the home page needs to be redirected to your new microsite. Now some people will be screaming “just use a load balancer”, and they have a bit of a point, but I’ve always been a fan of offloading processing onto the client, and that is exactly what the redirect does.

So the microsite you redirect to uses the same template as the home site in terms of CSS and key navigation, but it doesn’t include all of the dynamic bits and fragments that were on the old front page. It includes two things:

  1. The information directly related to the extraordinary event
  2. Links off to the normal site

So now our original redirection goes from http://example.com to http://example.com/event, and we scale the event part to the new demand. If it’s a truly extraordinary event then you are better off doing it as static pages and having people make the modifications by hand (even if updates only go out every 5 minutes, that is cost-wise a lot less than call centre staff). The point is simple: you are scaling the extraordinary using the cloud.
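If the event pages really are static, the serving side needs almost nothing. As a rough sketch (the directory name and port are my own assumptions), Python’s built-in server is enough to make the point that the microsite itself is very little machinery:

# Serve a directory of hand-edited static event pages. The directory name
# and port are assumptions for illustration; in practice this would sit on
# whatever cheap cloud instance or static hosting you can stand up quickly.
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

handler = partial(SimpleHTTPRequestHandler, directory="event-site")
ThreadingHTTPServer(("", 8000), handler).serve_forever()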

So have you spotted the big point here? It’s something you can do with a traditional infrastructure and then make the shift to cloud for what cloud is really good at: handling spikes. You don’t have to redesign your current site to scale dynamically; you just have to use a very simple policy and have a cloud solution that you can rapidly stand up in the event of a massive spike.

A couple of hints:

  1. Have the images for the spike site ready to go, and monitor at the redirect level so the spike protector kicks in automatically (see the sketch after this list)
  2. Have an automatic process to dump a holding page onto the spike protector that tells people more information is coming soon; they’ll tend to refresh that page rather than go to the rest of the site
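As an illustration of the first hint, the redirecting part could watch its own request rate and flip the 307 target to the pre-built microsite when a spike arrives. The threshold, window and URLs below are numbers I’ve made up for the sketch, not figures from any real deployment.

# A sketch of hint 1: count requests at the redirect level and, when the
# rate crosses a threshold, flip the 307 target from the normal home page
# to the pre-built spike microsite. Threshold, window and URLs are all
# made-up values for illustration.
import time
from collections import deque
from http.server import BaseHTTPRequestHandler, HTTPServer

NORMAL_TARGET = "http://example.com/home"
SPIKE_TARGET = "http://spike.example.com/event"   # pre-built static microsite
THRESHOLD = 200     # hits within the window that we treat as a spike
WINDOW = 10.0       # seconds

hits = deque()

def pick_target():
    now = time.monotonic()
    hits.append(now)
    # Drop hits that have fallen out of the window.
    while hits and now - hits[0] > WINDOW:
        hits.popleft()
    return SPIKE_TARGET if len(hits) > THRESHOLD else NORMAL_TARGET

class SpikeAwareRedirect(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/":
            self.send_response(307)
            self.send_header("Location", pick_target())
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    # Single-threaded on purpose: the shared hit counter then needs no locking.
    HTTPServer(("", 8080), SpikeAwareRedirect).serve_forever()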

You don’t need the normal commercial licenses, as you can do it via static uploads (the normal site can carry on doing its dynamic magic on your old infrastructure) or a temporary OSS solution.

I'm often confused as to why people try to scale to meet extraordinary demand on their normal architecture. People seem not to realise that most spikes aren’t the result of your core business getting 500% more popular overnight; they’re normally the result of a specific promotion or problem, and it’s that specific area which needs scaling. If it’s a promotion you need to scale the part of the site people are hitting for that promotion, and then look at either scaling the payment piece, putting in place a temporary process, or throttling the requests through that part of the process. If it’s an issue then treat the site like a news site and statically publish updates.
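For the throttling option mentioned above, a small token bucket in front of the constrained step is usually enough; this is a generic sketch with made-up rate and burst figures, not anything specific to the site in question.

# A hypothetical token-bucket throttle for the constrained step (e.g. the
# payment piece): let a steady rate through and tell the rest to retry.
# The rate and burst numbers are placeholders, not real capacity figures.
import threading
import time

class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec          # how many requests per second we refill
        self.capacity = burst             # how big a short burst we tolerate
        self.tokens = float(burst)
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def allow(self):
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True               # let the request through
            return False                  # caller should show a polite "busy, try again" page

# Usage: bucket = TokenBucket(rate_per_sec=50, burst=100); check bucket.allow()
# before passing a request into the constrained part of the process.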

So there you go: by using the power of a simple piece of HTTP, the redirect, you can take advantage of cloud quickly and effectively, and if the extraordinary event never comes it doesn’t cost you much, if anything.

So get on with the power of the redirect and link it to the power of the cloud, because that is when technical things are actually interesting: when they can simply be used to solve a problem cheaply that was previously too expensive to solve.


2 comments:

Anonymous said...

If you can (re)write the app to use REST (i.e. not legacy) then why not go further and create stateless application services, define them as autoscaling arrays (one tier/template per service array) and use global load balancing to distribute traffic across multi-cloud (e.g. Cloud.com private, bursting into public) for your DC hardware? Single deployment templates for all apps, patching and operations.

A surprising number of deployments achieve this every week :)
SP

Steve Jones said...

Because that is liable to be a rather large piece of work. I'm just talking about taking any old website and using a bit of HTTP to add a spike protector. This is very cheap, requires practically no re-engineering and protects you.

What you are talking about isn't a simple matter for most people to do. It's nice (and arguably can be done with or without REST), but often not worth the cost when you consider normal demand patterns.