Showing posts with label caching. Show all posts

Wednesday, January 29, 2014

There is no Big Data without Fast Data

Which came first, Big Data or Fast Data?  If you go by the hype, you'd think Hadoop and Big Data came first, with in-memory and fast coming after.  The reality, though, is the other way around, and it comes from a simple question:
Where do you think all that Big Data came from?
When you look around at the massive Big Data sources out there (Facebook, Twitter, sensor data, clickstream analysis, etc.), they don't create the data in massive systolic thumps.  Instead they create it in lots and lots of little bits, a constant stream of information.  In other words, it's fast data that creates big data.  The point is that historically, with these fast data sources, there was only one way to go: do it in real time and just drop most of the data.  Take a RADAR, for instance: it has a constant stream of analogue information coming in (unstructured data), which is processed in real time and converted into plots and tracks.  These are then passed on to a RADAR display, which shows them.

At no stage is this data 'at rest'; it's a continuous stream of information, with the processing done in real time.  Very complex analytics, in fact, done in real time.

So why so much more hype around Big Data?  Well, I've got a theory.  It's the sort of theory that explains why Oracle has TimesTen (an in-memory database type of thing) and Coherence (an in-memory database type of thing) and talks about them in two very different ways.  On the middleware side it's Coherence, and the talk is about distributed fast access to information, processing events and making decisions.  TimesTen sits in the Data camp, so it's about really fast analytics... what you did on disk before, you now do in memory.

The point is that these two worlds are collapsing, and the historical difference between a middleware-centric in-memory 'fast' processing solution and a new-generation in-memory analytical solution is going to go away.  This means that data guys have to get used to really fast.  What do I mean by that?

Well, I used to work in 'real real-time', which doesn't always mean 'super fast'; it just means 'within a defined time ... but that defined time is normally pretty damned fast'.  I've also worked in what people in standard business these days consider to be real-time: sub-microsecond response times.  But it isn't the same for data folks.  Sometimes real-time means 'in an hour', 'in 15 minutes' or 'in 30 seconds', but rarely does it mean 'before the transaction completes'.

Fast Data is what gives us Big Data, and in the end it's going to be the ability to handle both the new statistical analytics of Big and the real-time adaptation of Fast that will differentiate businesses.  This presents a new challenge to us in IT, as it means we need to break down the barriers between the data guys and the middleware guys, and we need new approaches to architecture that do not force a separation between the analytical 'Big' and the reactional 'Fast' worlds.

Saturday, October 18, 2008

Now we're caching on GAs

GAs = Google AppEngineS. Rubbish, I know, but what the hell. So I've just added Memcache support for the map, and now it's much, much quicker to get the map. The time taken to do these requests isn't very long at all (0.01 to do the mandelbrot, 0.001 to turn it into a PNG), but the server-side caching will speed up the request and lead to it hitting the CPU less often...

Which means that we do still have the option of making the map squares bigger now as it won't hit the CPU as often, which means we won't violate the quota as often.

So, in other words, performance tuning for the cloud is often about combining different strategies rather than finding one strategy that works. Making the squares bigger blew out the CPU quota, but if we combine it with the cache then this could reduce the number of times that it blows the quota and thus enable it to continue. This still isn't affecting the page view quota, however: that pesky ^R forces a refresh, and the 302 redirect also makes sure that it's still hitting the server, which is the root of the problem.


Friday, October 17, 2008

Caching strategies when redirect doesn't work

So the first challenge of cachability is solving the square problem. Simply put, the created squares need to be repeatable when you zoom in and out.

The calculation today is simple: find the click point, then find the point halfway between that and the bottom left to give you the new bottom left.

The problem is that the result is unique to each click. So what we need to find is the "right" bottom left (one from a fixed, repeatable set) that is nearest to the new bottom left.
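As a rough sketch of what "nearest right bottom left" could look like: compute the halfway point as described above, then snap it to a fixed grid. The grid spacing (`cell`) and the function names are assumptions for illustration; in practice the spacing would presumably depend on the zoom level.

```python
import math


def zoomed_bottom_left(click, bottom_left):
    """The calculation described above: halfway between click and bottom left."""
    return ((click[0] + bottom_left[0]) / 2.0,
            (click[1] + bottom_left[1]) / 2.0)


def snap_to_grid(point, cell):
    """Snap each coordinate to the nearest multiple of `cell`.

    `cell` is a hypothetical grid spacing, not a value from the real app.
    """
    return (math.floor(point[0] / cell + 0.5) * cell,
            math.floor(point[1] / cell + 0.5) * cell)


old_bottom_left = (-2.0, -1.5)
cell = 0.125  # assumed spacing; a power of two keeps the float maths exact

# Two slightly different clicks give two different raw bottom lefts...
raw_a = zoomed_bottom_left((0.30, 0.41), old_bottom_left)
raw_b = zoomed_bottom_left((0.33, 0.44), old_bottom_left)

# ...but both snap to the same repeatable bottom left.
snapped_a = snap_to_grid(raw_a, cell)
snapped_b = snap_to_grid(raw_b, cell)
```

Here both clicks end up at `(-0.875, -0.5)`, so both requests would ask for the same resource.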


Now one way to do this would be to do a 301 redirect for all requests to the "right" position. This is a perfectly valid way of getting people to use the right resource and of limiting the total space of the resource set. What you are saying in effect is that a request for resource X is in fact a request for resource Y and you should look at the new place to get it. This works fine in this scenario but for one minor problem.

The challenge we have is page views, and a 301 redirect counts as a page view, meaning that we'd be doubling the number of page views required to get to a given resource. Valid though this strategy is, therefore, it isn't the one that is going to work for this cloud application. We need something that will minimise the page views.

But as this is a test.... let's do it anyway!
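To make the page view cost concrete, here is a toy model of the redirect approach. It is only a sketch under assumptions: `CELL`, `canonical`, `handle` and `fetch` are hypothetical names, and the "server" is just a function returning a status code and a redirect target.

```python
import math

CELL = 0.125  # hypothetical grid spacing for the "right" positions


def canonical(x, y):
    """Nearest 'right' position on the assumed grid."""
    return (math.floor(x / CELL + 0.5) * CELL,
            math.floor(y / CELL + 0.5) * CELL)


def handle(x, y):
    """Toy server: 301 to the canonical position, otherwise 200."""
    target = canonical(x, y)
    if (x, y) != target:
        return 301, target
    return 200, None


def fetch(x, y):
    """Toy client: follow redirects, return how many requests the server saw."""
    requests = 0
    while True:
        requests += 1
        status, location = handle(x, y)
        if status == 200:
            return requests
        x, y = location


off_grid_views = fetch(-0.85, -0.545)  # redirected once, then served
on_grid_views = fetch(-0.875, -0.5)    # already canonical
```

An off-grid request costs two hits on the server, an on-grid one costs one, which is exactly the doubling described above.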


Thursday, October 16, 2008

HTTP Cache and the need for cachability

One of the often-cited advantages of REST implemented in HTTP is the easy access to caching, which can improve performance and reduce the load on the servers. Now, with Mandel Map regularly breaking the App Engine quota, the obvious solution is to turn on caching, which I have just done by adding the line

self.response.headers['Cache-Control'] = 'max-age=2592000'


Which basically means "don't come and ask me again for 30 days" (2592000 seconds). Now part of the problem is that hitting reload in a browser forces it to go back to the server anyway, but there is a second and more important problem that shows up as you mess around with the Map. With the map, double click to zoom in... then hold shift and zoom out again.



Notice how it still re-draws when it zooms back out again? The reason for this is that the zoom in calculation just works around a given point and sets a new bottom left of the overall grid relative to that point. This means that every zoom in and out is pretty much unique (you've got a 1 in 2916 chance of getting back to the cached zoom out version after you have zoomed in).

So while the next time you see the map it will appear much quicker this doesn't actually help you in terms of it working quicker as it zooms in and out or in terms of reducing the server load for people who are mucking about with the Map on a regular basis. The challenge therefore is designing the application for cachability rather than just turning on HTTP Caching and expecting everything to magically work better.

The same principle applies when turning on server-side caching (like memcache in Google App Engine). If every user gets a unique set of results then the caching will just burn memory rather than giving you better performance. Indeed, performance will get worse, as you will have a massively populated cache but practically no successful hits from requests.
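A quick way to see this is to count hits for the same stream of requests under two keying schemes. This is a toy simulation, not a measurement of the real app: the request coordinates, the block size of 100 and the `count_hits` helper are all made up for illustration.

```python
def count_hits(keys):
    """Run keys through an unbounded dict cache; return (hits, entries stored)."""
    cache = {}
    hits = 0
    for key in keys:
        if key in cache:
            hits += 1
        else:
            cache[key] = True  # stand-in for the cached result
    return hits, len(cache)


# 1000 hypothetical requests, each for a slightly different position.
requests = [(i, 2 * i) for i in range(1000)]

# Scheme 1: key on the exact position, so every request is unique.
unique_hits, unique_entries = count_hits(requests)

# Scheme 2: key on the containing block (assumed block size of 100).
block_keys = [(x // 100, y // 100) for x, y in requests]
block_hits, block_entries = count_hits(block_keys)
```

With exact keys the cache ends up holding 1000 entries and serving 0 hits; with block keys it holds 20 entries and serves 980 hits.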

With this application it means that rather than simply doing a basic calculation that forms the basis for the zoom, it needs to do a calculation that forms a repeatable basis for the zoom. Effectively those 54x54 blocks need to be the same 54x54 blocks at a given zoom level for every request. This will make the "click" a bit less accurate (it's not spot on now anyway) but will lead to an application which is much more effectively cachable than the current solution.

So HTTP Cache on its own doesn't make your application perform any better for end users or reduce the load on your servers. You have to design your application so the elements being returned are cachable in a way that will deliver performance improvements. For some applications it's trivial; for others (like the Mandelbrot Map) it's a little bit harder.

