Thursday, July 11, 2013

Google and Yahoo have it easy or why Hadoop is only part of the story

We hear lots and lots of hype at the moment around Hadoop, and it is a great technology approach, but there is also lots of talk about how this approach will win because Google and Yahoo are using it to manage their scale and thus this shows that their approach is going to win in traditional enterprises and other big data areas.

Lets be clear, I'm not saying Hadoop isn't a good answer for managing large amounts of information what I'm saying is that Hadoop is only part of the story and its arguably not the most important.  I'm also saying that Google and Yahoo have a really simple problem they are attempting to fix, in comparison with large scale enterprises and the industrial internet they've got it easy.  Sure they've got volume but what is the challenge?
  1. Gazillions of URIs and unstructured web pages
  2. Performant search
  3. Serving ads related to that search
I'm putting aside the gmails and google apps for a moment as those aren't part of this Hadoop piece, but I'd argue are, like Amazon, more appropriate reference points for enterprises looking at large scale.

So why do Google and Yahoo have it easy?

First off while its an unstructured data challenge this means that data quality isn't a challege they have to overcome.  If google serve you up a page when you search for 'Steve Jones' and you see the biology prof, sex pistols guitarist and Welsh model and you are looking for another Steve Jones you don't curse google because its the wrong person, you just start adding new terms to try and find the right one,  if Google slaps the wrong google+ profile on the results you just sigh and move on.  Google don't clear up the content.

Not worrying about data quality is just part of the not having to worry about master data and reference data challenge.  Google and Yahoo don't do any master data work or reference data work, they can't as their data sets are external.  This means they don't have to set up governance boards or operational process changes to take control of data, they don't need to get multiple stakeholders to agree on definitions and no regulator will call them to account if a search result isn't quite right.

So the first reason they have it easy is that they don't need to get people to agree.

The next reason is something that Google and Yahoo do know something about and that is performance, but here I'm not talking about search results I'm talking about transactions, the need to have a confirmed result.  Boring old things like atomic transactions and importantly the need to get back in a fast time.  Now clearly Google and Yahoo can do the speed part, but they have a wonderful advantage of not having to worry about the whole transactions stuff, sure they do email at a great scale and they can custom develop applications to within an inch of their life...  but that isn't the same as getting Siebel, SAP and old Baan system and three different SOA and EAI technologies working together.  Again there is the governance challenge and there is the 'not invented here' challenge that you can't ignore.  If SAP doesn't work the way you want... well you could waste time customising it but you are better off working to what SAP does instead.

The final reason that Google and Yahoo have it easy is talent and support.  Hadoop is great, but as I've said before companies have a Hadoop Hump problem and this is completely different to the talent engines at Google and Yahoo.  Both pride themselves on the talent they hire and that is great, but they also pay top whack and have interesting work to keep people engaged.  Enterprises just don't have that luxury, or more simply they just don't have the value to hire stellar developers and then also have those stellar developers work in support.  When you are continually tuning and improving apps like Google that makes sense, when you tend to deliver into production and hand over to a support team it makes much less sense.

So there are certainly things that enterprises can learn from Google and Yahoo but it isn't true to say that all enterprises will go that way, enterprises have different challenges and some of them are arguably significantly harder than system performance as they impact culture.  So Hadoop is great, its a good tool but just because Google and Yahoo use it doesn't mean enterprises will adopt it in the same way or indeed that the model taken with Google and Yahoo is appropriate.  We've already seen NoSQL become more SQL in the Hadoop world and we'll continue to see more and more shifts away from the 'pure' Hadoop Map Reduce vision as Enterprises leverage the economies of scale but do so to solve a different set of challenges and crucially a different culture and value network.

Google and Yahoo are green field companies, built from the ground up by IT folks.  They have it easy in comparison to the folks trying to marshall 20 business divisions each with their own sales and manufacturing folks and 40 ERPs and 100+ other systems badly connected around the world.



1 comment:

Unknown said...

Sorry, late to read this, but, couldn't agree more. Specifically around the "transaction" part of this. If you do a search on google they tell you they found "about 100,000 results". Let me know if you know of any sales folks that would be okay if I processed "most" of the commission dollars they were due.