Friday, January 31, 2014

Java and Analytics the next frontier

I've been pretty vocal about Java going down the wrong path and my view that what Java should do is start having a 'core' which is just the real basics of the VM and the language and then a few profiles which specify what needs to be loaded, with the rest coming in on-demand based on the requirements of a given project.  The old argument that 'it needs to have everything for the browser/desktop/etc' is just rubbish these days with mobility, HTML 5 and generally lots of other ways to get that work done.

There is one area where Java really needs to step up to the mark, and that is Analytics.  I remember when Hibernate became the Java de facto standard for database access.  Brilliant thing it was too.

Now however plain old data access is just one of the problems.  Languages like R and technologies like MADlib are beginning to help move way beyond database access and into complex analytics.  Real-time analytics is another area.  Almost all of these approaches rely on Java in some way but it's hard to see how the current approach of Java SE 8 is actually supporting this new generation of challenges.

Java has a real opportunity to leverage its massive install base to offer real new opportunities through the growth of analytics.

Only if its leadership can start leading.

Wednesday, January 29, 2014

There is no Big Data without Fast Data

Which came first, Big Data or Fast Data?  If you go from a hype perspective you'd think Hadoop and Big Data came first, with in-memory and fast coming after.  The reality though is the other way around and comes from a simple question:
Where do you think all that Big Data came from?
When you look around at the massive Big Data sources out there, Facebook, Twitter, sensor data, clickstream analysis etc, they don't create the data in massive systolic thumps.  They instead create the data in lots and lots of little bits, a constant stream of information.  In other words it's fast data that creates big data.  The point is that historically with these fast data sources there was only one way to go: do it in real-time and just drop most of the data.  So take a RADAR for instance, it has a constant stream of analogue information streaming in (unstructured data) which is then processed, in real-time, and converted into plots and tracks.  These are then passed on to a RADAR display which shows them.

At no stage is this data 'at rest', it's a continuous stream of information with the processing being done in real-time.  Very complex analytics in fact being done in real time.
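As a very rough sketch of that 'process it as it arrives and drop the rest' idea, assuming a hypothetical stream of signal strengths between 0 and 1 and a made-up threshold (a real RADAR processor is vastly more complex), it might look something like this:

```python
def detect_plots(samples, threshold=0.8):
    """Yield (index, value) for samples at or above the threshold.

    Everything below the threshold is dropped as it streams past,
    so nothing is ever held 'at rest'. The threshold is an assumption
    made purely for illustration.
    """
    for index, value in enumerate(samples):
        if value >= threshold:
            yield index, value


# Usage: works on any iterable, including an endless generator.
incoming = iter([0.1, 0.9, 0.3, 0.85])
plots = list(detect_plots(incoming))
```

The generator never stores the full stream, only the handful of values that matter, which is the historical trade-off described above.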

So why so much more hype around Big Data?  Well I've got a theory.  It's the sort of theory that explains why Oracle has TimesTen (an in-memory database type of thing) and Coherence (an in-memory database type of thing) and talks about them in two very different ways.  On the middleware side it's Coherence and it talks about distributed fast access to information and processing those events and making decisions.  TimesTen sits in the Data camp so it's about really fast analytics... what you did on disk before you now do in memory.

The point is that these two worlds are collapsing and the historical difference between a middleware centric in-memory 'fast' processing solution and a new generation in-memory analytical solution is going to go away.  This means that data guys have to get used to really fast.  What do I mean by that?

Well I used to work in 'real real-time', which doesn't always mean 'super fast', it just means 'within a defined time ... but that defined time is normally pretty damned fast'.  I've also worked in what people consider these days in standard business to be real-time - sub-microsecond response times.  But that isn't the same for data folks, sometimes real-time means 'in an hour', 'in 15 minutes', 'in 30 seconds' but rarely does it mean 'before the transaction completes'.

Fast Data is what gives us Big Data, and in the end it's going to be the ability to handle both the new statistical analytics of Big and the real-time adaptation of Fast that will differentiate businesses.  This presents a new challenge to us in IT as it means we need to break down the barriers between the data guys and middleware guys and we need new approaches to architecture that do not force a separation between the analytical 'Big' and the reactional 'Fast' worlds.

Tuesday, January 28, 2014

EDW in the Library with Single Canonical Form - get a clue about killing the business

The game Cluedo (or just plain Clue in North America) is about discovering which person committed the murder, in what room using what.  What is amazing is that in IT we have the easiest game of Cluedo going and yet over and over again we murder the poor unfortunate business in the same way, then stand back and gasp 'I didn't know that would kill them'.

I talk about the EDW, the IT department's hammer to which every question of 'I don't have the information I need' looks like a nail.  The EDW is the murderer of information agility, the constrainer of local requirements and the heavyweight bully of the data landscape.  But its weapon of choice is more blunt than the lead pipe - the Single Canonical Form.  The creation of which requires compromise, limitation and above all a bloody indifference to the actual local needs of business users.

An EDW is normally actually only trying to answer a question at a high level of corporate consistency, so financial roll-up, a bit of a horizontal view around customer and maybe some views around procurement... although the latter is normally better done on its own.  The point is that it really isn't Enterprise beyond the fact that Enterprise is a lie.  It can be really good at doing that top level view, of creating a corporate data mart but the effort that it requires to do so often stifles the agility in local business units and chokes the throat of local information initiatives.

The good news is that IT didn't do this for completely bloody minded reasons, it did it because IT had constraints, data storage costs first amongst them and IT had a hard wall between the world of the operational transaction and the world of post-transactional analytics.  So the EDW worked in that limited space and with those restrictions.

The challenge now is that the restrictions have gone, storage costs are now amazingly low when looking at Hadoop and the wall between operations and analytics has gone, with operations being the primary place that analytics is now able to deliver insight at the point of action.  This was the thinking that I put into the Business Data Lake, an approach that matches the business environment and leverages the thinking behind Business SOA applied to data.

So let's put down the EDW, let's walk away from the single canonical form and get a clue.  It's going to take time for IT departments, as well as analysts, vendors and consultants, to be weaned off the EDW drug but I firmly feel that in 5 years' time we will be looking at a world where the IT department no longer says:

"You need an EDW, let's design the schema, should be ready early next year"

and instead says

"Sure, I'll knock up a solution in the BDL for you, be ready on Tuesday"

The customer is always right, and our customer in IT is telling us that it's the local view that counts... so can we stop battering them with a global view that doesn't fit their local problem?

Monday, January 27, 2014

Six things to make your Big Data project succeed

So I wrote about why your Hadoop project will fail, so I think it's only right that I should follow up with some things that you can do to actually make the Big Data project you take on succeed.  The first thing you need to do is stop trying to make 'Big Data' succeed and instead start focusing on how you educate the business on the value of information and then work out how to deliver new value... that just so happens to be delivered with Big or Fast Data technologies.

Don't try and change the business
The first thing is to stop trying to see technology as being a goal in itself and complaining when the business doesn't recognise that your 'magic' technology is the most important thing in the world.  So find out how the business works, look at how people actually work day to day and see how you can improve that.

Sounds simple?  Well the good news is that it is, but it means you need to forget about technology until you know how the business works.

Explain why Information Matters
The next bit after you've understood the business better is to explain to them why they should care about information.  Digitization is the buzzword you need to learn, folks like MIT Sloan (Customer Facing Digitization), Harvard Business Review (are you ready for Digitization?) and Davos (Digitization and Growth) are saying that this is the way forwards.  And what is Digitization? In the raw it's just about converting stuff into digital formats, but the reality is that what it's about is having an information and analytical driven business.  The prediction of all the business schools is that companies that do this will outperform their competition.

This is an important step, it's about shifting Information from being a technology and IT conversation towards the business genuinely seeing information as a critical part of business growth.  It's also about you as an IT professional learning how to communicate technology changes in the language the business wants to hear.  They don't want to hear 'Hadoop', they want to hear 'Digitization'.

Find a problem that needs a new solution
The next key thing is finding a problem that isn't well served by your current environments.  If you could solve a problem by just having a new report on an EDW then it really doesn't prove anything to use new technologies to do that in a more time consuming way.  The good news is there are probably loads of problems out there not well served by your current environments.  From volume challenges around sensor data and clickstream through to real-time analytics, predictive analytics, data discovery and ad-hoc information solutions, there are lots of business problems.

Find that problem, find the person or group in the business that cares about having that problem solved and be clear about what the benefits of solving that problem are.

Get people with the 'scars and ribbons'
What do I do when I work with a new technology?  Two things, firstly I get some training and from that build something for myself that helps me learn.  If I'm doing it at work in building a business I then go and find someone who has already done this before and hire them or transfer them into my team.

Bill Joy once said that the smartest people weren't at Sun so they should learn from outside.  I'm not Bill Joy, you aren't Bill Joy, so we can certainly learn from outside.  Whether this means going to a consultancy who has done it before or hiring people in who have done it before doesn't really matter.  The point is that unless you really are revolutionising the IT market you are doing something that someone has done before, so your best bet is to learn from their example.

It stuns me how many people embark on complex IT projects having never used the technology before and are then surprised that the project fails.  Get people with the 'scars and ribbons' who can tell you what not to do which is massively more important than what to do.

Throw out some of your old Data Warehouse thinking
The next bit is something that you need to forget, a cherished truth that no longer holds.  Get rid of the notion that your job as a data architect is to dictate a single view to the business.  Get rid of the thought that the cherished ETL process must come first.  Land the data in Hadoop, all the data you can, don't worry if you think you might not use it, you are landing it in Hadoop and then turning it into the views or analytics.  There is no benefit in not taking everything across and lots of benefits for doing so.

In other words you've got the problem, that is the goal, now go and collect all the data but don't worry about the full A-Z straight away by defining Z and working backwards.  Understand the data areas, drop that into Hadoop and then worry about what the right A-Z is today, knowing that if it's a different route tomorrow you've got the data ready to go without updating the integration.

Then if you have another problem that needs access to the same data don't automatically try and make one solution do two things.  It's perfectly ok to create a second solution to solve that problem on top of Hadoop.  You don't need everyone to agree on a single schema, you just need to be able to solve the problem.  The point here is that to get different end-results you need to start thinking differently.
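To make the 'land it once, solve two problems' idea a bit more concrete, here is a minimal sketch in plain Python rather than Hadoop itself.  The raw events, field names and both view functions are invented for illustration; the point is only that each solution applies its own shape to the same landed data without anyone agreeing a shared schema first.

```python
from collections import defaultdict

# Hypothetical raw events, landed exactly as they arrived (no upfront schema).
RAW_EVENTS = [
    {"customer": "acme", "region": "EU", "amount": 120, "channel": "web"},
    {"customer": "acme", "region": "EU", "amount": 80, "channel": "store"},
    {"customer": "zenith", "region": "US", "amount": 200, "channel": "web"},
]


def sales_by_region(events):
    """First solution: a finance-style roll-up of amount per region."""
    totals = defaultdict(int)
    for event in events:
        totals[event["region"]] += event["amount"]
    return dict(totals)


def web_customers(events):
    """Second solution: marketing only cares who bought via the web."""
    return sorted({e["customer"] for e in events if e["channel"] == "web"})


region_view = sales_by_region(RAW_EVENTS)
marketing_view = web_customers(RAW_EVENTS)
```

Neither view knows or cares about the other, and adding a third tomorrow means writing another function, not changing the integration.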

Don't get hung up on NoSQL, don't get hung up on Hadoop
The final thing is the dirty secret of the Hadoop world that has rapidly become the bold proclamation - NoSQL really isn't for everyone and SQL is perfectly good for lots of cases.  Hive, Impala, HAWQ are all addressing exactly that challenge, and you shouldn't limit yourself to Hadoop friendly approaches, if the right way is to push it to your existing data warehouse from Hadoop... do it.  If the requirement is to have some fast data processing then do that.

The point here is your goal is to show how the new technologies are more flexible and better able to adapt to the business and how the new IT approach is to match what the business wants not to try and force an EDW onto it every time.

The point here is that making your Big Data program succeed is actually about having the business care about the value that information brings and then fitting your approach to match what the business wants to achieve.

The business is your customer, time to do what they want, not force an EDW down their throats.

Thursday, January 16, 2014

The People's Democratic Republic of IT

IT is a communist state in many organisations, one that believes in rigid adherence to inflexible approaches despite clear indications that they inhibit growth and a central approach to planning that Mao and Stalin would have thought is taking things a little too far. This really doesn't make sense in the capitalistic world of business and the counter-revolution is well under way.

I don't think the word 'Enterprise' is really worth anything in terms of something being a single standard Enterprise approach.  Whether that is Enterprise Resource Planning, Enterprise Data Warehouse, Enterprise Service Bus or Enterprise Architecture you either end up with multiple solutions or a central solution that isn't used to the level it was envisaged so you get lots of solutions on the side.

Part of this is because in the capitalistic world of business it appears that communist style central planning has been, and remains, the normal approach.  This People's Democratic Republic of IT approach has two key parts to it
  1. IT knows best and will give everyone 'each according to their needs' and decide what those needs are.
  2. Cultish following of other communist plans, independent of whether the users want them.

The world of integration is a great example of the latter.  Do you know how much the business cares about whether you integrate two systems using REST, SOAP, sockets or flying monkeys?  ZERO.  Hell probably even less than zero in that they have an active disinterest in it.  Yet in IT we don't take this as a guidance of 'it's not important, let's commoditise the fuck out of it'.  Nope we continue to 'innovate' where it really doesn't matter and we do so because a whole heap of hype tells us to... business hype?  Of course not, it's hype from people who think they've discovered the universal hammer that turns everything into a nail.

On the former it's the realm of 'Enterprise Architecture' and EDWs that really underline just how much IT often resembles the politburo.  Here groups of worthy individuals set about on the business equivalent of the Cultural Revolution or Stalin's grand plan for agriculture.  They just know that if everyone would just work in the same way then everything would be so much better.  So off they trot pushing a single solution and historically this was pushed all the way through to production and the business went:
"Well it's not what I wanted but it's a bit less shit than what I've got"
So IT created grand strategic plans (and I've said before there is no such thing as IT strategy) often in areas that the business really didn't care about and off the business went and started using Dropbox and Amazon.

In effect the Shadow IT efforts of the business are analogous to the black market economies that often thrived in communist countries in the 80s.  Getting on doing what they need to do and being a lot more efficient than the state in doing it.  What we are seeing today is that as budgets shift more and more towards the business the shadow IT market is getting bigger and bigger and the central planning has suddenly hit an issue.
The business understands technology.
Maybe not in the depth that IT does, but what the business understands is a bit more valuable.
They understand how to focus on outcomes that add value, not technology hype.
So now as the Enterprise Architect says "you cannot do that, it is against our policy" the business says "stuff that for a game of soldiers, your policy doesn't work for us."  The business is having its Berlin Wall moment, and while the IT communist state, the People's Democratic Republic of IT (because communist states love claiming they are democratic) might hold on for a while the reality is that the world is beginning to come crashing down.

It's time for IT to embrace capitalism, embrace value over technology and outcomes over acronyms.

Tuesday, January 07, 2014

How integration guys created a data security nightmare

There has been a policy in integration that has stored up a really great challenge of data security, and by great I don't mean 'fantastic' I mean 'aw crap'.  It's a policy that was done for the best of reasons and one that really will in future represent a growing challenge to Big Data and federated information.

The policy can be described as this:
Users authenticate with Apps, Apps authenticate with the database and Apps authenticate with the ESB/EAI/Integration
What this means is that Users don't authenticate against any of the federated data.  This is normally glossed over by saying that 'the source application is responsible for filtering' but the reality is that applications rarely do this terribly well.  Put it this way, most of the time when a front end banking system accesses your account information the only reason it's getting your account data is because of the account ID it has.  If for some reason it had the wrong account ID stored then it would display something else, even though that information isn't yours.

This approach has tended to work for operational systems however because they work in linear ways on data sets, you know what you are showing and you pull out what you want to show.  The trouble is that as these federated sets are shifted into next generation Big Data solutions or accessed in a federated query, the security model completely breaks down because application to application security doesn't recognise the individual actually making the request.

So now we have a world where the data sources don't do data security and privacy 'by design' they do it 'by functionality'.  So the reason you get your account information is because of that magic ID, but there is nothing stopping a piece of code saying 'return account information from account ID, account ID +1 and account ID +3'.  It's just practice that stops that rather than a fundamental information security approach.
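A minimal sketch of the difference, assuming a hypothetical in-memory account store with an owner recorded against each account (none of these names come from a real banking system): the first function is security 'by functionality', the second is closer to security 'by design'.

```python
# Hypothetical data: account records and which user owns each one.
ACCOUNTS = {
    1001: {"owner": "alice", "balance": 250},
    1002: {"owner": "bob", "balance": 900},
}


def get_account_by_functionality(account_id):
    """App-to-app style: the calling app is trusted, so any valid ID returns data."""
    return ACCOUNTS[account_id]


def get_account_by_design(user, account_id):
    """User-level style: the data layer checks who is actually asking."""
    account = ACCOUNTS[account_id]
    if account["owner"] != user:
        raise PermissionError(f"{user} may not read account {account_id}")
    return account
```

With the first function, a caller holding Alice's ID can simply ask for ID + 1 and get Bob's data back.  With the second, the same request fails because the individual making it is part of the check, which is exactly what application to application authentication leaves out.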

There are mechanisms that can help with this but historically they've been more pain than they're worth.  It's going to be interesting seeing how the next generation of joined up analytics and operational IT estates will retrofit user and role level analytical security into a world of application to application authentication.

Monday, January 06, 2014

Six reasons your Big Data Hadoop project will fail in 2014

Ok so Hadoop is the bomb, Hadoop is the schizzle, Hadoop is here to solve world hunger and all problems.  Now I've talked before about some of the challenges around Hadoop for enterprises but here are six reasons that Information Week is right when it says that Hadoop projects are going to fail more often than not.

1. Hadoop is a Java thing not a BI thing
The first is the most important challenge.  I'm a Java guy, I'm a Java guy who thinks that Java has been driven off a cliff by its leadership in the last 8 years but it's still one of the best platforms out there.  However the problems that Hadoop is trying to address are analytics problems, BI problems.  Put briefly BI guys don't like Java guys and Java guys don't like BI guys.  For Java guys Hadoop is yet more proof that they can do everything, but BI guys know that custom build isn't an efficient route to deliver all of those BI requirements.

On top of that the business folks know SQL, often they really know SQL, SQL is the official language of business and data.  So a straight 'No-SQL' approach is doomed to fail as you are speaking French to the British.  2014 will be the year when SQL on Hadoop becomes the norm but you are still going to need your Java and BI guys to get along, and you are going to have to recognise that SQL beats No-SQL.

2. You decide to roll your own
Hadoop is open source, all you have to do is download it, install it and off you go right?  There are so many cases of people not doing that right that there is an actual page explaining why they won't accept those as bugs.  Hadoop is a bugger to install, it requires you to really understand how distributed computing works, and guess what?  You thought you did but it turns out you really didn't.  Distributed computing and multi-threaded computing are hard.

There are three companies you need to talk to: Pivotal, Cloudera and Hortonworks.  And how easy can they make it?  Well Pivotal have an easy Pivotal HD Hadoop Virtual Machine to get you started and even claim that they can get you running a Hadoop cluster in 45 minutes.

3. You are building a technical proof of concept... why?
One reason that your efforts will fail is that you are doing a 'technical proof of concept' at the end of which you will amazingly find that something used in some of the biggest analytics challenges on planet earth at the likes of Yahoo fits your much, much smaller challenge.  Well done, you've spent money proving the obvious.

Now what?  How about solving an actual business problem?  Actually why didn't you start by solving an actual business problem as a way to see how it would work for what the business faces?  Technical proofs of concept are pointless, you need to demonstrate to the business how this new technology solves their problems in a better (cheaper, faster, etc) way.

4. You didn't understand what Hadoop was bad at
Hadoop isn't brilliant at everything analytical... shocking eh?  So that complex analytics you want to do which is effectively a complex 25 table join and then do the analytics... yeah that really isn't going to work too well.  Those bits where you said that you could do that key business use case faster and cheaper and then it took 2 days to run?

Hadoop is good at some things, but it's not good at everything.  That is why folks are investing in SQL technologies on top of Hadoop, such as Pivotal's HAWQ or Cloudera's Impala, with Pivotal already showing how the bridge between traditional MPP and Hadoop is going to be made.

5. You didn't understand that it's part of the puzzle
One of the big reasons that Hadoop pieces fail to really deliver is that they are isolated silos, they might even be doing some good analytics but people can't see those analytics where they care about them.  Sure you've put up some nice web-pages for people but they don't use that in their daily lives.  They want to see the information pushed into the Data Warehouse so they can see it in their reports, they want it pushed to the ERP so they can make better decisions... they might want it in many many places but you've left it in the one place that they don't care about it.

When looking at the future of your information landscape you need to remember that Hadoop and NoSQL are just a new tool, a good new tool and one that has a critical part to play but it's just one new tool in your toolbox.

6. You didn't change
The biggest reason that your Hadoop project will fail however is that you've not changed some of your basic assumptions and looked at how Hadoop enables you to do things differently.  So you are still doing ETL to transform into some idealised schema which is based on a point in time view of what is required.  You are doing that into a Hadoop cluster which couldn't care less about redundant or unused data and where the costs of that are significantly lower than doing another set of ETL development.

You've carried on thinking about grand enterprise solutions to which everyone will come and be beholden to your technical genius.

What you've not done is sit back and think 'the current way sucks for the business, can I change that?' because if you had you'd have realised that using Hadoop as a Data substrate/lake layer makes more sense than ETL and you'd have realised that it's actually local solutions that get used the most, not corporate ones.

Your Hadoop project will fail because of you
The main reason Hadoop projects will fail is because you approach using a new technology with an old mindset, you'll try and build a traditional BI solution in a traditional BI way and you'll not understand that Java doesn't work like that, you'll not understand how MapReduce is different to SQL and you'll plough on regardless and blame the technology.

Guess what though?  The technology works at massive scale, much, much bigger than anything you've ever deployed.  It's not the technology, it's you.

So what to do?
.... I think I'll leave that for another post