
Tuesday, January 20, 2015

Big Data and the importance of Meta-Data

Data isn't really respected in businesses.  You can see that because, unlike other corporate assets, there is rarely a decent corporate catalog that shows what exists and who has it.  In the vast majority of companies there is more effort and automation put into tracking laptops than there is into cataloging and curating information.

Historically we've sort of been able to get away with this because information has resided in disparate systems and even those which join it together, an EDW for instance, have only had a limited number of sources and have viewed the information only in a single way (the final schema).  So basically we've relied on local knowledge of the information to get by.  This really doesn't work in a Big Data world.

The whole point in a Big Data world is having access to everything, being able to combine information from multiple places within a single Business Data Lake so you can allow the business to create their own views.

Quite simply, without Meta-Data you are not giving them any sort of map to find the information they need or to help them understand the security required.  Meta-Data needs to be a day one consideration on a Big Data program; by the time you've got a few dozen sources imported it's going to be a pain going back and adding the information.  This also means the tool used to search the Meta-Data is going to be important.

In a Big Data world Meta-Data is crucial to make the Data Lake business friendly and essential in ensuring the data can be secured.  Let's be clear here: HCatalog does matter, but it's not sufficient.  You can do a lot with HCatalog, but that is only the start, because you've got to look at where information comes from, what its security policy is and where you've distilled that information to.  So it's not just about what is in the HDFS repository, it's about what you've distilled into SQL or Data Science views, and it's about how the business can access that information, not just "you can find it here in HDFS".

This is what Gartner were talking about in the Data Lake Fallacy, but as I've written elsewhere that sort of missed the point: HDFS isn't the only part of a data lake, and EDW approaches only solve one set of problems, not the broader challenge of Big Data.

Meta-Data tools are out there, and you've probably not really looked at them, but here is what you need to test (not a complete list, but these for me are the must-have requirements; a sketch of the sort of catalog record such a tool needs to hold follows the list).
  1. Lineage from source - can it automatically link to the loading processes to say where information came from?
  2. Search - Can I search to find the information I want?  Can a non-technical user search?
  3. Multiple destinations - can it support HDFS, SQL and analytical destinations?
  4. Lineage to destination - can it link to the distillation process and automatically provide lineage to destination?
  5. Business View - can I model the business context of the information (Business Service Architecture style)?
  6. My own attributes - can I extend the Meta-data model with my own views on what is required?
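To make that concrete, here is a minimal sketch of the kind of record a Meta-Data catalog needs to hold for each data set in the lake.  This is purely illustrative; the field names and the little search function are my own inventions rather than any particular product's model, but they map onto the six requirements above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CatalogEntry:
    """Illustrative catalog record for one data set in the lake."""
    name: str                     # business-friendly name, e.g. "Procurement Invoices"
    business_service: str         # the business service that owns and blesses it (requirement 5)
    source_systems: List[str]     # lineage from source (requirement 1)
    destinations: List[str]       # HDFS paths, SQL schemas, analytical views (requirements 3 and 4)
    security_classification: str  # e.g. Special / High / Medium / Low / Public
    custom_attributes: Dict[str, str] = field(default_factory=dict)  # your own attributes (requirement 6)

def search(catalog: List[CatalogEntry], term: str) -> List[CatalogEntry]:
    """The sort of simple keyword search a non-technical user could sit behind (requirement 2)."""
    term = term.lower()
    return [entry for entry in catalog
            if term in entry.name.lower() or term in entry.business_service.lower()]
```

The code isn't the point; the point is that lineage, destinations, security and business context all live in one searchable record rather than being scattered across loading scripts and tribal knowledge.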
The point about modelling in a business context is really important.  Knowing information came from an SAP system is technically interesting, but knowing it's Procurement data that is blessed and created by the procurement department (as opposed to being a secondary source) is significantly more valuable.  If you can't present the meta-data in a business structure the business users aren't going to be able to use it; it's just another IT-centric tool.

The advantage of Business Service structured meta-data is that it matches up to how you evolve and manage your transactional systems as well.


Thursday, January 15, 2015

Securing Big Data - Part 7 - a summary

Over six parts I've gone through a bit of a journey on what Big Data Security is all about.
  1. Securing Big Data is about layers
  2. Use the power of Big Data to secure Big Data
  3. How maths and machine learning helps
  4. Why it's how you alert that matters
  5. Why Information Security is part of Information Governance
  6. Classifying Risk and the importance of Meta-Data
The fundamental point here is that encryption and ACLs provide only a basic hygiene factor when it comes to securing Big Data.  The risk and value of information is increasing, and by creating Big Data solutions businesses are creating more valuable, and therefore more at-risk, information solutions.  This means that Information Security needs to become a fundamental part of Information Governance and that new ways of securing that information are required.

This is where Big Data comes to its own rescue through the use of large data sets which enable new generations of algorithms to identify and then alert based on the risk and the right way to handle it.  This all requires you to consider Information Security as a core part of the Meta-data that is captured and governed around information.

The time to start thinking, planning and acting on Information Security is now.  It's not when you become the next Target or when one of your employees becomes your own personal Edward Snowden; it's now, and it's about having a business practice and approach that considers information as a valuable asset and secures it in the same way as other assets in a business are secured.

Big Data Security is a new generation of challenges and a new generation of risks; these require a new generation of solutions and a new corporate culture where information security isn't just left to a few people in the IT department.

Tuesday, January 13, 2015

Securing Big Data Part 6 - Classifying risk

So now your Information Governance groups consider Information Security to be important, you then have to think about how they should be classifying the risk.  Now there are docs out there on some of these which talk about frameworks.  British Columbia's government has one, for instance, that talks about High, Medium and Low risk, but for me that really misses the point and oversimplifies the problem, which ends up complicating implementation and operational decisions.

In a Big Data world it's not simply about the risk of an individual piece of information, it's about the risk in context.  So the first stage of classification is "what is the risk of this information on its own?" and it's that sort of classification that the BC Government framework helps you with.  There are some pieces of information (the Australian Tax File Number for instance) where the corporate risk is high just as an individual piece of information.  The Australian TFN has special handling rules and significant fines if handled incorrectly.  This means it's well beyond "Personally Identifiable Information", which many companies consider to be the highest level.  So at this level I'd recommend having five risk statuses

  1. Special Risk - Specific legislation and fines apply to this piece of information
  2. High - losing this information has corporate reputation and financial risk
  3. Medium - losing this information can impact corporate competitiveness
  4. Low - losing this information has no corporate risk
  5. Public - the information is already public
The point here is that this is about information as a single entity, a personal address, a business registration, etc.  That is only the first stage when considering risk.
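As a sketch only, those five single-entity statuses are simple enough to pin down in code so they can be attached to catalog records and checked automatically; the enum below just mirrors the list above.

```python
from enum import Enum

class DirectRisk(Enum):
    """The five single-entity risk statuses described above (lower value = higher risk)."""
    SPECIAL = 1  # specific legislation and fines apply, e.g. the Australian TFN
    HIGH = 2     # loss carries corporate reputation and financial risk
    MEDIUM = 3   # loss can impact corporate competitiveness
    LOW = 4      # loss carries no corporate risk
    PUBLIC = 5   # the information is already public
```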

The next stage is considering the Direct Aggregation Risk: this is about what happens when you combine two pieces of information together, and whether that changes the risk.  The categories remain the same but here we are looking at other elements.  So for instance address information would be low risk or public, but when combined with a person that link becomes higher risk.  Corporate information on sales might be medium risk, but when that is tied to specific companies or revenue it could become a bigger risk.  Also at this stage you need to look at the policy of allowing information to be combined, and you don't want to have an "always no" policy.

So what if someone wants to combine personal information with Twitter information to get personal preferences?  Is that allowed?  What is the policy for getting approval for new aggregations, how quickly is risk assessed and is business work allowed to continue while the risk is assessed?  When looking at Direct Aggregation you are often looking at where the new value will come from in Big Data, so you cannot just prevent that value being created.  So set up clear boundaries for where approval is required (combining PII with new sources requires approval, for instance) and where you can get approval after the fact (sales data with anything is OK, we'll approve at the next quarterly meeting or modify policy).
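To illustrate the Direct Aggregation idea, here is a minimal sketch that reuses the DirectRisk enum from above: the default rule is that a combination is never less risky than its riskiest source, and a table of known-sensitive combinations (my invented example, not a definitive policy) can push the result higher.

```python
def aggregation_risk(sources: dict, overrides: dict) -> DirectRisk:
    """Risk of a combined data set: at least as high as its riskiest source,
    and higher still if the combination itself is listed as sensitive."""
    riskiest_source = min(sources.values(), key=lambda r: r.value)  # SPECIAL=1 is the highest risk
    combination = overrides.get(frozenset(sources), riskiest_source)
    return min(riskiest_source, combination, key=lambda r: r.value)

# Address data on its own is Public, but tied to a person the link becomes higher risk
overrides = {frozenset({"person", "address"}): DirectRisk.HIGH}
print(aggregation_risk({"person": DirectRisk.MEDIUM, "address": DirectRisk.PUBLIC}, overrides))
# -> DirectRisk.HIGH
```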

The final stage is the most complex: the Indirect Aggregation Risk.  That is the risk where two sets of aggregated results are combined and, though independently they are not high risk, the pulling together of that information constitutes a higher-level risk.  The answer to this is actually to simplify the problem and consider aggregations not just as aggregations but as information sources in their own right.

This brings us to the final challenge in all this classification: Where do you record the risk?

Well this is just meta-data, but that is often the area that companies spend the least amount of time thinking about.  When looking at massive amounts of data, and particularly disparate data sources and their results, Meta-Data becomes key to Big Data.  But let's look just at the security side for the moment.


Data               Type         Direct Risk
Customer           Collection   Medium
Tax File Number    Field        Special
Twitter Feed       Collection   Public

and for Aggregations

Source 1       Source 2   Source 3   Source 4   Aggregation Name            Aggregation Risk
Customer       Address    Invoice    Payments   Outstanding Consumer Debt   High
Customer       Twitter    Location   -          Customer Locations          Medium
Organization   Address    Invoice    Payments   Outstanding Company Debt    Low


The point here is that you really need to start thinking about how you automate this and what tools you need.  In a Big Data world the heart of security is about being able to classify the risk and having that inform the Big Data anomaly detection, so you can inform the right people and drive the right response.

This gives us the next piece of classification that is required, which is about understanding who gets informed when there is an information breach.  This is a core part of the Information Governance and classification approach, because it's here that the business needs to say "I'm interested when that specific risk is triggered".  This is another piece of Meta-data, and one that then tells the Big Data security algorithms who should be alerted.
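Tying the aggregation tables above to the "who gets informed" question, the meta-data record ends up looking something like this sketch; the structure and the role names are illustrative rather than any specific tool's model.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AggregationRecord:
    """Meta-data for one aggregation: its sources, its assessed risk and who to alert on a breach."""
    name: str
    sources: List[str]
    risk: str                    # Special / High / Medium / Low / Public
    notify_on_breach: List[str]  # the business roles who said "I'm interested when this risk is triggered"

outstanding_consumer_debt = AggregationRecord(
    name="Outstanding Consumer Debt",
    sources=["Customer", "Address", "Invoice", "Payments"],
    risk="High",
    notify_on_breach=["Sales Director", "Chief Risk Officer"],
)
```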

If classification isn't part of your Information Governance group, or indeed you don't even have a business centric IG group then you really don't consider either information or its security to be important.

Other Parts in the series
  1. Securing Big Data is about layers
  2. Use the power of Big Data to secure Big Data
  3. How maths and machine learning helps
  4. Why it's how you alert that matters
  5. Why Information Security is part of Information Governance

Monday, January 12, 2015

Securing Big Data Part 5 - your Big Data Security team

What does your security team look like today?

Or the IT equivalent, "the folks that say no".  The point is that in most companies information security isn't actually something that is considered important.  How do I know this?  Well, because basically most IT Security teams are the equivalent of nightclub bouncers: they aren't the people who own the club, they aren't as important as the barman, certainly not as important as the DJ, and in terms of nightclub strategy their only input will be on the ropes being set up outside the club.

If information is actually important then information security is much more than a bunch of bouncers trying to keep undesirables out.  It's about the practice of information security and the education of information security; in this view Information Security is actually a core part of Information Governance, and Information Governance is very much a business-led thing.

Big Data increases the risks of information loss, because fundamentally you are not only storing more information, you are centralizing more information, which means more inferences can be made, more links made and more data stolen.  This means that historical thefts which stole data from a small number of systems risk being dwarfed by Big Data hacks which steal huge sets or even run algorithms within a data lake and steal the results.

So when looking at Big Data security you need to split governance into three core groups: Standards, Policy and KPI Management.
The point here is that this governance is exactly the same as your normal Data Governance; it's essential that Information Security becomes a foundation element of information governance.  The three different parts of governance are set up because there are different focuses:

  1. Standards - sets the gold standard of what should be achieved
  2. Policy - sets what can be achieved right now (which may not meet the gold standard)
  3. KPI Management - tracks compliance to the gold standard and adherence to policy
The reason these are not just a single group is that the motivations are different.  Standards groups set up what would be ideal, and it's against this ideal that progress can be tracked.  If you combine Standards groups with Policy groups you end up with Standards which are 'the best we can do right now', which doesn't give you something to track towards over multiple years.

KPI management is there to keep people honest.  This is the same sort of model I talked about around SOA Governance and it's the same sort of model that whole countries use, so it tends to surprise me when people don't understand the importance of standards versus policy and the importance of tracking and judging compliance independently from those executing.

So your Big Data Security team starts and ends with the Information Governance team.  If information security isn't a key focus for that team then you aren't considering information as important and you aren't worried about information security.


Other Parts in the series
  1. Securing Big Data is about layers
  2. Use the power of Big Data to secure Big Data
  3. How maths and machine learning helps
  4. Why it's how you alert that matters

Friday, January 09, 2015

Securing Big Data - Part 4 - Not crying Wolf.

In the first three parts of this I talked about how Securing Big Data is about layers, then about how you need to use the power of Big Data to secure Big Data, then how maths and machine learning help to identify what is reasonable and what is anomalous.

The Target Credit Card hack highlights this problem.  Alerts were made, lights did flash.  The problem was that so many lights flashed and so many alarms normally went off that people didn't know how to separate the important from the noise.  This is where many complex analytics approaches have historically failed: they've not shown people what to do.

If you want a great example of IT's normal approach to this problem then the Ethernet port is a good one.
What does the colour yellow mean normally?  It's a warning colour, so something that flashes yellow would be bad, right?  Nope, it just means that a packet has been detected... err, but doesn't the green light already mean that it's connected?  Well yes, but that isn't the point: if you are looking at a specific problem then the yellow NOT flashing is really an issue... so yellow flashing is good, yellow NOT flashing is bad...

Doesn't really make sense does it?  It's not a natural way to alert.  There are good technical reasons to do it that way (it's easier technically) but that doesn't actually help people.

With security this problem becomes amplified, and is often made worse through centralising reactions to a security team which knows security but doesn't know the business context.  The challenge therefore is to categorize the type of issue and have different mechanisms for each one.  Broadly these risks split into four groups, which the example below walks through: IT operations issues, Line of Business issues, company-wide IT security issues (the CISO's domain) and corporate risk issues (the CRO's domain).
It's important when looking at risks around Big Data to understand what group a risk falls into, which then indicates the right way to alert.  It's also important to recognize that as information becomes available an incident may escalate between groups.

So let's take an example.  A router indicates that it's receiving strange external traffic.  This is an IT operations problem and it needs to be handled by the group in IT ops which deals with router traffic.  Then the Big Data security detection algorithms link that router issue to the access of sales information from the CRM system.  This escalates the problem to the LoB level; it's now a business challenge and the question becomes a business decision on how to cut or limit access.  The Sales Director may choose to cut all access to the CRM system rather than risk losing the information, or may consider it to be a minor business risk when lined up against closing the current quarter.  The point is that the information is presented in a business context, highlighting the information at risk so a business decision can be taken.

Now let's suppose that the Big Data algorithms link the router traffic to a broader set of attacks on the internal network, a snooping hack.  This is where the Chief Information Security Officer comes in: that person needs to decide how to handle this broad-ranging IT attack.  Do they shut down the routers and cut the company off from the world?  Do they start dropping and patching, and do they alert law enforcement?

Finally the Big Data algorithms find that credit card data is at risk.  Suddenly this becomes a corporate reputation risk issue and needs to go to the Chief Risk Officer (or the CFO if they have that role) to take the pretty dramatic decisions that need to be made when a major cyber attack is underway.

The point here though is that how it's highlighted and escalated needs to be systematic; it can't all go through a central team.  The CRO needs to be automatically informed when the risk is sufficient, but only be informed then.  If it's a significant IT risk then it's the job of the CISO to inform the CRO, not for every single risk to be highlighted to the CRO as if they need to deal with them.

The basic rule is simple: does the person seeing this alert care about this issue?  Does the person seeing this alert have the authority to do something about this issue?  And finally, does the person seeing this alert have someone lower in their reporting chain who answers 'yes' to those questions?

If you answer "Yes, Yes, No" then you've found the right level and need to concentrate on the mechanism.  If it's "Yes, Yes, Yes" then you are in fact cluttering their view if you show them everything that every person in their reporting tree handles as part of their job.
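That "Yes, Yes, No" rule is mechanical enough to express directly.  The sketch below walks a reporting chain from the bottom up and stops at the lowest person who both cares and has authority; the chain, the roles and the two predicate functions are stand-ins for whatever your organisation and risk meta-data actually define.

```python
def alert_target(reporting_chain, cares, has_authority):
    """reporting_chain is ordered top-down, e.g. ["CRO", "CISO", "Network Ops"].
    Walking bottom-up, return the lowest person who answers Yes to both questions,
    which by construction means nobody below them does: the "Yes, Yes, No" level."""
    for person in reversed(reporting_chain):
        if cares(person) and has_authority(person):
            return person
    return reporting_chain[0]  # nobody lower qualifies, so it goes to the top of the chain

# Illustrative use: a router compromise that puts CRM sales data at risk
chain = ["CRO", "CISO", "Network Ops"]
print(alert_target(chain,
                   cares=lambda p: p in {"CISO", "CRO"},           # ops can't see the business impact
                   has_authority=lambda p: p in {"CISO", "CRO"}))  # -> "CISO", not the CRO
```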

In terms of the mechanism it's important to think about that "flashing yellow light" on the Ethernet port.  If something is OK then green is good; if it's an administrative issue (patch level on a router) then it needs to be flagged into the tasks to be done; if it's an active and live issue it needs to come front and center.

In terms of your effort when securing Big Data you should be putting more effort into how you react than into almost any other stage in the chain.  If you get the last part wrong then you lose all the value of the former stages.  This means you need to look at how people work and at what mechanisms they use.  So should the CRO be alerted via a website they have to go to, or via an SMS to the mobile they carry around all the time that takes them to a mobile application on that same device?  (Hint: it's not the former.)

This is the area where I see the least effort made and the most mistakes being made, mistakes that normally amount to "Crying Wolf": showing every single thing and expecting people to filter out thousands of minor issues and magically find the things that matter.

Target showed that this doesn't work.



Thursday, January 08, 2015

Securing Big Data - Part 3 - Security through Maths

In the first two parts of this I talked about how Securing Big Data is about layers, and then about how you need to use the power of Big Data to secure Big Data.  The next part is "what do you do with all that data?".  This is where Machine Learning and Mathematics come in; in other words it's about how you use Big Data analytics to secure Big Data.

What you want to do is build up a picture of what represents reasonable behaviour; that is why you want all of that history and range of information.  It's the full set of that, across not single actions but millions of actions and interactions, that builds the picture of reasonable.  It's reasonable for a sys-admin to access a system; it's not reasonable for them to download classified information to a USB stick.

A single request is something you control using an ACL, but that doesn't include the context of the request (it's 11pm, why is someone accessing that information that late at all?).

You also need to look at the aggregated requests - they've looked at the next quarter's sales forecast while also browsing external job hunting sites and typing up a resignation letter.

Then you need to look at the history of that - oh, it's normal for someone to be doing that at quarter end, all the sales people tend to do that.

This gives us the behaviour model for those requests which leads to us understanding what is considered reasonable.  From reasonable we can then identify anomalous behaviour (behaviour that isn't reasonable).

No human-defined and managed system can handle this amount of information, but Machine Learning algorithms just chomp up this sort of data and create the models for you.  This isn't a trivial task and it's certainly massively more complex than the sorts of ACLs, encryption criteria and basic security policies that IT is used to.  These algorithms need tending, they need tuning and they need monitoring.
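As a purely illustrative sketch of the idea, and emphatically not a production security model, an off-the-shelf anomaly detector such as scikit-learn's IsolationForest can be trained on features describing "reasonable" access behaviour and then asked whether a new request fits; the three features and the numbers here are invented for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Each row is one access event: [hour of day, MB downloaded, requests in the last hour].
# In reality these features would be distilled from months of access, network and machine logs.
reasonable_behaviour = np.array([
    [9, 5, 12], [10, 8, 20], [14, 3, 9], [16, 6, 15], [11, 4, 11],
    [15, 7, 18], [13, 5, 14], [10, 6, 16], [9, 2, 8], [17, 9, 22],
])

model = IsolationForest(contamination=0.01, random_state=42).fit(reasonable_behaviour)

# 11pm, 900MB pulled, 300 requests in the last hour: nothing like the picture of "reasonable"
suspicious = np.array([[23, 900, 300]])
print(model.predict(suspicious))  # -1 means the model flags it as anomalous
```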

Choosing the right type of algorithm (and there are LOTS of different choices) is where Data Scientists come in: they can not only select the right type of algorithm but also tune and tend it so it produces the most effective set of results consistently.

What this gives you however is business-centric security, that is security that looks at how a business operates.  Anomalous Behaviour Detection therefore represents the way to secure Big Data by using Big Data.

The final challenge is then on how to alert people so they actually react.

Wednesday, January 07, 2015

Securing Big Data - Part 2 - understanding the data required to secure it

In the first part of Securing Big Data I talked about the two different types of security.  There is the traditional IT and ACL security that needs to be done to match traditional solutions with an RDBMS, but that is pretty much where those systems stop in terms of security, which means they don't address the real threats out there, which are to do with cyber attacks and social engineering.  An ACL is only any good if people do what they are expected to do.  Both the Target credit card hack and the NSA hack by Edward Snowden involved some sort of insider breach, so ACL approaches were part of the problem, not the solution.

With Big Data and Hadoop we have another option: to actually use Hadoop to secure the information in Hadoop.  This means getting more information into Hadoop, particularly network and machine log information.

So why do you need all this information?  Well, you obviously need to know the access requests and what data is accessed, but you also need to know where the network packets go.  Are they going to a normal desktop or corporate laptop, or do they head out of the company and onto the internet?  Does the machine that is downloading information have a USB drive plugged in, and are the files being copied to it?  Is there a person logged into the machine or is it just a headless workstation?

The point here is that when looking at securing Big Data you need to take a Big Data view of security.  This means going well beyond traditional RDBMS approaches and not building a separate security silo but instead looking at the information, adding in information about how it is accessed and information about where that is accessed from and how.
This information builds up a single request view, but by storing that information you can start building up a profile to understand what other information has been requested and how that matches or can be linked.  Thus if someone makes 100 individual requests that on their own are OK but taken in aggregate represent a threat, then it's the storage of that history of requests that gives you the security.
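A sketch of what "storing the history and building a profile" can look like in practice, using PySpark since that is how most people drive Hadoop from code these days; the paths, field names and the choice of a per-user daily profile are all invented for the illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("access-profiles").getOrCreate()

# Hypothetical landed logs: one record per data access, one per network flow
access = spark.read.json("hdfs:///security/landing/access_events/")   # user, dataset, bytes, event_time
network = spark.read.json("hdfs:///security/landing/network_flows/")  # user, dest_ip, external_destination

# Enrich each access request with where the traffic actually went, then build a
# per-user, per-day profile instead of judging each request on its own
profile = (access.join(network, "user", "left")
                 .groupBy("user", F.to_date("event_time").alias("day"))
                 .agg(F.countDistinct("dataset").alias("datasets_touched"),
                      F.sum("bytes").alias("bytes_pulled"),
                      F.max(F.col("external_destination").cast("int")).alias("hit_external_destination")))

profile.write.mode("overwrite").parquet("hdfs:///security/profiles/daily/")
```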

So to secure Big Data we don't need to just look at individual requests, the ACL model; we need to start building up the broad picture of how the information is accessed, what context it is accessed within and where that information ends up.  Securing Big Data, or in reality any data, is about looking at the business context rather than the technical focus of encryption and ACLs, which are simply hygiene factors.

Tuesday, January 06, 2015

Securing Big Data - Part 1

As Big Data and its technologies such as Hadoop head deeper into the enterprise so questions around compliance and security rear their heads.

The first interesting point in this is that it shows the approach to security that many of the Silicon Valley companies that use Hadoop at scale have taken, namely not very much at all.  Protecting information clearly hasn't been seen as a massively important thing, as there just aren't the basic pieces within Hadoop to do it.  It's not something that has been designed to be secure from day one; it's designed to be easy to use and to do a job.  It's as governments and enterprises with compliance departments begin to look at Hadoop that these requirements are really surfacing.

But I'm not going to talk about the encryption and tokenization solutions; those are hygiene factors but not really about securing the information, because it's still going to be accessible to people with the right permissions.  Because of that it's still at risk from cyber or social engineering attacks.  This means that security for Big Data is about layering, and really it's about two different ways of viewing security.

IT solutions, technical solutions, can really help around access control and encryption, but they don't really help you to actually prevent information being stolen.  What stops information leaking is the business decisions you make.

The first one of those is: what information do I actually store?  So you might decide that you'll just store IDs for customer information in Hadoop and then use a more traditional store to provide the cross-reference of IDs back to customer information when it's required.
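As a trivial sketch of that decision, the landing process might swap the identifying fields for an opaque ID and keep the ID-to-person cross-reference in a conventional, tightly controlled store; every name and field here is invented to show the shape of the idea, nothing more.

```python
import uuid

# Stands in for a small, tightly controlled traditional store holding the sensitive cross-reference
pii_store = {}

def tokenize_customer(record: dict) -> dict:
    """Return what gets landed in Hadoop: an opaque ID plus the non-identifying fields only."""
    customer_id = str(uuid.uuid4())
    pii_store[customer_id] = {key: record[key] for key in ("name", "address")}
    return {"customer_id": customer_id,
            "segment": record["segment"],
            "lifetime_value": record["lifetime_value"]}
```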

Above this however is actually using Hadoop to protect Hadoop.  What is Hadoop?  It's a Big Data analytics platform.  What is the biggest threat around data?  Social engineering or cyber 'inside the walls' attacks.  So the first stage in securing Hadoop is to do the IT basics, but the next stage of securing the business of information is about using the power of Hadoop to secure the information stored within it.

Thursday, May 22, 2014

How to select a Hadoop distro - stop thinking about Hadoop

Sqoop, Flume, Pig, ZooKeeper.  Do these mean anything to you?  If they do then the odds are you are looking at Hadoop.  The thing is that while that was cool a few years ago it really is time to face it: HDFS is a commodity, Map Reduce is interesting but not feasible for most users, and the real question is how we turn all that raw data in HDFS into something we can actually use.

That means three things

  1. Performant and ANSI-compliant SQL matters - if you aren't able to run a traditional reporting package then you are making people change for no reason.  If you don't have an alternative then you aren't offering an answer
  2. Predictive analytics, statistical, machine learning and whatever else they want - this is the stuff that will actually be new to most people
  3. Reacting in real-time - and I mean FAST, not BI fast but ACTUALLY fast
The last one is about how you ingest data and then perform real time analytics which are able to incorporate forecasting information from Hadoop into real-time feedback that can be integrated into source systems.

So Hadoop and HDFS are actually the least important part of your future; they're critical but not important.  I've seen people spend ages looking at the innards rather than just getting on and actually solving problems.  Do you care what your mobile phone network looks like internally?  Do you care what the wiring back to the power station looks like?  HDFS is that for data: it's the critical substrate, something that needs to be there.  But where you should concentrate your efforts is on how it supports the business use cases above.

How does it support ANSI-compliant SQL?  How does it support your standard reporting packages?  How will you add new types of analytics, and does it support the advanced analytics tools your business already successfully uses?  How does it enable real-time analytics and integration?

Then of course it's about how it works within your enterprise: how does it work with data management tools, and how does its monitoring fit in with your existing tools?  Basically, is it a grown-up or a child in the information sand-pit?

Now this means it's not really about the Hadoop or HDFS piece itself, it's about the ecosystem of technologies into which it needs to integrate.  Otherwise it's going to be just another silo of technologies that doesn't work well with everything else and ultimately doesn't deliver the value you need.

Thursday, April 24, 2014

Data Lakes will replace EDWs - a prediction

Over the last few years there has been a trend of increased spending on BI, and that trend isn't going away.  The analyst predictions however have, understandably, been based on the mentality that the choice was between a traditional EDW/DW model or Hadoop.  With the new 'Business Data Lake' type of hybrid approach it's pretty clear that the shift is underway for all vendors to have a hybrid approach rather than a simple choice between Hadoop or a Data Warehouse.  So taking the average of a few analysts' figures we get a graph that looks like this
In other words, 12 months ago there was no real prediction of hybrid architectures.  Now however we see SAP talking about hybrid, IBM about DB2 and Hadoop, and Teradata doing the same.  So we need to think about what that means.  What it means is that we'll see a switch between traditional approaches and hybrid Data Lake centric architectures that will start now and accelerate rapidly.
My prediction therefore is that these Hybrid Data Lake architectures will rapidly become the 'new normal' in enterprise computing.  There will still be more people taking traditional approaches this year and next but the choice for people looking at this is whether they want to get on the old bus or the new bus.  This for me is analogous to what we saw around proprietary EAI against Java based EAI around the turn of the century.  People who chose the old school found themselves in a very bad place once the switch had happened.

What I'm also predicting is we will see a drop rather than a gain in 'pure' Hadoop projects as people look to incorporate Hadoop as a core part of an architecture rather than standalone HDFS silos.


Thursday, February 06, 2014

NoSQL? No Thanks

There continues to be a disproportionate amount of hype around 'NoSQL' data stores.  By disproportionate I mean 'completely and utterly out of scale with the actual problems of the vast majority of companies'.  I wrote before about 'how NoSQL became more SQL'.  The point I made there is now more apparent the more I work with companies on Big Data challenges.

There are three worlds of data interaction developing

  1. Traditional Reporting - it's SQL, deal with it
  2. Complex Analytics - it's about the tools and languages: R, SAS, MADlib, etc
  3. Embedding in applications
The point here is that getting all those reports, and more importantly all those people who write reports, re-written using a NoSQL approach makes no sense.  Sure, statistical languages and tools aren't SQL, but is it right to claim they are NoSQL approaches?  I'd argue not.  The use of a NoSQL database such as Hadoop or MongoDB is about the infrastructure behind it; it's hidden from the users, so while it may make good technical sense to use such a data store it really doesn't change the way the users are working.

The point in these two areas is that it's about the tools that people use to interact with information and supporting the languages they use to interact with that information.  The infrastructural question is simply one of abstraction and efficiency.  Like caring about whether your laptop is connecting over 802.11g or 802.11n: yes, I know you care, but that is because you are a techy.  The person using their iPad doesn't care as long as the videos stream from YouTube successfully.  It's the end-user experience that counts, not the infrastructure.

The final case is the world of developers, and here is another shock: business users couldn't care less what developers use as long as they deliver.  If you can deliver better using SQL then use that, if you can deliver better using NoSQL then use that, and if you can deliver better by using a random number generator and the Force then go for it.  Again however the business doesn't care if you use NoSQL or not, and nor should they.  What they care about is that it works, meets the business requirements and non-functionals, and can be changed when they need it to.

Stop trying to force a technical approach onto the business, start hiding your technical infrastructure while giving them the tools and languages that they want.

Monday, January 27, 2014

Six things to make your Big Data project succeed

So I wrote about why your Hadoop project will fail, so I think it's only right that I should follow up with some things that you can do to actually make the Big Data project you take on succeed.  The first thing you need to do is stop trying to make 'Big Data' succeed and instead start focusing on how you educate the business on the value of information, and then work out how to deliver new value... that just so happens to be delivered with Big or Fast Data technologies.

Don't try and change the business
The first thing is to stop trying to see technology as being a goal in itself and complaining when the business doesn't recognise that your 'magic' technology is the most important thing in the world.  So find out how the business works, look at how people actually work day to day and see how you can improve that.

Sounds simple?  Well the good news is that it is, but it means you need to forget about technology until you know how the business works.

Explain why Information Matters
The next bit, after you've understood the business better, is to explain to them why they should care about information.  Digitization is the buzzword you need to learn; folks like MIT Sloan (Customer Facing Digitization), Harvard Business Review (are you ready for Digitization?) and Davos (Digitization and Growth) are saying that this is the way forwards.  And what is Digitization?  In the raw it's just about converting stuff into digital formats, but the reality is that it's about having an information- and analytics-driven business.  The prediction of all the business schools is that companies that do this will outperform their competition.

This is an important step: it's about shifting information from being a technology and IT conversation towards the business genuinely seeing information as a critical part of business growth.  It's also about you as an IT professional learning how to communicate technology changes in the language the business wants to hear.  They don't want to hear 'Hadoop', they want to hear 'Digitization'.

Find a problem that needs a new solution
The next key thing is finding a problem that isn't well served by your current environments.  If you could solve a problem by just having a new report on an EDW then it really doesn't prove anything to use new technologies to do that in a more time-consuming way.  The good news is there are probably loads of problems out there not well served by your current environments.  From volume challenges around sensor data and click streams, through real-time analytics and predictive analytics, to data discovery and ad-hoc information solutions, there are lots of business problems.

Find that problem, find the person or group in the business that cares about having that problem solved and be clear about what the benefits of solving that problem are.

Get people with the 'scars and ribbons'
What do I do when I work with a new technology?  Two things: firstly I get some training and from that build something for myself that helps me learn.  If I'm doing it at work in building a business I then go and find someone who has already done this before and hire them or transfer them into my team.

Bill Joy once said that the smartest people weren't at Sun, so they should learn from outside.  I'm not Bill Joy, you aren't Bill Joy, so we can certainly learn from outside.  Whether this means going to a consultancy who has done it before or hiring people in who have done it before doesn't really matter.  The point is that unless you really are revolutionising the IT market you are doing something that someone has done before, so your best bet is to learn from their example.

It stuns me how many people embark on complex IT projects having never used the technology before and are then surprised that the project fails.  Get people with the 'scars and ribbons' who can tell you what not to do which is massively more important than what to do.

Throw out some of your old Data Warehouse thinking
The next bit is something that you need to forget, a cherished truth that no longer holds.  Get rid of the notion that your job as a data architect is to dictate a single view to the business.  Get rid of the cherished ETL process.  Land the data in Hadoop, all the data you can; don't worry if you think you might not use it, you are landing it in Hadoop and then turning it into the views or analytics.  There is no benefit in not taking everything across and lots of benefits in doing so.

In other words you've got the problem, that is the goal; now go and collect all the data but don't worry about the full A-Z straight away by defining Z and working backwards.  Understand the data areas, drop that into Hadoop and then worry about what the right A-Z is today, knowing that if it's a different route tomorrow you've got the data ready to go without updating the integration.

Then if you have another problem that needs access to the same data, don't automatically try and make one solution do two things.  It's perfectly OK to create a second solution to solve that problem on top of Hadoop.  You don't need everyone to agree on a single schema, you just need to be able to solve the problem.  The point here is that to get different end-results you need to start thinking differently.
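In practice that "land it all, build the view the problem needs" approach can be as simple as pointing Spark SQL (or Hive) at the raw landing area and carving out problem-specific views on top.  The sketch below is an invented illustration of the pattern; the paths, fields and view names are mine, not a prescription.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("problem-views").getOrCreate()

# Everything landed raw, no up-front target schema to agree on
spark.read.json("hdfs:///landing/crm/").createOrReplaceTempView("crm_raw")
spark.read.json("hdfs:///landing/clickstream/").createOrReplaceTempView("clicks_raw")

# Problem 1 gets its own view over the landed data...
churn_view = spark.sql("""
    SELECT c.customer_id, c.segment, COUNT(k.page) AS visits_last_30d
    FROM crm_raw c LEFT JOIN clicks_raw k ON c.customer_id = k.customer_id
    GROUP BY c.customer_id, c.segment
""")

# ...and Problem 2 builds a completely different view over the same raw data,
# without anyone having to agree a single shared schema first
```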

Don't get hung up on NoSQL, don't get hung up on Hadoop
The final thing is the dirty secret of the Hadoop world that has rapidly become the bold proclamation: NoSQL really isn't for everyone and SQL is perfectly good for lots of cases.  Hive, Impala and HAWQ are all addressing exactly that challenge, and you shouldn't limit yourself to Hadoop-friendly approaches; if the right way is to push it from Hadoop to your existing data warehouse... do it.  If the requirement is to have some fast data processing then do that.

The point here is that your goal is to show how the new technologies are more flexible and better able to adapt to the business, and how the new IT approach is to match what the business wants, not to try and force an EDW onto it every time.


The point here is that making your Big Data program succeed is actually about having the business care about the value that information brings and then fitting your approach to match what the business wants to achieve.

The business is your customer; time to do what they want, not force an EDW down their throats.

Monday, January 06, 2014

Six reasons your Big Data Hadoop project will fail in 2014

Ok so Hadoop is the bomb, Hadoop is the schizzle, Hadoop is here to solve world hunger and all problems.  Now I've talked before about some of the challenges around Hadoop for enterprises but here are six reasons that Information Week is right when it says that Hadoop projects are going to fail more often than not.

1. Hadoop is a Java thing not a BI thing
The first is the most important challenge.  I'm a Java guy, a Java guy who thinks that Java has been driven off a cliff by its leadership in the last 8 years but that it's still one of the best platforms out there.  However the problems that Hadoop is trying to address are analytics problems, BI problems.  Put briefly, BI guys don't like Java guys and Java guys don't like BI guys.  For Java guys Hadoop is yet more proof that they can do everything, but BI guys know that custom build isn't an efficient route to deliver all of those BI requirements.

On top of that the business folks know SQL, often they really know SQL; SQL is the official language of business and data.  So a straight 'No-SQL' approach is doomed to fail, as you are speaking French to the British.  2014 will be the year when SQL on Hadoop becomes the norm, but you are still going to need your Java and BI guys to get along, and you are going to have to recognise that SQL beats No-SQL.

2. You decide to roll your own
Hadoop is open source; all you have to do is download it, install it and off you go, right?  There are so many cases of people not doing that right that there is an actual page explaining why they won't accept those as bugs.  Hadoop is a bugger to install; it requires you to really understand how distributed computing works, and guess what?  You thought you did, but it turns out you really didn't.  Distributed computing and multi-threaded computing are hard.

There are three companies you need to talk to: Pivotal, Cloudera and Hortonworks.  How easy can they make it?  Well, Pivotal have an easy Pivotal HD Hadoop Virtual Machine to get you started and even claim that they can get you running a Hadoop cluster in 45 minutes.

3. You are building a technical proof of concept... why?
One reason that your efforts will fail is that you are doing a 'technical proof of concept' at the end of which you will amazingly find that something used in some of the biggest analytics challenges on planet earth at the likes of Yahoo fits your much, much smaller challenge.  Well done, you've spent money proving the obvious.

Now what?  How about solving an actual business problem?  Actually, why didn't you start by solving an actual business problem as a way to see how it would work for what the business faces?  Technical proofs of concept are pointless; you need to demonstrate to the business how this new technology solves their problems in a better (cheaper, faster, etc) way.

4. You didn't understand what Hadoop was bad at
Hadoop isn't brilliant at everything analytical... shocking eh?  So that complex analytics you want to do, which is effectively a complex 25-table join followed by the analytics... yeah, that really isn't going to work too well.  Those bits where you said that you could do that key business use case faster and cheaper and then it took 2 days to run?

Hadoop is good at some things, but it's not good at everything.  That is why folks are investing in SQL technologies on top of Hadoop, such as Pivotal's HAWQ or Cloudera's Impala, with Pivotal already showing how the bridge between traditional MPP and Hadoop is going to be made.

5. You didn't understand that it's part of the puzzle
One of the big reasons that Hadoop pieces fail to really deliver is that they are isolated silos; they might even be doing some good analytics, but people can't see that analytics where they care about it.  Sure, you've put up some nice web-pages for people, but they don't use those in their daily lives.  They want to see the information pushed into the Data Warehouse so they can see it in their reports, they want it pushed to the ERP so they can make better decisions... they might want it in many, many places but you've left it in the one place where they don't care about it.

When looking at the future of your information landscape you need to remember that Hadoop and NoSQL are just a new tool, a good new tool and one that has a critical part to play, but it's just one new tool in your toolbox.

6. You didn't change
The biggest reason that your Hadoop project will fail however is that you've not changed some of your basic assumptions and looked at how Hadoop enables you to do things differently.  So you are still doing ETL to transform into some idealised schema which is based on a point-in-time view of what is required.  You are doing that into a Hadoop cluster which couldn't care less about redundant or unused data, and where the costs of that are significantly lower than doing another set of ETL development.

You've carried on thinking about grand enterprise solutions to which everyone will come and be beholden to your technical genius.

What you've not done is sit back and think 'the current way sucks for the business, can I change that?', because if you had you'd have realised that using Hadoop as a data substrate/lake layer makes more sense than ETL, and you'd have realised that it's actually local solutions that get used the most, not corporate ones.

Your Hadoop project will fail because of you
The main reason Hadoop projects will fail is because you approach using a new technology with an old mindset.  You'll try and build a traditional BI solution in a traditional BI way, you'll not understand that Java doesn't work like that, you'll not understand how Map Reduce is different to SQL, and you'll plough on regardless and blame the technology.

Guess what though?  The technology works at massive scale, much, much bigger than anything you've ever deployed.  It's not the technology, it's you.

So what to do?
.... I think I'll leave that for another post

Thursday, December 05, 2013

How Business SOA thinking impacts data

Over the years I've written quite a bit about how SOA, when viewed as a tool for Business Architecture, can change some of the cherished beliefs in IT.  One of these was about how the Single Canonical Form was not for SOA, and others have talked about how MDM and SOA collaborate to deliver a more flexible approach.  Central to all of these things has been that this has been about information and transactions 'in the now': the active flow of information within and between businesses, and how you make that more agile while ensuring such collaboration can be trusted.

Recently however my challenge has been around data, the post-transactional world, where things go after they've happened so people can do analytics on them.  This world has been the champion of the single canonical form, massive single schemas that aim to encompass everything, with the advantage over the operational world that things have happened so the constraints, while evident, are less of a daily impact.

The challenge of data however is that the gap between the post-transactional and the operational world has disappeared.  We've spent 30 years in IT creating a hard wall between these areas, creating Data Warehouses which operate much like batch-driven mainframes and where the idea of operational systems directly accessing them has been aggressively discouraged.  The problem is that the business doesn't see this separation.  They want to see analytics and its insight delivered back into the operational systems to help make better decisions.

So this got me thinking: why is it that in the SOA world and operational world it's perfectly OK for local domains and businesses to have their own view on a piece of data, an invoice, a customer, a product, etc, but when it comes to reporting they all need to agree?  Having spent a lot of time on MDM projects recently, the answer was pretty simple:
They don't
With MDM the really valuable bit is the cross-reference; it's the bit that enables collaboration.  The amount of standardisation required is actually pretty low.  If Sales has 100 attributes for Customer and Finance only 30, and in fact it only takes 20 to uniquely identify the customer, then it's that 20 that really matter to drive the cross-reference.  If there isn't any value in agreeing on the other attributes then why bother investing in it?  Getting agreement is hard enough without trying to do it in areas where the business fundamentally doesn't care.

This approach to MDM helps to create shorter, more targeted programs, and programs that are really suited to enabling business collaboration.  You don't need to pass the full customer record, you just pass the ID.
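Stripped to its essentials, that cross-reference is a very small thing; the sketch below (with invented system names and IDs) is really all the agreement the domains need.

```python
# The master cross-reference: one golden ID mapped to each system's local key.
# Sales can keep its 100 customer attributes and Finance its 30; only the
# identifying handful needs agreement, and only the ID goes on the wire.
customer_xref = {
    "MDM-000123": {"sales_crm": "CRM-88731", "finance_erp": "FIN-5521", "web": "usr_9a2c"},
    "MDM-000124": {"sales_crm": "CRM-90112", "finance_erp": "FIN-6203", "web": "usr_1f0d"},
}

def local_id(master_id: str, system: str) -> str:
    """What a collaborating domain actually needs: the other system's key, not the full record."""
    return customer_xref[master_id][system]
```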

So what does this combination of MDM and SOA mean for data, particularly as we want analytics to be integrated back into operations?
Data solutions should look more like Business SOA solutions and match the way the business works
In simple terms it means the sort of thinking that led to flexibly integrated SOA solutions should now be applied to data.  Get rid of that single schema, concentrate on having data served up in a way that matches the requirements of the business domains, and concentrate governance on where it's required to give global consistency and drive business collaboration.  That way you can ensure that the insights being created will be able to be managed in the same way as the operational systems.

With SOA the problem of people building applications 'in the bus' led me to propose a new architectural approach where you don't have one ESB that does everything but accept that different parts of the business will want their own local control.  The Business Service Bus concept was built around that, and with the folks at IBM, SAP, Microsoft and Oracle all ensuring that pretty much everyone ends up with a couple of ESB-type solutions, it's the sort of architecture I've seen work on multiple occasions.  That sort of approach is exactly what I now think applies to data.

The difference?

Well, with data and analytical approaches you probably want to combine data from multiple sources, not just your own local information.  Fortunately new (Java) technologies such as Hadoop are changing the economics of storing data, so instead of having to agree on schemas you can just land all of your corporate data into one environment and let those business users build business-aligned analytics which sit within their domain, even if they are using information shared by others.  MDM allows that cross-reference to happen in a managed way, but a new business-aligned approach removes the need for total agreement before anything can get done.

With Business SOA-driven operations we had the ability to get all the operational data in real-time and aggregate at the BSB level if required; with Business SOA-driven data approaches we can land all the information and then enable the flexibility.  By aligning both the operational and post-transactional models within a single, consistent, business-aligned approach we start doing what IT should have been doing all along:
Creating an IT estate that looks like the business, evolves like the business and is costed in line with the value it delivers.
Applying Business SOA thinking to data has been really interesting and is what led to the Business Data Lake concept.  It's early days clearly, but I really do believe that getting the operational and data worlds aligned to the business in a consistent way is going to be the way forwards.

This isn't a new and radical approach; it's just applying what worked in operations to the data space and recognising that if the goal of analytics is to deliver insight back into operations then that insight needs to be aligned to the business operations from the start, so it can adapt and change as the operational business requires.

The boundaries between the operational and post-transactional worlds have gone; the new boundaries are defined by the business, and the governance in both areas is about enabling consistent collaboration.

Monday, July 15, 2013

Minimum on the wire, everything in the record

I've talked before about why large canonical models are a bad idea and how MDM makes SOA, BPM and a whole lot of things easier.  This philosophy of 'minimum on the wire' helps to create more robust infrastructures that don't suffer from a fragile base class problem and better match the local variations that organisations always see.

One of the things I really like about IT however is how it keeps throwing up new challenges, and how simple approaches really make it easier to address those new challenges.  One of those is the whole Big Data challenge, which I've been looking at quite a bit in the last 12 months, and there is a new philosophy coming out of that which is 'store everything': not 'store everything in a big data warehouse' but 'store everything as raw data in Hadoop'.  There are really three sources of 'everything'
  1. Internal Transactional systems
  2. External Information & transactional systems
  3. Message passing systems
So we now have a really interesting situation where you can minimise what is on the wire but store much more information.  By dropping the full MDM history and cross-reference into Hadoop you can use that to say exactly what the information state was at the point in time when a message was passed across the bus.  In other words you can have a full audit trail of both the request and the impact that the request had on the system.

One of the big advantages of Hadoop is that it doesn't require you to have that big canonical model, and again this is where MDM really kicks in with an advantage.  If you are looking for all transactions from a given customer you just take the system x-ref from MDM as your input, and then you can do a federated Map Reduce routine to get the system-specific information without having to go through a massive data mapping exercise.
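A sketch of that federated lookup, written in PySpark rather than raw Map Reduce and with invented paths and field names: the MDM cross-reference for one customer drives a query against each system's raw landed data, and the results are simply unioned at the end (the allowMissingColumns option assumes Spark 3.1 or later).

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("federated-customer-lookup").getOrCreate()

# The MDM x-ref for one customer: each system's local key for the same golden record (illustrative)
xref = {"sales_crm": "CRM-88731", "finance_erp": "FIN-5521", "billing": "BIL-4410"}

# Each system's transactions were landed raw with their own schema; no canonical model needed
per_system = []
for system, local_key in xref.items():
    df = spark.read.parquet(f"hdfs:///landing/{system}/transactions/")
    per_system.append(df.where(F.col("customer_ref") == local_key)
                        .withColumn("source_system", F.lit(system)))

# Union the per-system results to get every transaction for that one customer
all_txns = per_system[0]
for df in per_system[1:]:
    all_txns = all_txns.unionByName(df, allowMissingColumns=True)  # tolerates differing schemas
```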

Hadoop means there is even less reason to have lots flying about on the wire and even more justification for MDM being at the heart of a decent information approach.

Thursday, July 11, 2013

Google and Yahoo have it easy or why Hadoop is only part of the story

We hear lots and lots of hype at the moment around Hadoop, and it is a great technology approach, but there is also lots of talk about how this approach will win because Google and Yahoo are using it to manage their scale and thus this shows that their approach is going to win in traditional enterprises and other big data areas.

Let's be clear, I'm not saying Hadoop isn't a good answer for managing large amounts of information; what I'm saying is that Hadoop is only part of the story, and it's arguably not the most important part.  I'm also saying that Google and Yahoo have a really simple problem they are attempting to fix; in comparison with large scale enterprises and the industrial internet they've got it easy.  Sure they've got volume, but what is the challenge?
  1. Gazillions of URIs and unstructured web pages
  2. Performant search
  3. Serving ads related to that search
I'm putting aside the Gmails and Google Apps for a moment as those aren't part of this Hadoop piece, but I'd argue they are, like Amazon, more appropriate reference points for enterprises looking at large scale.

So why do Google and Yahoo have it easy?

First off, while it's an unstructured data challenge, this means that data quality isn't a challenge they have to overcome.  If Google serve you up a page when you search for 'Steve Jones' and you see the biology prof, the Sex Pistols guitarist and the Welsh model and you are looking for another Steve Jones, you don't curse Google because it's the wrong person, you just start adding new terms to try and find the right one; if Google slaps the wrong Google+ profile on the results you just sigh and move on.  Google don't clear up the content.

Not worrying about data quality is just part of not having to worry about the master data and reference data challenge.  Google and Yahoo don't do any master data work or reference data work; they can't, as their data sets are external.  This means they don't have to set up governance boards or operational process changes to take control of data, they don't need to get multiple stakeholders to agree on definitions, and no regulator will call them to account if a search result isn't quite right.

So the first reason they have it easy is that they don't need to get people to agree.

The next reason is something that Google and Yahoo do know something about, and that is performance, but here I'm not talking about search results, I'm talking about transactions: the need to have a confirmed result.  Boring old things like atomic transactions and, importantly, the need to get back in a fast time.  Now clearly Google and Yahoo can do the speed part, but they have a wonderful advantage of not having to worry about the whole transactions stuff.  Sure, they do email at a great scale and they can custom develop applications to within an inch of their life... but that isn't the same as getting Siebel, SAP, an old Baan system and three different SOA and EAI technologies working together.  Again there is the governance challenge and there is the 'not invented here' challenge that you can't ignore.  If SAP doesn't work the way you want... well, you could waste time customising it, but you are better off working with what SAP does instead.

The final reason that Google and Yahoo have it easy is talent and support.  Hadoop is great, but as I've said before companies have a Hadoop Hump problem, and this is completely different to the talent engines at Google and Yahoo.  Both pride themselves on the talent they hire and that is great, but they also pay top whack and have interesting work to keep people engaged.  Enterprises just don't have that luxury, or more simply they just don't have the value to hire stellar developers and then also have those stellar developers work in support.  When you are continually tuning and improving apps the way Google does that makes sense; when you tend to deliver into production and hand over to a support team it makes much less sense.

So there are certainly things that enterprises can learn from Google and Yahoo, but it isn't true to say that all enterprises will go that way; enterprises have different challenges and some of them are arguably significantly harder than system performance as they impact culture.  So Hadoop is great, it's a good tool, but just because Google and Yahoo use it doesn't mean enterprises will adopt it in the same way, or indeed that the model taken with Google and Yahoo is appropriate.  We've already seen NoSQL become more SQL in the Hadoop world, and we'll continue to see more and more shifts away from the 'pure' Hadoop Map Reduce vision as enterprises leverage the economies of scale but do so to solve a different set of challenges and, crucially, a different culture and value network.

Google and Yahoo are greenfield companies, built from the ground up by IT folks.  They have it easy in comparison to the folks trying to marshal 20 business divisions, each with their own sales and manufacturing folks, and 40 ERPs and 100+ other systems badly connected around the world.



Thursday, April 25, 2013

The Hadoop hump - why enterprises struggle to move from Proof of Concept to Enterprise deployment

At the recent Hadoop Summit in Amsterdam I noticed something that has been bothering me for a while.  Lots of companies have done some great Proofs of Concept with Hadoop but they are rarely turning those into fully-blown operational solutions.  Being clear, I'm not talking about the shiny, shiny web companies where the business is technology and the people who develop are the people who support; I'm talking about those dull companies that make up 99.9% of businesses out there, where IT is part of the organisation and support is normally done by separate teams.

There are three key reasons for this Hadoop hump

  1. Hadoop is addressing problems in the BI space, but is a custom build technology
  2. Hadoop has been created for developers not support
  3. BI budgets are geared to vertically scaled hardware
These reasons are about people, not technologies.  Hadoop might save you money on hardware and software licenses, but if you are moving from report developers in the BI space to Map Reduce/R people in Hadoop, and most critically requiring those same high-value people in support, it's the people costs that prevent Hadoop being scaled.  The last one is a mental leap that I've seen BI folks struggle to make: they are used to going 'big box', and talk of horizontal scalability and HDFS really doesn't fit with their mindset.

These are the challenges that companies like Cloudera, Pivotal and Hortonworks are going to have to address to make Hadoop really scale in the enterprise.  It's not about technical scale, it's about the cost of people.

Friday, March 22, 2013

Why NoSQL became MORE SQL and why Hadoop will become the Big Data Virtual Machine

A few years ago I wrote an article about "When Big Data is a Big Con" which talked about some of the hype issues around Big Data.  One of the key points I raised was about how many folks were just slapping Big Data badges on the same old same old; another was that Map Reduce really doesn't work the way traditional IT estates behave, which was a significant barrier to entry for Hadoop as a new technology.  Mark Little took this idea and ran with it on InfoQ in Big Data Evolution or Revolution?  Well, at the Hadoop Summit in Amsterdam this week the message was clear...
SQL is back, SQL is key, SQL is in fact the King of Hadoop
Part of me is disappointed in this.  I've never really liked SQL and quite liked the LISPiness of Map Reduce but the reason behind this is simple.
When it comes to technology adoption it's people that are key, and large-scale adoption means small-scale change
Think about Java.  A C language (70s concept) derivative running on a virtual machine (60s), using some OO principles (60s), with a kickass set of libraries (90s).  It exploded because it wasn't a big leap, and I think we can now see the same sort of thing with Hadoop now that it's stopped with purity and gone for the mainstream.  Sure there will be some NoSQL pieces out there and Map Reduce has its uses, but it's this change towards using SQL that will really cause Hadoop usage to explode.

What is good however is that the Hadoop philosophy remains intact; this isn't the Java SE 6 debacle where aiming at the 'Joe Six-pack' developer resulted in a bag of mess.  This instead is about retaining that philosophy of cheap infrastructure and massive-scale processing but adding a more enterprise-friendly view (not developer friendly, enterprise friendly), and it's that focus which matters.

Hadoop has the opportunity to become the 'JVM of Big Data' but with a philosophy that the language you use on that Big Data Virtual Machine is down to your requirements and most critically down to what people in your enterprise want to use.

It's great to see a good idea grow by taking a practical approach rather than sticking to flawed dogma.  Brilliant work from the Hadoop community, I salute you!