Friday, April 19, 2013

Data Snow Blindness

I am taking a break from everything else that’s going on today to tell the story of an error in data analysis and presentation, seriously affecting the strength of the argument being made, that to me is jaw-dropping – not the error itself, but people’s reaction to its being pointed out.  And that reaction is:
Nothing.
Seriously.  Nothing.  Nothing from commenters on either blog where the original analysis and the reactions to it are prominently featured (Weisenthal and Krugman).  Nothing from the bloggers themselves. No change in the analysis as presented on the blog.  Dead silence. Debate continues to be waged on the basis of the uncorrected blog data.
I have racked my brain as to why this might be.  The best reason I can think of is that most people are so focused on the “bright” parts of the data as presented, the fact that the two measures involved show a clear relationship over time, that they ignore the fact that one of the two measures is not quite the correct one to use.  It is, if you will, an example of blindness like that caused by trying to look ahead into a landscape of snow fiercely reflecting the sun – the well-known phenomenon of “snow blindness.”  Data snow blindness.  And, as in the case of the other snow blindness, you become truly blind – you simply don’t notice the information indicating that the analysis is off.
The Story
The story begins when Weisenthal reacts to an ongoing debate on the merits of investment in gold by producing a couple of graphs comparing the value of the S&P 500 index over time (one starting in 1979, the other in 2007, both running to the present) to the market price of gold ($/troy ounce). His point – a valid one – is that even with recent fluctuations in gold’s price, investment in stocks outperforms investment in gold over time.
Now, as I found out in doing my post-MIT-Sloan-School-of-Management research on investment theory for my own use, the S&P 500 is a price-only measure.  That is, the typical value of the S&P 500 that everyone quotes does not include the dividends that the companies issue over time, and it doesn’t include the reinvestment of those dividends.  As far as I can tell from observations over the past 15 years as well as from earlier data, the dividends themselves average about 2.3% per year, and reinvestment adds another 0.1% or so per year.  (The “geometric mean” of returns, the correct way to measure, suggests that at least until about 2007, a period of maybe 90 years, the growth of the S&P 500 was about 10.8% per year.  Reinvestment of dividends over the course of a year would therefore yield about 2.3% times 10.8% times ½ [to reflect the fact that dividends don’t all arrive at the start of the year], or about 0.12%.)
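For the arithmetically inclined, here is a minimal back-of-the-envelope sketch of that reinvestment figure, using the rough numbers above (the percentages are approximations, not precise market data):

# Back-of-the-envelope check of the reinvestment figure cited above.
# Rough assumptions: ~2.3% annual dividend yield, ~10.8% annual price growth,
# dividends arriving evenly through the year (hence the factor of 1/2).
dividend_yield = 0.023   # average S&P 500 dividend yield (approximate)
annual_growth  = 0.108   # long-run geometric-mean price growth (approximate)

# Dividends paid mid-year, on average, grow for only about half a year,
# so the extra return from reinvesting them within the year is roughly:
reinvestment_gain = dividend_yield * annual_growth * 0.5
print(f"Extra return from within-year reinvestment: {reinvestment_gain:.2%}")
# -> roughly 0.12% per year, on top of the ~2.3% from the dividends themselves
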
If you don’t believe my assessment, go look at the S&P web page where they report S&P 500 values and returns over time.  Above the regular report is a measure of “total return.”  If you look at the description of that return, you find that “total return” does indeed compute return with dividends included (it appears that dividend reinvestment is also included, but I’m not sure about that).  They do it over a much longer period than a year, so at this point the TR index is almost twice the regular S&P 500 index.
If you add in this adjustment, the result of the analysis changes significantly.  In Weisenthal’s original graph (normalized so that the 2013 value equals 1), the ratio of the S&P 500 to the gold price goes from about 0.1 in 1979 to 1 in 2013. Accepting the same starting point, my computed ratio goes from about 0.1 in 1979 to 2.12 in 2013.  A sharp drop in the ratio (reflecting a dubious “flight to safety”) from 2007 to 2009 becomes a drop from about 6 to 2, not a drop from 5 to 0.7. And the ratio hasn’t been flat or dropping since then; it’s been climbing by about 6%.  Stocks don’t just beat gold over long periods of time – they beat gold over the short and medium term pretty consistently, and over the long term by a huge amount – try 21.2 times as much.
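As a rough sanity check on that 2.12 figure, here is an illustrative compounding of a ~2.3% dividend yield over the 34 years involved (the yield is an approximation, and real-world dividends varied year to year):

# Illustrative only: compound an approximate 2.3% annual dividend yield over
# 1979-2013 to see how much a price-only S&P 500 understates total return.
years = 2013 - 1979          # 34 years
dividend_yield = 0.023       # rough long-run average, per the discussion above

adjustment = (1 + dividend_yield) ** years
print(f"Total-return adjustment over {years} years: about {adjustment:.2f}x")
# -> roughly 2.2x, in the same ballpark as the 2.12 ratio computed above
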
So I posted this in a comment on Weisenthal’s blog and, getting no response, on Krugman’s blog.  As noted, no one took notice (I would have posted my comments in caps, but it’s not netiquette to scream).  In fact, the debate in the comments proceeded as if the original Weisenthal graphs were the issue:  Is the government understating inflation (the “gold bugs’” claim), so that another gold price surge is coming once that becomes clear?  Is the advantage of stocks over gold clear enough, or would an even further surge erase it?
Data snow blindness.
Implications For Y’All
At this point, I have to distinguish this from other sources of problematic analyses that have happened recently – e.g., the Reinhart-Rogoff controversy in economics (which apparently revolved partly around a coding error) and the London-trader miscomputation of risk (partially a human Excel miscalculation).  Those are not really a problem of not noticing that one of your sets of data points is not capturing well what you want it to capture – and they have been exhaustively investigated and debated.
For another example of data snow blindness, I’d like to go back to investments again – the idea of 401(k)s (also applicable to IRAs).  What no one seems to be pointing out is that the expense ratios of the funds in those 401(k)s are quite high – I believe they’re still above 2%.  If your employer isn’t contributing to your 401(k), this means that you must balance the gain 20 years from now in lower taxes when you cash out against the loss from not putting an equivalent amount in a Vanguard S&P 500 index fund with an expense ratio of 0.1%.  Even if you pay zero taxes when you cash out, somewhere past a 10-to-20-year dividing line, you may well lose money on your IRA/401(k) compared to the alternative.  And that’s true of 401(k) bond investments as well (vs. bond index funds). Or so the data suggests – but no one seems to notice this enough to discuss it.
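To make the tradeoff concrete, here is a deliberately simplified sketch.  It assumes an 8% gross return and a 15% tax on the taxable account’s gains at withdrawal, and it ignores the upfront tax deduction and employer matches entirely; all of those are my illustrative assumptions, not data from the discussion above:

# Hypothetical comparison (not advice): a high-fee account that is tax-free at
# withdrawal vs. a low-fee taxable index fund. All figures are illustrative.
def final_value(gross_return, expense_ratio, years, tax_on_gains=0.0):
    value = (1 + gross_return - expense_ratio) ** years   # growth net of fees
    gain = value - 1
    return value - gain * tax_on_gains                    # tax only the gains

years = 20
gross = 0.08                                              # assumed gross return

high_fee_tax_free = final_value(gross, 0.02, years)                     # ~2% fees
low_fee_taxed     = final_value(gross, 0.001, years, tax_on_gains=0.15) # ~0.1% fees

print(f"High-fee, tax-free at withdrawal: {high_fee_tax_free:.2f}x")
print(f"Low-fee, 15% tax on gains:        {low_fee_taxed:.2f}x")
# With these assumptions the compounding fee drag overwhelms the tax break
# over 20 years, which is the crossover effect described above.
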
Here are a few more:  stock risk vs. bonds and everything else. If you have your money in a US S&P 500 index fund, what is your risk?  Over any 20-year period, the stock market as a whole has outperformed any other investment – including inflation.  But what if the stock market collapses drastically and stays collapsed?  If you think about it, that would mean that the US government has collapsed, since it’s the government that, by various mechanisms, insures the banks underpinning economic investment. So the risk of collapse of the stock market’s 500 largest members is pretty much the same as the risk of the US collapsing – in which case, without that government backing, your money is likely to be worthless (and so are your gold coins). So why are you “diversifying” beyond the stock market, again?  If you’re planning to start drawing your money down or to keep it level within the next 20 years, then some amount of, say, a bond index fund or inflation-protected securities (TIPS) that will keep up with inflation is fine; but the reason for doing anything beyond that is not as clear as it might seem.  Data snow blindness.
How about stock investment returns? Today mutual fund companies compare their results to the S&P 500 – is that the regular S&P 500 or the total-return one?  Do they include their expense ratios – above 2% until recently, now (afaik) around 1.5% – and do you compare them to Vanguard’s 0.1% and the Fidelity Spartan 0.2% (plus the bit they withhold in cash, which right now earns effectively zero)?  There’s a reason why those index funds outperform around 70% of all other stock investments over a 10-year period, and probably close to 90% over a 30-year period.
In other words, the real implication of data snow blindness is that it is probably hitting you right in the wallet right now – and not necessarily because of your own blindness, since everyone else seems to be suffering from it too. Or almost everyone else … Gee, I wonder why the Vanguard S&P 500 index fund is one of the two most popular stock investments today?
Anyway, please think about it.  Me, I’m going to go off and check myself for further signs of data snow blindness. 

Sunday, April 14, 2013

SMR: A Miss Is As Good As Half A Mile


I am used to my favorite business magazine, Sloan Management Review, being light-years ahead in usefulness compared to many others, especially as regards the computing industry.  However, in reading several articles in a row in the Winter 2013 edition, I was struck by a strange discomfort.  After careful thought, I think I have identified the reason: it was the fact that the writers were identifying important things to consider, and completely ignoring other crucial things, without which the analysis was far less accurate and useful. They weren’t miles from reality – what they described really was there – but they seemed at least half a mile away.
So let me go through them, one by one.  I hope that at least my critique will help to close that half a mile in my mind or someone else’s.

Apple Did Not Introduce the Desktop Metaphor


The first article (“How to Use Analogies to Introduce New Ideas”) argues that using analogies effectively can be key to introducing new technologies to the market successfully, and that analogies that stress the familiar and analogies that stress “the novel” should be used appropriately, depending on the technology.
In making this argument, the authors use as their first example Apple’s use of the desktop metaphor for its Mac user interface.  While the article doesn’t explicitly say so, the implication is that its customers were unfamiliar with the desktop metaphor (with its files and folders) before the Mac.  That just ain’t so.  I was there.
I was a programmer in the late ‘70s when word processors first introduced the desktop metaphor.  And, in fact, it took a lot of hard arguing before the people trying to sell the metaphor realized that it was a bad idea to have file cabinets and files, instead of files nested within folders nested within folders.  But by the end of the ‘70s, the idea had taken hold, and it was adopted by PC operating systems before the Mac arrived in 1984. 
So why does this matter to the authors’ argument?  The point is that the reason the Mac succeeded was not because it used a familiar analogy to introduce a novel idea – it didn’t.  The novelty in the Mac operating system (although we should really include the Lisa in this account) was the use of object-oriented programming to rapidly produce an icon-based visual interface in which one operated by point and click, drag and drop.  Yes, analogies matter – the experience of the word processor folks shows that.  But, by the same evidence, the usefulness of the new technologies matters equally, whether an analogy is used to grease the skids or not. 
I worry that people will read this article and say, oh, all I need to do in introducing a new technology is make people comfortable with it, or attracted to it, by adding the right analogy.  On the contrary:  I would argue that whenever you do that, you should also work hard at ensuring that the technology is easy to use and useful.  Think about the introduction of the iPhone – little in the way of analogy, loads in the way of demonstration.  That was a novel technology to many – but once seen, very intuitive.  No analogy needed; but the hard work of making it usable – which, imho, was why Jobs succeeded where previous iterations of very similar technology failed – was critical to market success.

What the Future May Bring Is Not Just About Limits to Growth


The next article describes a new book making gloomy forecasts about the next 40 years based on system dynamics (I had a blog post on this vs. agility a while back).  The book’s author argues that the future is primarily determined by the fact that we are overstretching our resources, and that therefore we will progressively be trying to grow more and more with less and less to grow with, and thus with greater and greater starvation, pollution deaths, and other semi-inevitable results.
The problem with this analysis is that he seems to completely fail to understand the science and trends of climate change.  Climate change is not a matter of overstretched resources; it is a matter of a carbon-spewing system running on its own momentum, with many of the disasters ahead already baked in, unaffected by some systems-dynamics shrinkage of population and reduction of resource usage to sustainable levels.  To put it bluntly:  You could shrink the population to one billion right now and reduce some resource usage accordingly, but if you don’t over the next 17 years shrink use of oil, coal, and natural gas by 80-90% from today’s levels and keep it there for at least 200 years, you in all likelihood will still get huge losses of natural resources like farmland from sea-level rise and drought, and billions of deaths from starvation, not to mention the possibility of poisoned air related to ocean acidification.
Frankly, I find this omission distressing, because the book’s author (Randers) is apparently an expert on business and sustainable development, not to mention a professor of “climate strategy.”  If this is what the sustainability movement is typically aiming for, then it is in serious trouble – its goal is not even “sustainable,” since keeping resource use within the capacity of the earth will matter little to businesses if climate change shrinks resources such as food well below that capacity.  To put it another way:  first get carbon under control, then talk to me of overstretch.  Zero carbon emissions will at the very least drastically reduce our consumption not only of oil and coal but also of related resources; reduction of population and/or generic resource use from 5 billion people-equivalents to 1 billion will likely have relatively little effect on oil and coal usage – because that’s not the mole you’re trying to whack.
Randers’ approach, in my strongly held view, would take the sustainability movement down a side track at the moment we can least afford to lose focus.  Please, folks, think about this hard.

Sometimes, Multiple Sizes Do Not Fit All


The next article, “When One Size Does Not Fit All”, argues that companies must choose carefully in supply chain management between focusing on operational efficiency and operational responsiveness (to customers). Unfortunately, the example the authors use is Dell within the last five years, as it switches from its tried-and-true non-retail, consumer-customer, rapid-delivery PC model to servicing several types of customer (e.g., businesses) through several types of outlet (e.g., retail) with several types of product (e.g., servers).  The authors argue that the changeover has been a success, once Dell got its act together in developing a different focus for different customers and embedding it in the supply chain.
Unfortunately for their story, I had an actual experience with Dell at about the time of the changeover, about three years ago, and my experience makes me question whether Dell really is an example of a success.  Specifically, I ordered a laptop in late November, assuming that (as Dell had always consistently done in the past), I would get it well before Christmas.  On the contrary:  I believe that I got it in early January.  I was in shock.  And yet, the authors’ account seems to imply that there was nothing wrong with Dell’s traditional model at the time of the changeover.
Another example is Dell’s approach to printers. The authors do not even mention printers as a factor in the consumer business, retail or otherwise.  And yet, for a long time, the Dell approach to printers has been an irritant to me.  As I remember, at least for part of the time, Dell only offered Dell inkjets with its Dell PCs and laptops.  That’s all very well, but inkjets need replacement cartridges frequently, and Dell would have you ordering its cartridges online, instead of letting you pick them up at all sorts of retail stores, as HP does.  And when the delivery times start going south …
The point, to me, is that doing each supply chain right as it evolves is just as important as applying the right supply chain to the right customer.  And, I believe, PC World surveys of customer satisfaction bear me out: Dell’s satisfaction ratings in the consumer market, retail or online, have gone downhill and stayed there.  So the prime finding of the article, fit the right “size” of supply chain to the customer, appears to really miss the mark.  What the Dell example tells me is that you had better evolve each supply chain appropriately and keep it working well as the products offered proliferate, or it won’t matter how well your supply chains fit the customer.

Final Thoughts


I could pick nits with the next two articles (I really don’t think focusing on “likes” in Facebook is the most productive way to do brand management, and the idea of “cloud” outsourcing seems to leave out minor [sarcasm] factors like knowledge of the market among those tapped for these projects), but they are much less of a frustrating experience in which the authors seem headed in the right direction, only to end in a big miss.  They seem to be off by a few feet, not half a mile.
So I guess my final thought is this.  Especially if you’re focusing on past history, it’s very important to get an inclusive global picture, and make sure your real-world examples don’t tell you anything different once you look at them closely.  There’s a lot of good work in the articles I cited, and yet I’m not convinced their overall impact, if taken seriously, will be positive at all.  Folks, let’s all up our games.  And caveat lector.

 

Monday, April 8, 2013

IBM Information Management’s BLU Acceleration: The Beginning of a Revolution




I have now reviewed IBM’s new Big Data effort, BLU Acceleration, and my take is this:  Yes, it will deliver major performance enhancements in a wide variety of specific Big Data cases – and yes, I do view their claim of 1,000-times acceleration in some cases as credible – but the technology is not a radical, revolutionary departure. Rather, it marks the evolutionary beginning of a revolutionary step in database performance and scalability that will be applicable across most Big Data apps – and data-using apps in general.

What follows is my own view of BLU Acceleration, not IBM’s. Click and Clack, the hosts of NPR’s automotive repair show Car Talk, used to preface their shtick with “The views expressed on this show are not those of NPR …” or, basically, anyone with half a brain. Similarly, IBM may very well disagree with my key takeaways, as well as with my views on the future directions of the technology.

Still, I am sure that they would agree that a BLU Acceleration-type approach is a key element of the future direction of Big Data technology. I therefore conclude that anyone who wants to plan ahead in Big Data should at least kick the tires of solutions featuring BLU Acceleration in them, to understand the likely immediate and longer-term areas in which it may be applied. And if, in the process, some users choose to buy those solutions, I am sure IBM will be heartbroken – not.

The Rise of the Register


Database users are accustomed to thinking in terms of a storage hierarchy – main memory, sometimes solid-state devices, disk, sometimes tape – that allows users to get 90% of the performance of an all-main-memory system at 11% of the cost. There is, however, an even higher level of “storage”:  the registers in a processor (not to mention the L1 and other caches in that processor). There, too, the same tradeoffs apply: registers operate at tens to a thousand times the speed of the usual cycle of loading a piece of data from main memory, breaking it into parts in order to apply the basic operations that make up a transactional operation, and returning it to main memory.

The key “innovation” of BLU Acceleration is to load entire pieces of data (one or multiple columns, in compressed form) into a register and apply basic operations to it, without needing to decompress it or break it into parts. The usual parallelism between registers via single-instruction-multiple-data-stream techniques and cross-core parallelism adds to the performance advantage. In other words, the speed of the transaction is gated, not by the speed of main memory access, but by the speed of the register.
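
As a loose analogy (my own sketch, not IBM’s implementation), here is what “operating on compressed columns without decompressing them” looks like with a dictionary-encoded column, with vectorized NumPy operations standing in for the SIMD register operations described above:

import numpy as np

# A "column" of city names, dictionary-encoded as small integers.
dictionary = np.array(["Boston", "Chicago", "Denver"])
encoded_column = np.array([0, 2, 1, 0, 0, 2, 1, 0], dtype=np.uint8)

# To find rows where city == "Boston", encode the predicate once...
target_code = np.where(dictionary == "Boston")[0][0]

# ...then compare directly against the compressed codes, never materializing
# the decompressed strings; one vectorized pass handles all the rows at once.
matches = encoded_column == target_code
print(matches.nonzero()[0])   # -> rows 0, 3, 4, 7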

Now, this is not really revolutionary – we have seen similar approaches before, with bit-mapped indexing. There, data that could be represented as 0s and 1s, such as “yes/no” responses (effectively, a type of columnar storage), could be loaded into a register and basic “and” and “or” operations could be performed on it. The result? Up to 1,000 times speedup for transactions on those types of data. However, BLU Acceleration is able to do this on any type of data – as of now, so long as that data is represented in a columnar format.
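
For readers who haven’t run into bit-mapped indexing, a minimal sketch of the idea (entirely illustrative):

# Yes/no columns become bitmaps: bit i set means row i is "yes" for that attribute.
is_subscriber = 0b10110101   # rows 0, 2, 4, 5, 7 (reading right to left)
is_overdue    = 0b01100110   # rows 1, 2, 5, 6

# "Subscribers who are overdue" is a single bitwise AND over the bitmaps,
# operating on many rows per machine word instead of row by row.
both = is_subscriber & is_overdue
matching_rows = [i for i in range(8) if both & (1 << i)]
print(matching_rows)   # -> [2, 5]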

Exploring the Virtues of Columnar “Flat Storage”


And here we come to a fascinating implication of BLU Acceleration’s ability to do register-speed processing on columnar data: it allows a columnar-format storage and database to beat an equivalent row-oriented relational storage and database over most of today’s read-only data processing – i.e., most reporting and analytics.

As of now, pre-BLU-Acceleration, the rule of thumb for choosing between columnar and row-oriented relational technology in data warehousing is that if more than one or two columns in a row need to be read in a large-scale transaction, row-oriented performs a little better than columnar. This is because any speed advantage from columnar’s greater data compression is more than counterbalanced by its need to seek back and forth across a disk for each needed column (columns that are physically stored together in row-oriented storage). However, BLU Acceleration’s shift in emphasis to registers means that the key to its performance is main memory – and main memory is “flat” storage, in which columns can be loaded into the processor simultaneously without the need to seek.
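
The point about flat storage can be seen in a toy layout comparison (my illustration, not a benchmark): summing one column touches only that column’s bytes in a columnar layout, while a row layout drags every other field through the cache to get the same answer.

import numpy as np

n = 1_000_000

# Row-oriented: each record carries all of its fields together (17 bytes/row here).
rows = np.zeros(n, dtype=[("id", np.int64), ("amount", np.float64), ("flag", np.int8)])

# Column-oriented: each field is its own contiguous array (8 bytes/row for amounts).
amounts = np.zeros(n, dtype=np.float64)

# Same answer either way, but the columnar scan reads roughly half the bytes,
# and in flat main memory there are no disk seeks to pay for in either case.
total_rowwise  = rows["amount"].sum()
total_columnar = amounts.sum()
print(total_rowwise == total_columnar)   # -> True (both zero in this toy example)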

Moreover, one aspect of solid-state disk is that it is really “flat” storage (main-memory-type storage that is slower than main memory but stores the data permanently), sometimes with a disk-access “veneer” attached. In this case, the “veneer” may not be needed; and so, if everything can be stored in a gigabyte of main memory plus a terabyte of solid-state disk, BLU-Acceleration-type columnar beats or matches row-oriented just about every time.

This is especially true because now there is very little need for “indexing” – and so, BLU Acceleration claims to eliminate the complexities of indexing entirely (actually, it apparently does contain an index that gives each column a unique ID). Remember, the purpose of indexing originally in databases was to allow fast access to multiple pieces of data that were mixed and scrambled across a disk – “flat” storage has little need for these things.

A side-effect of eliminating indexing is yet more performance. Gone is the time-consuming optimizer decision-making about which index to use to generate the best performance, and the time-consuming effort to tune and retune the database indexing and storage to minimize sub-optimal performance. By the way, this also raises the question, which I will return to later, as to whether a BLU Acceleration database administrator is needed at all.

Now, there still remain, at present, limits to columnar use, and hence to BLU Acceleration’s advantages. IBM’s technology, it seems, has not yet added the “write to disk” capabilities required for decent update-heavy transactional performance. Also, in very high end applications requiring zettabytes of disk storage, it may well be that row-oriented relational approaches that avoid added disk seeks can compete – in some cases. However, it is my belief that in all cases except those, BLU-Acceleration-type columnar should perform better, and row-oriented relational is not needed.

And we should also note that BLU Acceleration has added one more piece of technology to tilt the scale in columnar’s favor: column-based paging. In other words, to load from disk or disk-veneer SSD storage into main memory, one swaps in a “page” defined as containing one column – so that the speed of loading columns into memory is increased.

The Implications of Distributed Direct-Memory Access


It may seem odd that IBM brought a discussion of its pureScale database clustering solution into a discussion of BLU Acceleration, but to me, there’s a fundamental logic to it that has to do with high-end scalability. Clustering has always been thought of in terms of availability, not scalability – and yet, clustering continues to be the best way to scale up beyond SMP systems. But what does that have to do with BLU Acceleration?

A fundamental advance in shared-disk cluster technology came somewhere around the early ‘90s, when someone took the trouble to figure out how to load-balance across nodes. Before that, a system would simply check if an invoked application was on the node that received the invocation, and, if not, simply use a remote procedure call to invoke a defined copy of that application (or a piece of data) on another node. The load-balancing trick simply figured out which node was least used and invoked the copy of the application on that particular node. Prior to that point, clusters that added a node might see added performance equivalent to 70% of that of a standalone node. With Oracle RAC, an example of load balancing, some users reported perhaps 80% or a bit above that.
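
In pseudocode terms, the trick amounts to something like the following (a toy illustration of the general idea, not how any particular cluster product implements it):

import random

# Hypothetical cluster state: per-node utilization and which nodes host which app.
node_load  = {"node1": 0.72, "node2": 0.35, "node3": 0.58}
app_copies = {"billing": ["node1", "node2", "node3"], "reports": ["node2", "node3"]}

def naive_dispatch(app, receiving_node):
    # Pre-load-balancing behavior: run locally if possible, otherwise pick any remote copy.
    hosts = app_copies[app]
    return receiving_node if receiving_node in hosts else random.choice(hosts)

def balanced_dispatch(app):
    # Load-balanced behavior: always send the work to the least-used node hosting a copy.
    return min(app_copies[app], key=lambda node: node_load[node])

print(naive_dispatch("billing", "node1"))   # -> node1, even though it is the busiest
print(balanced_dispatch("billing"))         # -> node2, the least-loaded copy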

It appears that IBM pureScale, based on the mainframe’s Parallel Sysplex architecture, takes that load-balancing trick a bit further: it performs the equivalent of a “direct memory access” to the application or data on the remote node. In other words, it bypasses any network protocols (or, if the app/data is really in the node’s main memory, storage protocols) and goes directly to the application or data as if it was on the local system’s main memory. Result: IBM is talking about users seeing greater than 90% scalability – and I find at least upper 80% scalability something that many implementations may reasonably expect.

Now, let’s go back to our “flat storage” discussion. If the remote direct-memory access really does access main memory or no-veneer solid-state disk, BLU Acceleration’s columnar approach should again best row-oriented technologies, but on a much larger scale. That is, BLU Acceleration plus a pureScale cluster should see raw Big-Data performance advantages as high as the individual nodes will scale, losing less than 10% of a node’s worth of performance for each node added beyond that – and now we’re talking thousands of processors and tens of thousands of virtual machines.

And there’s another, more futuristic implication of this approach. If one can apply this kind of “distributed direct-memory access” in a clustered situation, why not in other situations, server-farm grids, for example, or scale-out within particular geographically-contiguous parts of a cloud? There is no doubt that bypassing network and storage protocols can add yet more performance to the BLU Acceleration approach – although it appears that IBM has not yet claimed or begun to implement this type of performance improvement with the technology.

The Wild BLU Yonder


And yet, I have said that BLU Acceleration is not revolutionary; it’s the beginning of a revolution. For the fact is that most of the piece parts of this new technology with mind-blowing performance have already been out there with IBM and others for some time. In-memory databases have long probed flat-storage data processing; IBM has actually seemed until now to be late to the game in columnar databases; I have already noted how bit-mapped indexing delivered thousand-fold performance improvements in certain queries a decade ago. IBM has simply been the first to put all the pieces together, and there is nothing to prevent others from following suit eventually, if they want to.

However, it is also true that IBM appears to be the first major player to deliver on this new approach, and it has a strong hand to play in evolving BLU Acceleration. And that is where the true revolution lies: in evolving this technology to add even more major performance improvements to most if not all data processing use cases. Where might these improvements lie?

One obvious extension is decoupling BLU Acceleration from its present implementation in just two database platforms – DB2, where it delivers the above-noted data warehousing and Big Data performance advantages, and Informix, where it allows an optimizer to feed appropriate time-series analyses to a separate group of servers. This, in turn, would mean ongoing adaptation to multi-vendor-database environments.

Then, there are the directions noted above: performance improvements achieved by eliminating network and storage protocols; extensions to more cases of solid-state disk “flat storage”; addition of update/insert/delete transactional capabilities to at least deliver important performance improvements for “mixed” update/query environments like Web sites; and the usual evolution of compression technology for cramming even more columns into a register.

What about the lack of indexing? Will we see no more need for database administrators? Well, from my point of view, there will be a need for non-flat storage such as disk for at least the medium term, and therefore a need to flesh out BLU Acceleration and the like with indexing schemes and optimization/tuning for disk/tape. Then, of course, there is the need for maintaining data models, schemas, and metadata managers – the subject of an interesting separate discussion at IBM’s BLU Acceleration launch event. But the bulk of present-day administrative heavy lifting may well be on its way out; and that’s a Good Thing.

There’s another potential improvement that I think should also be considered, although it sounds as if it’s not on IBM’s immediate radar screen. When that database transaction is loaded into a register, basic assembler and/or machine-code instructions like “add” and “nor” operate on it. And yet, we are talking about fairly well-defined higher-level database operations (like joins). It seems to me that identifying these higher-level operations and adding them to the machine logic might give a pretty substantial additional performance boost for the hardware/compiler vendor that wishes to tackle the process. Before, when register performance was not critical to Big Data performance, there would have been no reason to do so; now, I believe there probably is.

The IT Bottom Line


Right now, the stated use cases are to apply DB2 or Informix with BLU Acceleration to new or existing Big Data or reporting/analytics implementations – and that’s quite a range of applications. However, as noted, I think that users in general would do well to start to familiarize themselves with this technology right now.

For one thing, I see BLU Acceleration technology as evolving in the next 2-3 years to add a major performance boost to most non-OLTP enterprise solutions, not just Big Data. For another, multi-vendor-database solutions that combine BLU Acceleration columnar technology with row-oriented relational technology (and maybe Hadoop flat-file technology) are likely to be thick on the ground in 2-3 years. So IT needs to figure out how to combine the two effectively, as well as to change its database administration accordingly. By the way, these are happy decisions to make: Lots of upside, and it’s hard to find a downside, no matter how you add BLU Acceleration.

There’s a lot more to discuss about IBM’s new Big Data solutions and its strategy. For IT users, however, I view BLU Acceleration as the biggest piece of the announcement. I can’t see IBM’s technology doing anything other than delivering major value-add in more and more business-critical use cases over the next few years, whether other vendors implement it or not.

So get out there and kick those tires. Hard.

 

Wednesday, April 3, 2013

Thoughts on Big Data and Data Governance


I want to start this piece by giving the most important take-away for IT readers:  they should take care that data governance does not get in the way of Big Data, rather than the other way around.

This may seem odd, when I among others have been pointing out for some time that better data cleansing and the like are badly needed in enterprise data strategies in general. But data governance is not just a collection of techniques – it’s a whole philosophy of how to run your data-related IT activities.  Necessarily, the IT department that focuses on data governance emphasizes risk – security risk, risk of bad data, risk of letting parts of the business run amok in their independence and create a complicated tangle of undocumented data relationships.  And that focus on risk can very easily conflict with Big Data’s focus on reward – on proactive identification of new data sources and digging deeper into the relationships between the data sources one has, in order to gain competitive advantage.

While there is not necessarily clear evidence showing that over-focus on data governance can impede Big Data strategies and thereby the success of the organization, there is some suggestive data. Specifically, a recent Sloan Management Review article reported that the least successful organizations were those that focused on using Big Data analytics to cut costs and optimize business processes, while the most successful focused their Big Data analytics on understanding their customers better and using that understanding to drive new offerings.  Data governance, as a risk-focused philosophy, is also a cost-focused and internally-focused strategy.  The task of carefully defining and controlling metadata seeks to cut the costs of duplicated effort and unnecessary bug fixes inherent in line-of-business Wild-West data-store proliferation. It therefore can constrain the kind of proliferating use of new externally-generated data types, like social-media data, that yields the greatest Big-Data success for the enterprise.

Who’s To Be Master?

So, if we need to take care that data governance does not interfere with Big Data efforts, and yet things like data cleansing are clearly valuable, how can we coordinate the two better?  I often find it useful in these situations to model the enterprise’s data handling as a sausage factory, in which indescribable pieces of data “meat” are ground together to produce informational “sausage”.  I like to think of it as having six steps (more or less):

*      Data entry – in which the main aim is data accuracy
*      Data consolidation – in which we strive for consistency between the various pieces of data (accuracy plus consistency, in my definition, equals data quality)
*      Data aggregation – in which we seek to widen the scope of users who can see the data
*      Information targeting – in which we seek to make the data into information fitted to particular targeted users
*      Information delivery – in which we seek to get the information to where it is needed in a timely fashion
*      Information analysis – in which we try to present the information to the user in a format that allows maximum in-depth analytics.

Note that data governance as presently defined appears to affect only the first two steps of this process. And yet, my previous studies of the sausage factory suggest that all of the steps should be targeted: improving only the first two will offer just minor gains in a process that tends to “lose” ¾ of the valuable information along the way, with each step losing quite a bit more.

How does this apply to Big Data?  The most successful users of Big Data, as noted above, actively seek out external data that is dirty and unconsolidated and yet is often more valuable than the organization’s “tamed” data.  Data governance, as the effective front end of the sausage factory, must therefore not exclude this Big Data in the name of data quality – it must find ways of making it “good enough” that it can be fed into the following four steps.  Or, as one particular database administrator told me, “dirty” data should not just be discarded, as it can tell us about what our sausage factory is excluding that we need to know.

Data governance should also not, if at all possible, interfere with the four steps following data quality assurance.  Widening scope widens security risks; but the benefits outweigh the risks. Information delivery that involves a new data type risks creating a “zone of ignorance” where database governors don’t know what their analysts are doing; but the answer is not to exclude the data type until that distant date when it can be properly vetted.

Much of this can be done by using a data discovery or data virtualization tool to discover new data types and incorporate them into an enterprise metadata store semi-automatically.  But that is not enough; IT needs to ensure that data governance accepts that Big Data exclusion is not an option, and that the aim is not pure data, but rather the best balance of valuable Big Data and data quality.

In one of the Alice in Wonderland books, a character uses the word “glory” in a very odd way, and Alice objects that he should not be allowed to.  “The question is,” the character replies, “who’s to be master, you or the word?”  In a similar way, users of data governance and Big Data need to understand that you, with your need for Big Data customer insights from the outside world, are to be master, not the data governance enforcer.