Friday, April 19, 2013

Data Snow Blindness

I am taking a break from everything else that’s going on today to tell the story of an error in data analysis and presentation, seriously affecting the strength of the argument being made, that to me is jaw-dropping – not the error itself, but people’s reaction to its being pointed out.  And that reaction is:
Nothing.
Seriously.  Nothing.  Nothing by commenters on either blog where the original analysis and reactions are prominently mentioned (Weisenthal and Krugman).  Nothing by the bloggers themselves. No change in the analysis as presented on the blog.  Dead silence. Debate continues to be waged based on the uncorrected blog data.
I have racked my brain as to why this might be.  The best reason I can think of is that most people are so focused on the “bright” parts of the data as presented, the fact that the two measures involved show a clear relationship over time, that they ignore the fact that one of the two measures is not quite the correct one to use.  It is, if you will, an example of blindness like the blindness caused by trying to look ahead into a landscape of snow reflecting the sun fiercely – the well-known phenomenon of “snow blindness.”  Data snow blindness.  And, as in the case of the other snow blindness, you become truly blind – you simply don’t notice the information indicating that the analysis is off.
The Story
The story begins when Weisenthal reacts to an ongoing debate on the merits of investment in gold by doing a couple of graphs comparing the value of the S&P 500 index over time (1979 and 2007 to now) to the value of gold in the markets ($/troy ounce). His point – a valid one – is that even with recent fluctuations in gold’s price, investment in stocks outperforms investment in gold over time.
Now, as I found out in doing my post-MIT-Sloan-School-of-Management research on investment theory for my own use, it turns out that the S&P 500 is a price-only measure.  That is, the typical value of the S&P 500 that everyone quotes does not include the value of dividends that the companies issue over time, and it doesn’t include the reinvestment of those dividends.  As far as I can tell from observations over the past 15 years as well as from previous data, the dividends themselves average about 2.3% per year, and the reinvestment adds another 0.1% per year (the “geometric mean” of returns, the correct way to measure, suggests that at least until about 2007 (a period of maybe 90 years), the growth in the S&P 500 was about 10.8% per year, which meant that reinvestment of dividends over the course of a year would yield about 2.3% times 10.8% times ½ [to reflect the fact that dividends don’t all occur at the start of the year], or about 0.12%).
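For anyone who wants to check the arithmetic, here is a minimal sketch of that back-of-envelope estimate. The variable names are made up for illustration, and the 2.3% and 10.8% inputs are just the rough averages quoted above, not authoritative figures.

```python
# Back-of-envelope estimate from the text; both rates are the rough
# averages quoted above, and the names are illustrative only.
dividend_yield = 0.023   # average dividends per year
price_growth = 0.108     # long-run yearly growth figure quoted in the text
mid_year_factor = 0.5    # dividends arrive throughout the year, not all at the start

# Extra return from reinvesting dividends during the year they are paid.
reinvestment_gain = dividend_yield * price_growth * mid_year_factor

# What the price-only S&P 500 figure leaves out each year.
omitted_per_year = dividend_yield + reinvestment_gain

print(f"reinvestment gain: {reinvestment_gain:.4%}")
print(f"omitted per year:  {omitted_per_year:.4%}")
```

The two printed values come to roughly 0.12% and 2.4% per year, which is the yearly gap between the price-only index and a total-return view under these assumptions.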
If you don’t believe my assessment, go look at the S&P web page where they report S&P 500 values and returns over time.  Above the regular report is a measure of “total return.”  If you look at the description of that return, you find that “total return” does indeed compute return with dividends included (it appears that dividend reinvestment is also included, but I’m not sure about that).  They do it over a much longer period than a year, so at this point the TR index is almost twice the regular S&P 500 index.
If you add in this adjustment, the result of the analysis changes significantly.  In Weisenthal’s original graph (made so that 2013 represents 1), the ratio of S&P 500 to gold price goes from about 0.1 in 1979 to 1 in 2013. Accepting the same starting point, my computed ratio goes from about 0.1 in 1979 to 2.12 in 2013.  A sharp drop in the ratio (reflecting dubious “flight to safety”) from 2007 to 2009 becomes a drop from about 6 to 2, not a drop from 5 to 0.7. It hasn’t been flat or dropping since then; it’s been climbing by about 6%.  Stocks don’t just beat gold over long periods of time – they beat gold over the short and medium term pretty consistently, and over the long term by a huge amount – try 21.2 times as much.
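To see why a small yearly omission matters so much over 34 years, here is a hedged sketch of the compounding involved. It assumes a constant 2.4% yearly shortfall (dividends plus reinvestment), which is a simplification: actual yields varied from year to year, so it lands in the same ballpark as the 2.12 figure above without reproducing it exactly. The function name is hypothetical.

```python
def total_return_multiplier(yearly_shortfall: float, years: int) -> float:
    """Factor by which a price-only index understates total return
    after `years`, assuming a constant yearly shortfall."""
    return (1.0 + yearly_shortfall) ** years

# 1979 to 2013 at an assumed constant 2.4% per year.
factor = total_return_multiplier(0.024, 2013 - 1979)

price_only_ratio_2013 = 1.0   # the original graph is scaled so that 2013 = 1
adjusted_ratio_2013 = price_only_ratio_2013 * factor

print(f"multiplier over 34 years: {factor:.2f}")
print(f"adjusted 2013 ratio:      {adjusted_ratio_2013:.2f}")
```

Under this constant-yield assumption the multiplier comes out a bit above 2, so a ratio that ends at 1 on a price-only basis roughly doubles once dividends are counted.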
So I posted this in a comment in Weisenthal’s blog and, getting no response, in Krugman’s blog.  As noted, no one took notice (I would have posted my comments in caps, but it’s not netiquette to scream).  In fact, the debate in the comments proceeded as if the original Weisenthal graphs were the issue – is the government understating inflation data (from “gold bugs”) and therefore there is another gold price surge to come once that becomes clear?  Is the advantage of stocks over gold clear enough, or would an even further surge erase it?
Data snow blindness.
Implications For Y’All
At this point, I have to distinguish this from other sources of problematic analyses that have happened recently – e.g., the Reinhart-Rogoff controversy in economics (which apparently revolved partly around a coding error) and the London-trader miscomputation of risk (partially a human Excel miscalculation).  Those are not really a problem of not noticing that one of your sets of data points is not capturing well what you want it to capture – and they have been exhaustively investigated and debated.
For another example of data snow blindness, I’d like to go back to investments again – the idea of 401(k)s (also applicable to IRAs).  What no one seems to be pointing out is that the expense ratios in those 401(k)s are quite high – I believe they’re still above 2%.  If your employer isn’t paying into your 401(k), this means that you must balance the gain 20 years from now in lower taxes when you cash them out with the loss you get from not sticking an equivalent amount in a Vanguard S&P 500 index fund with an expense ratio of 0.1%.  Even if you pay zero taxes when you cash out, somewhere over a 10-20-year dividing line, you may well lose money on your IRA/401(k) compared to the alternative.  And that’s true of 401(k) bond investments as well (vs. bond index funds). Or so the data suggests – but no one seems to notice this enough to discuss it.
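A rough sketch of the fee side of that tradeoff follows. It is only an illustration: the 7% gross return is an assumption picked for the example, the expense ratio is treated as a straight deduction from each year's return, and taxes and employer matching are left out entirely. The point is just to show how the gap between a roughly 2% fund and a 0.1% index fund widens with the holding period; that shortfall is the number to set against whatever tax saving you expect.

```python
def ending_value(start: float, gross_return: float,
                 expense_ratio: float, years: int) -> float:
    """Value after `years`, treating the expense ratio as a straight
    deduction from each year's gross return (a simplification)."""
    return start * (1.0 + gross_return - expense_ratio) ** years

GROSS = 0.07   # assumed gross yearly return, purely illustrative

for years in (10, 20, 30):
    high_fee = ending_value(10_000, GROSS, 0.02, years)    # ~2% expense ratio
    low_fee = ending_value(10_000, GROSS, 0.001, years)    # 0.1% index fund
    shortfall = 1.0 - high_fee / low_fee
    print(f"{years} years: high-fee {high_fee:,.0f}, "
          f"low-fee {low_fee:,.0f}, shortfall {shortfall:.1%}")
```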
Here are a few more:  stock risk vs. bonds and everything else. If you have your money in a US S&P 500 index fund, what is your risk?  Over any 20-year period, the stock market as a whole has outperformed any other investment – including inflation.  But what if the stock market collapses drastically and stays collapsed?  If you think about it, that would mean that the US government has collapsed, since it’s the government that insures the banks underpinning economic investment by various mechanisms. So the risk of collapse of the stock market’s 500 largest members is pretty much the same as the risk of the US collapsing – in which case, without that government backing, your money is likely to be worthless (and so are your gold coins). So why are you “diversifying” beyond the stock market, again?  If you’re planning to start drawing it down or keeping it level within the next 20 years, then some amount of, say, a bond index fund or inflation-protected securities (TIPS) that will keep up with inflation is fine; but the reason for doing anything beyond that is not as clear as it might seem.  Data snow blindness.
How about stock investment returns? Today mutual fund companies compare their results to the S&P 500 – is that the regular S&P 500 or the total return one?  Do they include their expense ratios – above 2% until recently, now (afaik) around 1.5% – and do you compare them to the Vanguard 0.1% and the Fidelity Spartan 0.2% (plus withholding a bit in cash, which right now earns effectively zero)?  There’s a reason why those index funds outperform around 70% of all other stock investments over a 10-year period, and probably close to 90% over a 30-year period.
In other words, the real implication of data snow blindness is that it is probably hitting you right in the wallet right now – not necessarily yours, since everyone else seems to be doing it too. Or almost everyone else … Gee, I wonder why the Vanguard S&P 500 index fund is one of the two most popular stock investments today?
Anyway, please think about it.  Me, I’m going to go off and check myself for further signs of data snow blindness. 

Sunday, April 14, 2013

SMR: A Miss Is As Good As Half A Mile


I am used to my favorite business magazine, Sloan Management Review, being light-years ahead in usefulness compared to many others, especially as regards the computing industry.  However, in reading several articles in a row in the Winter 2013 edition, I was struck by a strange discomfort.  After careful thought, I think I have identified the reason: it was the fact that the writers were identifying important things to consider, and completely ignoring other crucial things, without which the analysis was far less accurate and useful. They weren’t miles from reality – what they described really was there – but they seemed at least half a mile away.
So let me go through them, one by one.  I hope that at least my critique will help to close that half a mile in my mind or someone else’s.

Apple Did Not Introduce the Desktop Metaphor


The first article (“How to Use Analogies to Introduce New Ideas”) argues that using analogies effectively can be key to introducing new technologies to the market successfully, and that analogies that stress the familiar and analogies that stress “the novel” should be used appropriately, depending on the technology.
In making this argument, the authors use as their first example Apple’s use of the desktop metaphor for its Mac user interface.  While the article doesn’t explicitly say so, the implication is that its customers were unfamiliar with the desktop metaphor (with its files and folders) before the Mac.  That just ain’t so.  I was there.
I was a programmer in the late ’70s when word processors first introduced the desktop metaphor.  And, in fact, it took a lot of hard arguing before the people trying to sell the metaphor realized that it was a bad idea to have file cabinets and files, instead of files nested within folders nested within folders.  But by the end of the ’70s, the idea had taken, and was adopted by PC operating systems well before the Mac arrived in the mid-’80s.
So why does this matter to the authors’ argument?  The point is that the Mac did not succeed because it used a familiar analogy to introduce a novel idea – it didn’t.  The novelty in the Mac operating system (although we should really include the Lisa in this account) was the use of object-oriented programming to rapidly produce an icon-based visual interface in which one operated by point and click, drag and drop.  Yes, analogies matter – the experience of the word processor folks shows that.  But, by the same evidence, the usefulness of the new technologies matters equally, whether an analogy is used to grease the skids or not.
I worry that people will read this article and say, oh, all I need to do in introducing a new technology is make people comfortable with it, or attracted to it, by adding the right analogy.  On the contrary:  I would argue that whenever you do that, you should also work hard at ensuring that the technology is easy to use and useful.  Think about the introduction of the iPhone – little in the way of analogy, loads in the way of demonstration.  That was a novel technology to many – but once seen, very intuitive.  No analogy needed; but the hard work of making it usable – which, imho, was why Jobs succeeded where previous iterations of very similar technology failed – was critical to market success.

What the Future May Bring Is Not Just About Limits to Growth


The next article describes a new book making gloomy forecasts about the next 40 years based on system dynamics (I had a blog post on this vs. agility a while back).  Its author argues that the future is primarily determined by the fact that we are overstretching our resources, and that therefore we will progressively be trying to grow more and more with less and less to grow with and thus with greater and greater starvation, pollution deaths, and other semi-inevitable results.
The problem with this analysis is that he seems to completely fail to understand the science and trends of climate change.  Climate change is not a matter of overstretched resources; it is a matter of a carbon-spewing system running on its own momentum and with many of the disasters ahead already baked in, unaffected by some systems-dynamics shrinkage of population and reduction of resource usage to sustainable levels.  To put it bluntly:  You could shrink the population to one billion right now and reduce some resource usage accordingly, but if you don’t over the next 17 years shrink use of oil, coal, and natural gas by 80-90% from today’s levels and keep it there for at least 200 years, you in all likelihood will still get huge losses of natural resources like farmland from sea-level rise and drought, and billions of deaths from starvation, not to mention the possibility of poisoned air related to ocean acidification.
Frankly, I find this omission distressing, because the book’s author (Randers) is apparently an expert on business and sustainable development, not to mention a professor of “climate strategy.”  If this is what the sustainability movement is typically aiming for, then it is in serious trouble – their goal is not even “sustainable”, since use of resources adequate for the capacity of the earth will not at all matter to businesses in the face of resources such as food shrinking well below the capacity of Earth in these halcyon days.  To put it another way:  first get carbon under control, then talk to me of overstretch.  Zero carbon emissions will at the very least reduce drastically our consumption not only of oil and coal but also related resources; reduction of population and/or generic resource use from 5 billion people equivalents to 1 billion will likely have relatively little effect on oil and coal usage – because that’s not the mole you’re trying to whack.
Randers’ approach, in my strongly held view, would take the sustainability movement down a side track at the moment we can least afford to lose focus.  Please, folks, think about this hard.

Sometimes, Multiple Sizes Do Not Fit All


The next article, “When One Size Does Not Fit All”, argues that companies must choose carefully in supply chain management between focusing on operational efficiency and operational responsiveness (to customers). Unfortunately, the example they use is Dell within the last five years, as it switches from its tried-and-true non-retail consumer-customer rapid-delivery PC model to servicing several types of customer (e.g., businesses) with several types of outlet (e.g., retail) and several types of product (e.g., servers).  The authors argue that the changeover has been a success, once Dell got its act together in developing a different focus for different customers and embedding it in the supply chain.
Unfortunately for their story, I had an actual experience with Dell at about the time of the changeover, about three years ago, and my experience makes me question whether Dell really is an example of a success.  Specifically, I ordered a laptop in late November, assuming that I would get it well before Christmas, as had always been the case with Dell in the past.  On the contrary:  I believe that I got it in early January.  I was in shock.  And yet, the authors’ account seems to imply that there was nothing wrong with Dell’s traditional model at the time of the changeover.
Another example is Dell’s approach to printers. The authors do not even mention printers as a factor in the consumer business, retail or otherwise.  And yet, for a long time, the Dell approach to printers has been an irritant to me.  As I remember, at least for part of the time, Dell only offered Dell inkjets with its Dell PCs and laptops.  That’s all very well, but inkjets need replacement cartridges frequently, and Dell would have you ordering its cartridges online, instead of letting you get them at all sorts of retail stores, as HP does.  And when the delivery times start going south …
The point, to me, is that doing each supply chain right as it evolves is just as important as applying the right supply chain to the right customer.  And, I believe, PC World surveys of customer satisfaction bear me out: Dell’s satisfaction ratings in the consumer market, retail or online, have gone downhill and stayed there.  So the prime finding of the article, fit the right “size” of supply chain to the customer, appears to really miss the mark.  What the Dell example tells me is that you had better evolve each supply chain appropriately and keep it working well as the products offered proliferate, or it won’t matter how well your supply chains fit the customer.

Final Thoughts


I could pick nits on the next two articles (I really don’t think focusing on “likes” in Facebook is the most productive way to do brand management, and I seem to gather the idea of “cloud” outsourcing leaves out minor [sarcasm] factors like knowledge of the market among those tapped for these projects), but they seem much less like a frustrating experience in which the authors seem headed in the right direction, only to result in a big miss.  They seem to be off by a few feet, not half a mile.
So I guess my final thought is this.  Especially if you’re focusing on past history, it’s very important to get an inclusive global picture, and make sure your real-world examples don’t tell you anything different once you look at them closely.  There’s a lot of good work in the articles I cited, and yet I’m not convinced their overall impact, if taken seriously, will be positive at all.  Folks, let’s all up our games.  And caveat lector.

 

Monday, April 8, 2013

IBM Information Management’s BLU Acceleration: Not a Big Data Revolution, But the Beginning of a Revolution


I have now reviewed IBM’s major new announcement in its Big Data effort, BLU Acceleration, and my take is this:  yes, it will deliver major new performance enhancements in a wide variety of specific Big Data cases – and yes, I do view their claim of 1,000-times acceleration in some cases as credible – but the technology is not revolutionary.  Rather, it marks the beginnings of a revolutionary step in database performance and scalability that will be applicable across most Big Data apps – and data-using apps in general.
 
What follows is my own view of BLU Acceleration, not IBM’s.  Click and Clack, the hosts of the automotive repair show Car Talk on NPR, used to preface their shtick with “The views expressed on this show are not those of NPR …” or, basically, anyone with half a brain.  Similarly, IBM may very well disagree with my key takeaways, as well as with my views on the future directions of the technology.
 
Still, I am sure that they would agree that a BLU Acceleration-type approach is a key element of the future direction of Big Data technology. I therefore conclude that anyone who wants to plan ahead in Big Data should at least kick the tires of solutions with BLU Acceleration in them, to understand the likely immediate and longer-term areas in which it may be applied.  And if, in the process, some users choose to buy IBM solutions, I am sure that IBM would be heartbroken – not.

The Rise of the Register


Database users are accustomed to thinking of a storage hierarchy – main memory, sometimes solid-state devices, disk, sometimes tape -- that allows users to get 90% of the performance of an all-main-memory system at 11% of the cost.  There is, however, an even higher level of “storage”:  the registers in a processor (not to mention the L1, etc. cache in that processor).  There, too, the same tradeoffs apply:  they operate at tens to a thousand times the speed of loading a piece of data from main memory, breaking it into parts in order to apply basic operations on it that amount to a transactional operation, and returning it to main memory. 
 
The key “innovation” of BLU Acceleration is to load entire pieces of data (one or multiple columns, in compressed form) into a register and apply basic operations to it, without needing to decompress it or break it into parts.  The usual parallelism between registers via single-instruction-multiple-data-stream techniques and cross-core parallelism adds to the performance advantage.  In other words, the speed of the transaction is gated, not by the speed of main memory access, but by the speed of the register.
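To make "operate on it without decompressing it" a bit more concrete, here is a toy sketch. It is emphatically not IBM's implementation: it assumes a simple order-preserving dictionary encoding (my choice for illustration), and the function names are hypothetical. The idea it shows is that a comparison can be translated into the compressed code space once, after which the scan touches only the compact codes.

```python
from bisect import bisect_right

def encode_column(values):
    """Dictionary-encode a column using order-preserving integer codes."""
    dictionary = sorted(set(values))
    code_of = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [code_of[v] for v in values]

def count_at_most(dictionary, codes, threshold):
    """Count rows with value <= threshold by comparing codes only."""
    # Translate the predicate into code space once ...
    cutoff = bisect_right(dictionary, threshold)   # codes below cutoff qualify
    # ... then scan the compact codes; no row value is ever decoded.
    return sum(1 for c in codes if c < cutoff)

prices = [19.99, 5.00, 19.99, 42.50, 5.00, 7.25]
dictionary, codes = encode_column(prices)
print(codes)                                    # [2, 0, 2, 3, 0, 1]
print(count_at_most(dictionary, codes, 10.0))   # 3
```

In a real engine the codes would be a few bits each, packed many to a register and compared with SIMD instructions; the Python list here only stands in for that packed form.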
 
Now, this is not really revolutionary – we have seen this before, with bit-mapped indexing.  There, data that could be represented as 0s and 1s, such as “yes/no” responses (effectively, a type of columnar storage), could be loaded into a register and basic “and” and “or” operations could be performed on it.  The result?  Up to 1,000 times speedup for transactions on those types of data.  However, BLU Acceleration is able to do this on any type of data – as of now, as long as that data is represented in a columnar format.
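For readers who have not met bit-mapped indexing, a minimal sketch of the idea follows. It uses a Python integer as a stand-in for a register-sized bitmap (one bit per row); the column data and helper names are invented for the example.

```python
def build_bitmap(column, predicate):
    """Pack one bit per row into a single Python int (bit i = row i)."""
    bits = 0
    for row, value in enumerate(column):
        if predicate(value):
            bits |= 1 << row
    return bits

def matching_rows(bitmap):
    """Decode a bitmap back into a list of row numbers."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]

subscribed = ["yes", "no", "yes", "yes", "no", "yes"]
region = ["east", "east", "west", "east", "west", "west"]

is_subscribed = build_bitmap(subscribed, lambda v: v == "yes")
is_east = build_bitmap(region, lambda v: v == "east")

# A single "and" or "or" evaluates the condition for every row at once.
print(matching_rows(is_subscribed & is_east))   # [0, 3]
print(matching_rows(is_subscribed | is_east))   # [0, 1, 2, 3, 5]
```

The speedup comes from that last step: one machine instruction combines as many rows as fit in a register, instead of one comparison per row.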

Exploring the Virtues of Columnar “Flat Storage”


And here we come to a fascinating implication of BLU Acceleration’s ability to do register-speed processing on columnar data:  it allows a columnar-format storage and database to beat an equivalent row-oriented relational storage and database over most of today’s read-only data processing – i.e., most reporting and analytics.
 
As of now, there is a rule of thumb that if more than 1 or 2 columns in a row need to be read in a large-scale transaction, row-oriented performs a little better than column-oriented, because columnar’s advantage in speed via increased data compression is more than counterbalanced by the need in columnar to seek back and forth across a disk for each needed column (physically stored together in row-oriented storage).  However, the shift in emphasis to registers means that the key to performance is main memory – and main memory is “flat” storage, in which columns can be loaded into the processor simultaneously without the need to seek.  Moreover, one aspect of solid-state disk is that it is really “flat” storage (main-memory-type storage that is slower than main memory but stores the data permanently), sometimes with a disk-access “veneer” attached.  In this case, the “veneer” may not be needed; and so, if everything can be stored in a gigabyte of main memory plus a terabyte of solid-state disk, columnar beats or matches row-oriented just about every time.
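The row-versus-column distinction can be sketched in a few lines. The table and function names below are hypothetical, and Python lists only approximate physical layout, but they show the access pattern: in memory, a two-column query over columnar data reads just those two columns, with no seek penalty for fetching them separately.

```python
# The same three-row table in two layouts (hypothetical data).
rows = [
    ("alice", "east", 120.0),
    ("bob", "west", 80.0),
    ("carol", "east", 200.0),
]

# Row-oriented: each record's fields sit together.
# Column-oriented: each column's values sit together.
columns = {
    "name": [r[0] for r in rows],
    "region": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}

def east_total_row_store(table):
    # Every full record is touched, including the unused "name" field.
    return sum(amount for _name, region, amount in table if region == "east")

def east_total_column_store(cols):
    # Only the two columns the query needs are read.
    return sum(a for r, a in zip(cols["region"], cols["amount"]) if r == "east")

print(east_total_row_store(rows), east_total_column_store(columns))
```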
 
This is especially true because now there is very little need for “indexing” – and so, BLU Acceleration claims to eliminate the complexities of indexing entirely (actually, it apparently does contain an index that gives each column a unique ID).  Remember, the purpose of indexing originally in databases was to allow fast access to multiple pieces of data that were mixed and scrambled across a disk – “flat” storage has little need for these things. 
 
A side-effect of eliminating indexing is yet more performance.  Gone is the time-consuming optimizer decision-making about which index to use to generate the best performance, and the time-consuming effort to tune and retune the database indexing and storage to minimize sub-optimal performance. By the way, this raises the question, which I will return to later, as to whether a BLU Acceleration database administrator is needed at all.
 
Now, there still remain, at present, limits to columnar use, and hence to BLU Acceleration advantages.  The BLU Acceleration technology, it seems, has not yet added the “write to disk” capabilities required for decent update-heavy transactional performance.  Also, where there are very high-end applications requiring massive disk storage and involving lots of reads of 3 or more columns, it may well be that row-oriented can compete.  But we should also note that BLU Acceleration has added one more piece of technology to weight the scale in columnar’s favor:  column-based paging.  In other words, to load from disk or disk-veneer SSD storage into main memory, one swaps in a “page” defined as containing one or multiple columns – so that the speed of upload of columns is increased.

The Implications of Distributed Direct-Memory Access


It may seem odd that IBM brought a discussion of its pureScale database clustering solution into a discussion of BLU Acceleration, but to me, there’s a fundamental logic to it – and it has to do with high-end scalability.  Clustering has always been thought of as about availability, not scalability – and yet, clustering continues to be the best way to scale up beyond SMP systems.  But what does that have to do with BLU Acceleration?
 
A fundamental advance in shared-disk cluster technology came somewhere around the early ‘90s, when someone took the trouble to figure out how to load-balance across nodes.  Before, a system would simply check if an invoked application was on the node that received the invocation, and, if not, simply use a remote procedure call to invoke a defined copy of that application (or a piece of data) on another node.  The load-balancing trick simply figured out which node was least used and invoked the copy of the application on that particular node. Before, clusters that added a node might see added performance equivalent to 70% of that of a standalone node. With Oracle RAC, an example of load balancing, some users reported perhaps 80% or a bit above that.
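The load-balancing trick described here is simple enough to sketch. This is a bare-bones illustration with made-up node names and load counts, not how Oracle RAC or any particular cluster manager actually does it.

```python
def dispatch_least_loaded(loads):
    """Pick whichever node is currently least used."""
    return min(loads, key=loads.get)

# Outstanding requests per node (hypothetical numbers).
loads = {"node-a": 7, "node-b": 2, "node-c": 5}

target = dispatch_least_loaded(loads)
loads[target] += 1   # the chosen node takes on the new request
print(target, loads)
```

The older approach would skip the `min` entirely and always route to one predefined copy, whether or not that node was busy.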
 
It appears that IBM pureScale, based on the mainframe’s Parallel Sysplex architecture, takes that load-balancing trick a bit further:  it performs the equivalent of a “direct memory access” to the application or data on the remote node.  In other words, it bypasses any network (or, if the app/data is really in the node’s main memory, storage) protocols and goes directly to the application or data as if it were in the local system’s main memory.  Result:  IBM is talking about users seeing greater than 90% scalability – and I find at least upper 80% scalability something that many implementations may reasonably expect.
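To get a feel for what those percentages mean at cluster scale, here is a small sketch. It assumes a deliberately simple linear model in which every node after the first contributes a fixed fraction of a standalone node; real clusters rarely behave this neatly, and the 16-node size and labels are only for illustration.

```python
def cluster_throughput(nodes: int, per_node_efficiency: float) -> float:
    """Throughput in 'standalone node' units, assuming each node after the
    first contributes a fixed fraction of a standalone node (a simple
    linear model, not a measured result)."""
    return 1.0 + (nodes - 1) * per_node_efficiency

for label, eff in (("pre-load-balancing", 0.70),
                   ("load-balanced", 0.80),
                   ("direct-memory-access", 0.90)):
    print(f"{label:>22}: 16 nodes ~ {cluster_throughput(16, eff):.1f}x")
```

Under this model, going from 70% to 90% per added node is worth the equivalent of three extra standalone nodes in a 16-node cluster.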
 
Now, let’s go back to our “flat storage” discussion.  If the remote direct-memory access really does access main memory or no-veneer solid-state disk, BLU Acceleration’s columnar approach should again best row-oriented, but on a much larger scale.  That is, BLU Acceleration plus a pureScale cluster should see raw Big-Data performance advantages as high as the individual nodes will scale, decreasing by less than 10% times the number of nodes beyond that – and now we’re talking thousands of processors and tens of thousands of virtual machines.
 
And there’s another, more futuristic implication of this approach.  If one can apply this kind of “distributed direct-memory access” in a clustered situation, why not in other situations, server-farm grids, for example, or scale-out within particular geographically-contiguous parts of a cloud?  There is no doubt that bypassing network and storage protocols can add yet more performance to the BLU Acceleration approach – although it appears that BLU Acceleration itself has not yet begun to implement this type of performance improvement.

The Wild BLU Yonder


And yet, I have said that BLU Acceleration is not revolutionary; it’s the beginning of a revolution. For the fact is that most of the piece parts of this new technology with mind-blowing performance have already been out there with IBM and others for some time.  In-memory databases have long probed flat-storage data processing, IBM has actually seemed until now to be late to the game in columnar databases, and I have already noted how bit-mapped indexing delivered thousand-fold performance improvements in certain queries a decade ago. IBM has simply been the first to put all the pieces together, and afaik there is nothing to prevent others from following suit eventually, if they want to.
 
However, it is also true that IBM appears to be the first major player in this new technology approach, and it has a strong hand to play in evolving BLU Acceleration.  And that is where the true revolution lies:  in evolving this technology to add even more major performance improvements to most if not all data processing use cases.  Where might these improvements lie?
 
An obvious extension is decoupling BLU Acceleration from its present implementation in just two database platforms – DB2, where it delivers the above-noted data warehousing and Big Data performance advantages, and Informix, where it allows an optimizer to feed appropriate time-series analyses to a separate group of servers.  This, in turn, would mean ongoing adaptation to multi-vendor-database environments.
 
Then, there are the directions noted above:  performance improvements by eliminating network and storage protocols; extension to more cases of solid-state disk “flat storage”; addition of update/insert/delete transactional capabilities to at least deliver important performance improvements for “mixed” update/query environments like Web sites; and the usual evolution of compression technology for cramming even more columns into a register.
 
What about the lack of indexing?  Will we see no more need for database administrators?  Well, from my point of view, there will be a need for non-flat storage such as disk for at least the medium-term future, and therefore a need to flesh out BLU Acceleration and the like with indexing schemes and optimization/tuning for disk/tape. Then, of course, there is the need for maintaining data models, schemas, and metadata managers – the subject of an interesting separate discussion at today’s conference.  But the bulk of present-day administrative heavy lifting may well be on its way out; and that’s a Good Thing.
 
There’s another potential improvement that I think should be considered, although it sounds as if it’s not on IBM’s immediate radar screen.  When that database transaction is loaded into a register, basic assembler and/or machine-code instructions like “add” and “nor” operate on it. And yet, we are talking about fairly well-defined higher-level database operations (like joins). It seems to me that identifying these higher-level operations and adding them to the machine logic might give a pretty substantial additional performance boost for the hardware/compiler vendor that wishes to tackle it. Before, when register performance was not critical to Big Data performance, there would have been no reason to do so; now, I would guess, there is.
 
The IT Bottom Line

Right now, the use cases are to apply DB2 or Informix with BLU Acceleration to new or existing Big Data or reporting/analytics implementations – and that’s quite a range of applications.  However, as noted, I think that users in general should start to familiarize themselves with this technology right now.
 
For one thing, I see BLU Acceleration technology as evolving in the next 2-3 years to add a major performance boost to most non-OLTP enterprise solutions, not just Big Data.  For another, multi-vendor-database solutions that combine BLU Acceleration columnar technology with row-oriented relational technology (and maybe Hadoop flat-file technology) are likely to be thick on the ground in 2-3 years, and IT needs to figure out how to combine the two effectively, as well as to change its database administration accordingly.  By the way, these are happy decisions to make:  Lots of upside, and it’s hard to find a downside, no matter how you add BLU Acceleration.
 
There’s a lot more to discuss about IBM’s new Big Data solutions and its strategy. For IT users, however, I view BLU Acceleration as the biggest announcement.  I can’t see this doing anything other than delivering major value-add in more and more business-critical use cases over the next few years, whether other vendors implement it or not. Kick those tires.  Hard.

 

 

 

Wednesday, April 3, 2013

Thoughts on Big Data and Data Governance


I want to start this piece by giving the most important take-away for IT readers:  They should take care that data governance does not get in the way of Big Data, and not the reverse.

This may seem odd, when I among others have been pointing out for some time that better data cleansing and the like are badly needed in enterprise data strategies in general. But data governance is not just a collection of techniques – it’s a whole philosophy of how to run your data-related IT activities.  Necessarily, the IT department that focuses on data governance emphasizes risk – security risk, risk of bad data, risk of letting parts of the business run amok in their independence and create a complicated tangle of undocumented data relationships.  And that focus on risk can very easily conflict with Big Data’s focus on reward – on proactive identification of new data sources and digging deeper into the relationships between the data sources one has, in order to gain competitive advantage.

While there is not necessarily clear evidence showing that over-focus on data governance can impede Big Data strategies and thereby the success of the organization, there is some suggestive data. Specifically, a recent Sloan Management Review report found that the least successful organizations were those that focused on using Big Data analytics to cut costs and optimize business processes, while the most successful focused their Big Data analytics on understanding their customers better and using that understanding to drive new offerings.  Data governance, as a risk-focused philosophy, is also a cost-focused and internally-focused strategy.  The task of carefully defining and controlling metadata seeks to cut the costs of duplicated effort and unnecessary bug fixes inherent in line-of-business Wild-West data-store proliferation. It therefore can constrain the kind of proliferating use of new externally-generated data types, like social-media data, that yields the greatest Big-Data success for the enterprise.

Who’s To Be Master?

So, if we need to take care that data governance does not interfere with Big Data efforts, and yet things like data cleansing are clearly valuable, how can we coordinate the two better?  I often find it useful in these situations to model the enterprise’s data handling as a sausage factory, in which indescribable pieces of data “meat” are ground together to produce informational “sausage”.  I like to think of it as having six steps (more or less):

*      Data entry – in which the main aim is data accuracy
*      Data consolidation – in which we strive for consistency between the various pieces of data (accuracy plus consistency, in my definition, equals data quality)
*      Data aggregation – in which we seek to widen the scope of users who can see the data
*      Information targeting – in which we seek to make the data into information fitted to particular targeted users
*      Information delivery – in which we seek to get the information to where it is needed in a timely fashion
*      Information analysis – in which we try to present the information to the user in a format that allows maximum in-depth analytics.

Note that data governance as presently defined appears to affect only the first two steps of this process. And yet, my previous studies of the sausage factory suggest that all of the steps should be targeted, as improving only the first two will only offer minor improvements in a process which tends to “lose” ¾ of the valuable information along the way, each step losing quite a bit more.

How does this apply to Big Data?  The most successful users of Big Data, as noted above, actively seek out external data that is dirty and unconsolidated and yet is often more valuable than the organization’s “tamed” data.  Data governance, as the effective front end of the sausage factory, must therefore not exclude this Big Data in the name of data quality – it must find ways of making it “good enough” that it can be fed into the following four steps.  Or, as one particular database administrator told me, “dirty” data should not just be discarded, as it can tell us about what our sausage factory is excluding that we need to know.

Data governance should also not, if at all possible, interfere with the four steps following data quality assurance.  Widening scope widens security risks; but the benefits outweigh the risks. Information delivery that involves a new data type risks creating a “zone of ignorance” where database governors don’t know what their analysts are doing; but the answer is not to exclude the data type until that distant date when it can be properly vetted.

Much of this can be done by using a data discovery or data virtualization tool to discover new data types and incorporate them in an enterprise metadata store semi-automatically.  But that is not enough; IT needs to ensure that data governance accepts that Big Data exclusion is not an option and that the aim is not pure data, but rather the best balance of valuable Big Data and data quality.

In one of the Alice in Wonderland books, a character uses the word “glory” in a very odd way, and Alice objects that he should not be allowed to.  “The question is,” the character replies, “Who’s to be master, you or the word?”  In a similar way, users of data governance and Big Data need to understand that you, with your need for Big Data customer insights from the outside world, need to be master, not the data governance enforcer.


Monday, March 4, 2013

Didja Ever Wonder About Legislative Drafting Software?


Do you remember Andy Rooney on 60 Minutes?  Every once in a while, he would ask in a plaintive tone, “Did you ever wonder why …?”  He really didn’t know the answers to his questions, but as an outside observer some things just didn’t seem to make sense.  And that, in turn, makes me remember one area in which I wondered why without really knowing the subject:  legislative drafting software.

More than thirty years after I first thought about it, the idea of software supporting legislative drafting is not likely to ever take off, even if I were sure that it made sense.  However, it’s one of those things that I have never seen explained or even tried (I did a quick Google search on the subject and turned up no examples of such software).  And if it did make sense, it seems to me that it could make a major difference both in the effectiveness of laws and in the effectiveness of all of us in dealing with them.

So let me lay out the problem and my solution as I see it, and then anyone who cares can explain why this makes sense – or doesn’t.  And if no one answers, I’ll just keep on wondering.

The Law Is a Computer Language

As it happens, I had a Dad who as a law professor gave me a little insight into laws and the process of drafting them, through occasional readings of material from the Legislative Drafting Fund that he ran to help Congress draft bills and laws.  And as I grew more acquainted, and also got some training in computer science, I realized the similarities between writing a program and writing a law.

The words and phrases used in laws seem ambiguous, but are given meanings as specific as possible (‘semantics’) to allow clear interpretation of the laws.  The law itself can be thought of as a giant “case” statement:  In case 1, do this, in case 2, do this, etc., where each “case” applies a specific test, and the sum of the “cases” is supposed to cover all possible circumstances to which the law can be applied.  The “dos” within these cases involve trials, degrees of guilt, tests of innocence, etc.

So what’s the problem?  The problem, as I see it, is that all is dependent on us fallible human beings.  Laws fail to cover cases; they use the wrong language to cover a case, so that the judge must either follow a mistaken law or figure out “what it should have meant”; or new needs arise and it is hard to determine if the existing law is adequate or not.  Moreover, the proliferation of laws to fix laws makes “ignorance of the law” the norm among non-legal types and surprisingly frequent across specialties even in the legal profession – causing unnecessary legal violations.

No such problems, potentially, exist with computer languages.  Computer languages allow semi-automatic testing to see if all cases are covered, if the program performs as intended, and if code from an existing program can be applied to a new need.  The computing industry has decades of experience in this; the legal industry has none – unless you count training individuals in legislative drafting for each state or for the Federal government, periodically to be updated as new tricks arise.
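To make the analogy a little more concrete, here is a minimal Python sketch of a "law" written as a case statement, together with a semi-automatic check for fact patterns that no case covers. Everything in it is hypothetical and invented purely for illustration: the toy speeding statute, its thresholds, and the names `CASES`, `speeding_rule`, and `find_uncovered` do not model any real law or any existing drafting tool.

```python
from itertools import product

# Hypothetical toy "law": each case pairs a test with an outcome.
CASES = [
    (lambda f: f["speed_over_limit"] >= 30, "court appearance"),
    (lambda f: 10 <= f["speed_over_limit"] < 30 and f["school_zone"], "double fine"),
    (lambda f: 10 <= f["speed_over_limit"] < 30 and not f["school_zone"], "fine"),
    (lambda f: f["speed_over_limit"] <= 0, "no violation"),
]

def speeding_rule(facts):
    """Return the outcome of the first case whose test matches, else None."""
    for test, outcome in CASES:
        if test(facts):
            return outcome
    return None  # a gap: the "law" is silent about these facts

def find_uncovered(speeds, zones):
    """Enumerate fact patterns and collect those that no case covers."""
    gaps = []
    for speed, zone in product(speeds, zones):
        facts = {"speed_over_limit": speed, "school_zone": zone}
        if speeding_rule(facts) is None:
            gaps.append((speed, zone))
    return gaps

# Try every speed from 5 under to 40 over the limit, in and out of school zones.
gaps = find_uncovered(range(-5, 41), [True, False])
print(len(gaps), "uncovered fact patterns, for example", gaps[:2])
```

In this toy statute the drafter forgot speeds between 1 and 9 over the limit, and the enumeration surfaces that gap mechanically. A real body of law is obviously far harder to enumerate; the sketch only shows the kind of coverage test that programs make routine.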

Programming Instead of “Precedent” and “Common Sense”

But the implications of creating testable software mimicking both existing laws and proposed ones go beyond simply legislative drafting.  As I understand it (which is to say, far too little), the ways that laws typically adapt to ever-changing circumstances are by “precedent” and “common sense”. “Precedent” I take to mean an initial take on how something not clear from the law’s text should be handled, then used as a guide to future decisions (unless, rarely, overturned). “Common sense” I take to mean in Holmes’ sense of “The life [i.e., the ability to grow] of the law is common sense”:  we assume that the drafters of laws would be OK with our interpreting them as providing general, adaptable guidelines, even if a strict reading of the law would seem to contradict some extensions of the law.

These kludges seem to carry us past much of the excesses of the legislative process, although they do not necessarily completely solve the rigidity of law.  However, a closer look suggests that we could do better in adapting laws to new circumstances than simply precedent and common sense.  For one thing, it is clear that courts are now undertaking, without the requisite expertise, to assess things like evolution and climate change, not to mention economics.  More fundamentally, there is simply no mechanism for accepting the way that science may come to contradict a law – as in the purported case where Indiana passed a law setting pi to 3 (all right, that’s mathematics, not science, but still …).  Science is not common sense, but it is far more soundly based in reality.  And precedent itself may set in stone things that need more frequent revision.

While testable software that approximates laws and draft laws cannot overcome these problems, it can provide a means of detecting them.  In so doing, it allows greater scope for creativity in finding ways to overcome the limitations of precedent and common sense.  For example, using a body of legislative “programs” as data, one could imagine a scientific fact and “insert” it in a variety of cases to see how laws handle it.

Likewise, one could imagine a corporation or individual running a proposed action against this “program” data instead of just consulting a corporate or personal lawyer.  Would this end the need for these?  Not at all; rather, they simply would not be wasting their time and their employer’s money with straightforward answers to legal questions.

So What About It?

I don’t know; all of the above seems reasonable to me.  So why hasn’t anyone tried?  Or have they, and have they concluded that it’s not worth the effort?  I await the answer, not with bated breath, but with a sense that I’ll probably never know.

Friday, February 22, 2013

Parasoft and "Service Virtualization" Testing: A Good Idea


Recently there passed across my desk a white paper sponsored by Parasoft about the idea of applying what they called “service virtualization” to software testing.  Ordinarily, I find that “we’ve been there and done that” for much of the material in most of the white papers like this that I see.  In this case, however, I think that the idea Parasoft describes is (a) pretty new, (b) applicable to many software-development situations, and (c) quite valuable if effectively done.

The Problem

The problem to be solved, as I understand it – and my own experience and conversations with development folks suggest that it does indeed happen frequently these days – is that in the later stages of software testing, of dependency and volume testing shortly before version or product deployment, one or more key applications not involved in the software development or upgrade but with “interaction effects” is effectively unavailable for testing in a timely fashion. It may be a run-the-business ERP application for which stress testing crowds out the needed customer-transaction processing.  It may be a poorly documented mission-critical legacy application for which creating a “sandbox” is impractical. You can probably think of many other cases.

In fact, I think I ran across an example recently.  It went like this:  a software company selling a customer-facing application started up about five years ago.  Over five years of success, they ran that customer-facing application 24x7, with weekly maintenance halts for a couple of hours and 4-6-hour halts for major upgrades.  All very nice, all very successful, as revenue ramped up nicely. 

Then (reading between the lines), recently they realized that they had not upgraded their back-end billing and accounting systems that fed off the application, and these were increasingly inefficient and causing problems with customer satisfaction.  So they tested the new solutions in isolation in a “sandbox”, and then scheduled a full 12 hours of downtime on the app to install the new solutions – without “sandboxing” the back-end and front-end solutions working together first.

Everything apparently went fine until they started up a “test run” of the back-end and front-end solutions working in sync.  At that point, not only did the test fail, but it also created problems with the “snapshot” of front-end data that they had started from.  So they had to repeatedly reload the start point and do incremental testing on the back-end systems. In the end, they took more than two days (complete with anguished screams from customers) to add some changes to back-end systems and make the customer-facing application available again; and it took several more days before the rest of the back-end systems were available in the new form. As the white paper notes, in planning final testing of new software, companies can often be willing to skip integration testing involving interdependent unchanged software; and the consequences can be quite serious.

The Idea

Probably to cash in on the popularity of “virtualization”, the white paper calls the idea proposed to deal with this problem “service virtualization.”  To my eyes, the best description is “script-based dependent software emulation.” In other words, to partially replace the foregone testing, service virtualization would allow you to create a “veneer” that would, whenever you invoke the dependent software during your integration testing, spit back the response that the dependent software should give. This particular solution provides two ways of creating the necessary “scripts” (my categorization, not Parasoft’s):

1.       Build a model of how the dependent software runs, and invoke the model to generate the responses; and
2.       Take a log of the actions of the dependent software during some recent period, and use that information to drive the responses.

Before I analyze what this does and does not do, let me note that I believe Approach #2 is typically the way to go. The bias of an IT department considering skipping the dependent-software integration testing step is towards assuming that there will be no problems.  The person building the model will therefore often implicitly build it the way the dependent software would work if there really were no problems – and response times are often guesstimates.  The log of actual actions introduces a needed additional note of realism into the testing.

However, the time period being logged almost inevitably does not capture all cases – end-of-year closing, for example.  The person creating the “virtualized service” should have a model in mind that allows him or her to add the necessary cases not covered by the log.
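As a rough illustration of Approach #2 supplemented by a small model, here is a minimal Python sketch of a "veneer" that replays logged responses and falls back to a hand-written model for cases the log never saw. This is not Parasoft's product or API. The log format (one JSON object per line with `request` and `response` keys) and the names `VirtualizedService` and `year_end_model` are assumptions made up for the example.

```python
import json

class VirtualizedService:
    """Stands in for the dependent software by replaying logged responses."""

    def __init__(self, log_lines, fallback_model=None):
        # Assumed log format: one JSON object per line,
        # {"request": {...}, "response": {...}}.
        self.recorded = {}
        for line in log_lines:
            entry = json.loads(line)
            self.recorded[self._key(entry["request"])] = entry["response"]
        self.fallback_model = fallback_model

    @staticmethod
    def _key(request):
        # Sort keys so that field order in the request does not matter.
        return json.dumps(request, sort_keys=True)

    def call(self, request):
        key = self._key(request)
        if key in self.recorded:
            return self.recorded[key]            # Approach #2: replay the log
        if self.fallback_model is not None:
            return self.fallback_model(request)  # Approach #1: consult a model
        raise LookupError(f"no recorded or modeled response for {request!r}")

def year_end_model(request):
    """Hypothetical model for a case the logged period did not cover."""
    if request.get("op") == "close_year":
        return {"status": "closed", "year": request["year"]}
    raise LookupError(f"model does not cover {request!r}")

log = ['{"request": {"op": "bill", "customer": 42}, '
       '"response": {"status": "ok", "invoice": 1001}}']
stub = VirtualizedService(log, fallback_model=year_end_model)
replayed = stub.call({"customer": 42, "op": "bill"})
modeled = stub.call({"op": "close_year", "year": 2012})
print(replayed, modeled)
```

The integration test calls `stub.call(...)` wherever it would otherwise have invoked the real dependent application. Raising an error on an unknown request, rather than inventing a response, keeps the test honest about what the log and the model do not cover.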

Gains and Limits

The “service virtualization” idea is, I believe, a major advance over the previous choice between a major disruption of online systems and a risk of catastrophic downtime during deployment. If one takes Approach #2 as described above, “service virtualization” will add very little on to testing time and preparation, while in the vast majority of cases it will detect those integration-test problems that represent the final barrier to effective testing before deployment.  In other words, you should be able to decrease the risk of software introduction crashes tenfold or a hundredfold. 

The example I cited above is a case in point.  It would have taken fairly little effort to use a log of customer-facing app interactions in a sandbox integration test with the new back-end systems.  This would also have sped up the process of incremental testing of the new software once a problem was detected.

There are limits to the gains from the new testing approach – although, let’s note up front that these do not detract in the slightest from the advantages of “service virtualization”.  First, even if you take Approach #2, you are effectively doing integration testing, not volume/stress testing.  If you think about it, what you are mimicking is the behavior of the “dependent” software before the new systems are introduced.  It is possible, nay, likely, that the new software will add volume/stress to the other software in your data center that it interacts with.  And so, if the added stress does cause problems, you won’t find out about it until you’re operating online and your mission-critical software slows to a crawl or crashes.  Not very likely; but still possible.

Second, it is very possible that there will be a lag time between the time when you capture the behavior of the “dependent” software and the time that you run the tests.  It is fairly simple to ensure that you have the latest and greatest version of “dependent” software with the latest bug fixes, and to keep track of whatever changes happen online during sandbox testing offline. If you are just periodically refreshing the log “snapshot” as in Approach #2, or even operating from a model written a year ago, as happens all too often, then there is a real possibility that you have missed crucial changes to the “dependent” software that will cause your integration testing to succeed and then your deployment to crash. Luckily, ex-post analysis of dependent-software changes makes fixing this problem much easier – but it should be minimized by straightforward monitoring of operational-dependent-software mods during testing.

The Bottom Line for Parasoft and “Service Virtualization” Testing:  Worth Looking At Right Now

The Parasoft solution appears to apply especially to IT shops with significant experience with “skipping” integration testing due to “dependent software”, and with a reasonably sophisticated test harness.  Of course, if you don’t have a reasonably sophisticated test harness that can do integration testing of new software and other operational systems in your environments, perhaps you should consider acquiring one.  I suspect that the case I cited earlier not only failed to sandbox integration testing, but didn’t have the test harness to do so even had they wanted to. 

For those IT shops fitting my criteria, there seems no real reason to wait to kick the tires of, and probably buy, additional “service virtualization” features.  As I said, the downside in terms of added test time and effort in these cases appears minimal, and the gains from the additional software robustness clear, and potentially company-reputation-saving.

I will, however, add one note of caution, not about the solution, but about your strategy in using it.  Practically speaking, “service virtualization” is rarely if ever to be preferred to full integration testing, if you can do it.  It would be a very bad idea to use the new tools to move the boundary between what is fully tested, because you can manage it in a reasonable time, and what is quicker and easier but risks disaster.  Do use “service virtualization” to replace naked “close your eyes and hope” deployment; don’t use “service virtualization” to replace an existing thorough integration test.

Kudos to Parasoft for marketing such a good idea.  Check it out.

Monday, February 18, 2013

The Other Sad Task of Combating Climate Change


Writing this kind of blog post tends to freeze my brain.  I find it astounding sometimes to be trying to present in a logical and calm fashion a description of horrors.  But there it is.

I noticed in perusing comments in various climate-change-related web sites that even among fairly well-informed folks there seems to be a misperception, which runs something like this:  The job of combating climate change is about slowing carbon emissions, preferably as quickly as possible.  As I understand it, that is half right.  There is another distinct task:  leaving at least a significant amount of carbon-emitting “fossil fuels” in the ground – forever, or at least for the next 100-1000 years.  Moreover, that task assumes that we do not discover major new sources of oil, natural gas, and coal.  If we do, then we need to leave the equivalent of a significant amount of present “reserves” plus all reserves discovered in the future in the ground.

One implication of this:  we need to understand that there is a Hard Stop somewhere in the future, a point beyond which we dare not use even one milligram more of fossil fuels.  If we cut carbon emissions drastically in the near future, and keep them cut, that Hard Stop almost certainly will never arrive – instead, we will suffer various degrees of what Joe Romm calls “Hell and High Water”, involving at worst the decimation (not in the Roman sense – in the sense that 9/10ths of humanity will die, mostly of starvation, disease, and poisoned air) of humankind.  If we continue on the present path of fossil-fuel use increases and minor moves towards “sustainability”, the so-called “business as usual”, that Hard Stop may even arrive by the end of this century.  That Hard Stop represents absolutely no further use of fossil fuels because the alternative might be the end of all life on earth, forever.

How can I say this?  How can I not be wildly exaggerating, in Mark Twain’s sense (“the reports of my death are wildly exaggerated”)?

Keystone and Game Over

It may strike people as odd that there is such an environmental furor over one oil pipeline project in the US (the Keystone XL proposal).  Here’s a frequently cited quote (paraphrased) by Dr. James Hansen on the subject:  “If Keystone XL goes forward, then it’s game over for the climate.”  Most people, my sense is, read that as meaning that some form of “Hell and High Water” becomes inevitable.  I believe that instead, he is also referring to a previous quote (in, I believe, his book Storms of My Grandchildren, and also paraphrased):  “If we use all our present reserves of coal and oil, there is a significant chance of a runaway greenhouse effect.  If we also use all our tar sands and oil shale, I view the runaway greenhouse effect as likely.” Before I explain my understanding of this, let’s note that Keystone XL transports oil derived from Canadian tar sands to US ports for export abroad.

What’s a runaway greenhouse effect?  If we look at Venus, we see a planet with extreme heat and with acid rain that dissolves any life forms that might exist in the air, and then evaporates before it reaches the surface.  However, if Venus had no atmosphere, there would be no extreme heat and no acid rain.  Instead the temperature would be a significant distance below the “runaway point” (estimated by Dr. Hansen at somewhere around 62 degrees Fahrenheit, iirc). Carbon or other substances in the atmosphere reflect light-generated heat bounced from the surface back to the surface again, trapping it – and also increasing the acidity of water (again, as I understand it).  If Earth passes that “runaway point”, then we will become like Venus.

Now, Earth without an atmosphere would be far below the temperature of Venus – below freezing, actually.  The atmosphere adds one layer of carbon-based reflection or “trapping” of heat (yes, I realize I’m simplifying drastically).  Life itself – all life, especially vegetable – adds another.  Life is carbon-based, and it creates a carbon cycle that emits carbon to the atmosphere, and then absorbs it in non-organic matter when it returns, via a process called “weathering” that deposits much of the carbon returned into the oceans. In ordinary times, this creates a way of handling perturbations in carbon emissions so that one returns eventually to somewhat of a “steady state”. And that “steady state” is still clearly under the “runaway point.”

Now here is where we get to the importance of leaving some fossil fuels in the ground.  Because we have seen “Hell and High Water” in the past, and life has been decimated but survived.  But what are fossil fuels, really?  Primarily the carbon deposited in the ground by life – especially vegetable life – over as much as the last billion years or so.  Now compare this episode of carbon emissions to all past episodes.  We have seen surges in carbon emissions from the Milankovitch cycle before, and from long-term underwater eruptions that bring new carbon up from the Earth’s core to the air.  We have seen methane spurts due to accompanying thawing of places like the Arctic that may have made the temperature rises and carbon in the atmosphere more extreme.  What we have never seen before is taking all the carbon stored over hundreds of millions of years and injecting it into the atmosphere over what could turn out to be a period of 200 years (and carbon has a half-life of perhaps 100 years in the atmosphere).
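For a feel of what those two numbers imply together, here is a back-of-the-envelope Python sketch. It assumes, purely for illustration, that emissions are spread evenly over the 200 years and that airborne carbon decays by a simple half-life of 100 years. Both are crude simplifications of the rough figures above, not a carbon-cycle model, and the function name `fraction_remaining` is made up for the example.

```python
HALF_LIFE_YEARS = 100   # the rough "perhaps 100 years" figure from the text
EMISSION_YEARS = 200    # the injection period from the text

def fraction_remaining(emission_years, half_life_years):
    """Fraction of all emitted carbon still airborne when emissions end.

    Assumes equal emissions every year and simple exponential decay.
    """
    yearly_survival = 0.5 ** (1.0 / half_life_years)
    remaining = 0.0
    for year in range(emission_years):
        age = emission_years - year   # years this year's batch has decayed
        remaining += yearly_survival ** age
    return remaining / emission_years

share = fraction_remaining(EMISSION_YEARS, HALF_LIFE_YEARS)
print(f"{share:.0%} of the emitted carbon is still in the atmosphere")
```

Under these toy assumptions, a little over half of everything emitted is still in the air on the day the emissions stop, which is the sense in which a 200-year injection is fast relative to a 100-year half-life.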

And Hansen’s best estimate is that use of all of that stored carbon over a period even of much longer than 200 years is likely to bring on a runaway greenhouse effect.  This is because the ocean is the primary way of restoring equilibrium to the system, and at some point before we use up all that carbon, if we do it fast enough now, the ocean stops being able to absorb as much carbon (apparently, according to Wikipedia, because of the slowing of the “biological pump”) – and carbon coming down from the atmosphere cycles right back up again.  And so, once that point is reached, carbon doesn’t cycle very much back into the ocean – it goes on accumulating in the atmosphere, for a thousand years or more, until the ocean begins to regain its ability to absorb carbon. Thus, as we get close to the “runaway point”, we can’t just slow carbon emissions down to a point at which as much carbon is returning as is being emitted – the only point at which that is true is near-zero emissions, a Hard Stop.

Now, hopefully, you begin to see why it’s important to leave significant amounts of fossil fuels in the ground for at least 100-1000 years:  it keeps us away from that Hard Stop, and hence that “runaway point.”  It keeps us away from the ultimate horror.

So why is Keystone XL so critical to this?  There is at present no real market for tar sands oil.  There are very high up-front costs, which only the Canadian government has taken on so far – and no one else appears likely to, in the immediate future, if the Canadians don’t succeed.  The only realistic way of getting that oil from inland Canada to a decent market, it appears, is to add to existing pipelines and send it to ports in the southern US – all other routes appear to involve too-large costs and times of building new infrastructure – and without that route, further carbon emissions from tar sands oil will be minimal. If Keystone XL goes through, it appears likely that a significant portion of the world’s tar sands oil will be emitted over the next 40 years – if not, not.

That’s why Keystone XL matters.  That’s why Hansen has been campaigning for several years to stop most worldwide production of coal, as the least painful way of avoiding the “significant chance” of a runaway greenhouse effect. That’s why people need to think about handling climate change today as not simply a matter of adaptation to “Hell and High Water” or slowing down carbon emissions by a couple of percentage points per year right now.  It has now reached the point where we need to face the idea of effectively never using some of those fossil fuels – not just letting the market assume that using it all is OK.

Action Items and Dyslogy

What our sad other task of facing climate change amounts to, therefore, imho, is not only to stop Keystone XL in its tracks.  It amounts to making sure that Keystone and its ilk never happen.  It also amounts to trying to ensure, with each future use of fossil fuels, that a comparable amount of reserves is made unusable, effectively, forever (or until 1000 years from now, whichever comes first).  And it means keeping an eye on new sources of fossil fuels, to limit their use sharply forever.

And if we fail?  I suppose we can write a eulogy for life on Earth.  Except that writing a eulogy for a species that ended life forever seems a bit off, somehow.  The opposite of Utopia is dystopia; I guess we should write a dyslogy.  Someone recently passed me the end of a Swinburne poem that seems to fit – it even includes the sea rise that’s an initial stage.  I have changed one word.

Here death may deal not again for ever;
       Here change may come not till all change end.
From the graves they have made they shall rise up never,
       Who have left nought living to ravage and rend.
Earth, stones, and thorns of the wild ground growing,
       While the sun and the rain live, these shall be;
Till a last wind's breath upon all these blowing
               Roll the sea.

Till the slow sea rise and the sheer cliff crumble,
       Till terrace and meadow the deep gulfs drink,
Till the strength of the waves of the high tides humble
       The fields that lessen, the rocks that shrink,
Here now in his triumph where all things falter,
       Stretched out on the spoils that his own hand spread,
As a god self-slain on his own strange altar,
               Man [Swinburne – Death] lies dead.