Monday, January 30, 2012

The Other BI: Orange EDA and Statistical Analytics

This blog post highlights a software company and technology that I view as potentially useful to organizations investing in business intelligence (BI) and analytics over the next few years. Note that, in my opinion, this company and solution are not typically “top of mind” when we talk about BI today.

The Importance of Orange-Type Statistical Analysis to Analytics

BI has taken a major step forward in maturity over the last few years, as statistical packages have become more closely associated with analytics. Granted, SAS has for years distinguished itself by its statistics-focused BI solution; but when IBM recently acquired SPSS, the granddaddy of statistical packages, the importance of more rigorous analysis of company and customer data seemed both confirmed and more obvious. Moreover, over the years, data miners have begun to draw on the insights of university researchers about things like “data mining bias” and Bayesian statistics – and the most in-depth, competitive-advantage-determining analyses have benefited as a result. And so it would seem that we are on a nice technology glide path: statistics rounds out analytics by covering the extreme of certainty and analytical complexity, traditional analytics tools cover the rest of the spectrum down to situations where shallow and imprecise analysis is appropriate, and statistical techniques gradually filter down, as the technology evolves, to the “unwashed masses” of end users. Or are we?

You see, there is a glaring gap in this picture of increasing knowledge of what’s going on – or at least a gap that should be glaring. This gap might be summed up as Alice in Wonderland’s “verdict first, then the trial”, or business’ “when you have a hammer, everything looks like a nail.” Both the business and the researcher start with their own narrow picture of what the customer or research subject should look like, and analytics and statistics that begin with such hypotheses are designed to narrow in on a solution rather than expand in response to unexpected data; the business or researcher is therefore very likely to miss key customer insights, psychological and otherwise. Pile on top of this the “not invented here” syndrome characteristic of most enterprises, and the “confirmation bias” that recent research has shown to be prevalent among individuals and organizations, and you have a real analytical problem on your hands.

This is not a purely theoretical problem, if you will excuse the bad joke. In the psychological statistics area, the recent popularity of “qualitative methods” has exposed, to those who are willing to see, the enormous amount of insights that traditional statistics fails to capture about customer psychology, sociology, and behavior. Both approaches, of course, would seem to suffer from the deficit that Richard Feynman pointed out – the lack of control groups that renders any conclusion suspect because a “placebo” or “Hawthorne” effect may be involved – but it should be noted that even when (as seems to be happening) this problem is compensated for, the “verdict first” problem remains, because the world of people is far less easy to pre-define than that of nuclear physics.

In the world of business, as I can personally attest, the same type of problem exists in data-gathering. For more than a decade, I have run TCO studies, particularly on SMB use of databases. I discovered early on that open-ended interviews of relatively few sysadmins were far more effective at capturing the real costs of databases than much broader, inflexible rate-it-on-a-scale-from-1-to-5 surveys of CIOs. Moreover, simply by giving the interviewee room to tell a story from his or her point of view, I would consistently get an insight of extraordinary value – for example, that SMBs cared less about technology that saved operational costs than about technology that saved a local-office head time by requiring him or her to just press a button while shutting off the lights on Saturday night. The key to the success of my “surveys” was that they were open-ended (able to go in a new direction during the interview, leaving space for whatever the interviewer might have left out), interviewee-driven (they started by letting the interviewee tell a story as he or she saw it), and flexible in the kind of data collected (typically, an IT organization did not know the overall cost of database administration for the organization – in a survey, they would have guessed, badly – but they almost invariably knew how many database instances each administrator handled).

As it turns out, there is a comparable statistical approach for the data-analysis side of things. It’s called Exploratory Data Analysis, or EDA.

As it has evolved in the decades since John Tukey first popularized it, EDA is about analyzing smaller amounts of data to generate as many plausible hypotheses (or “patterns in the data”) as possible before winnowing them down with further data. To further clear the statistical researcher’s mind of bias, the technique creates abstract, unlabeled visualizations of the patterns (“data visualization”), such as the strangely named box-and-whisker plot. The analysis is not deep – but it identifies far more hypotheses, and therefore quite a few more areas where in-depth analysis may reveal key insights. The automation of these techniques has made applying EDA a minor blip in the average analyst’s process, so effective use of EDA should yield a major improvement in analytics effectiveness “at the margin” (in the resulting in-depth analyses) for a very small “overhead cost” in time. In fact, EDA has reached the point – as in the open-source Orange solution – where it is merged with a full-fledged data-mining tool.
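To make the box-and-whisker idea concrete, here is a minimal sketch (my own illustration, not Orange’s code) of the arithmetic behind such a plot: the five-number summary plus Tukey’s 1.5×IQR “fences” for flagging candidate outliers. The data values are invented.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) -- the numbers a box plot draws."""
    q1, q2, q3 = quantiles(data, n=4)  # default "exclusive" quartile method
    return (min(data), q1, q2, q3, max(data))

def tukey_outliers(data, k=1.5):
    """Flag points more than k*IQR outside the quartiles (Tukey's fences)."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]

# Invented figures: database instances per administrator, from interviews.
instances = [4, 5, 5, 6, 6, 7, 7, 8, 9, 40]
print(five_number_summary(instances))  # min, quartiles, max
print(tukey_outliers(instances))       # the 40 is flagged for a closer look
```

In a real EDA pass, a tool like Orange would draw the plot and let the analyst eyeball dozens of such summaries at once; the point is that the computation is cheap enough to run over every variable in the data set.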

And yet, I find that most in university research and in industry are barely aware that EDA exists, much less that it might have some significant use. For a while, SAS’ JMP product stood bravely and alone as a tool that could at least potentially be used by businesses – but I note that according to Wikipedia they have recently discontinued support for its use on Linux.

So let’s summarize: EDA is out there. It’s easy to use. Now that statistical analysis in general is creeping into greater use in analytics, users are ready for it. I fully anticipate that it would have major positive effects on in-depth analytics for enterprises from the very largest down at least to the larger medium-sized ones. IT shops will have to do some customization and integration themselves, because most if not all vendors have not yet fully integrated it as part of the analytics process in their BI suites; but with open-source and other “standard” EDA tools, that’s not inordinately difficult. The only thing lacking is for somebody, anybody, to wake up and pay attention.

The Relevance of Orange EDA to Statistical-Analysis-Type BI

Orange’s relevance may already be apparent from the above, but I’ll say it again anyway. Orange’s EDA solution integrates with enterprise-type data-mining analytics and supports a wide range of data-visualization techniques, making it a leading choice for fitting EDA to your enterprise’s analytics. Orange is open source, which means it’s as cheap as you can get for quick-and-dirty work, and also means it’s not going to go away. Most importantly, Orange lays down a solid, relatively standardized foundation that should be easy to incorporate or upgrade from when, someday, the major vendors finally move into the area and provide fancier techniques and better integration with a full-fledged BI suite. That’s all; and that’s plenty.

Potential Uses of Orange-Type EDA in Analytics for IT

Since IT will need to do some of the initial legwork here, without the usual help from one’s BI supplier, the most effective initial use of Orange-type EDA is in support of the longer-term efforts of today’s business analysts, and not in IT-driven agile BI. However, IT should find these business analysts to be surprisingly receptive – or, at the least, as recent surveys suggest, amazed that IT isn’t being a “boat anchor” yet again. You see, EDA has a sheen of “innovation” about it, and so folks who are in some way associated with the business’ “innovation” efforts should like it a lot. The rest is simply a matter of its becoming part of these business analysts' steadily accumulating toolkit of rapid-query-generation and statistical-in-depth-insight-at-the-margin tools. EDA may not in the normal course of usage get the glory of notice as the source of a new competition-killer; but with a little assiduous use-case monitoring by IT, the business case can be made.

It is equally important for IT to note that EDA is twice as effective if it is joined at the front end by a data-gathering process that is to a much greater extent (to recap) open-ended, customer-driven, and flexible (in fact, agile) in the type of data gathered. Remember, there are ways of doing this – such as parallel in-depth customer interviews or Internet surveys that don’t just parrot SurveyMonkey – that add very little “overhead” to data-gathering. IT should seriously consider doing this as well, and preferably design the data-gathering process so as to feed the gathered data to Orange-type EDA tools where in-depth statistical analysis of that data will probably be appropriate as the next step. The overall effect will be like replacing a steadily narrowing view of the data with one that expands the potential analyses until the right balance between “data blindness” and “paralysis by analysis” risks is reached.
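As one hypothetical illustration of that hand-off (my own sketch, not an official Orange API – the field names and values are invented), interview results could be written straight to Orange’s tab-delimited .tab format, whose three header rows declare attribute names, types, and roles:

```python
import csv

# Invented interview results; the field names are hypothetical.
rows = [
    {"site": "branch-a", "db_instances_per_admin": 12, "weekend_admin": "yes"},
    {"site": "branch-b", "db_instances_per_admin": 30, "weekend_admin": "no"},
]

def to_orange_tab(rows, path):
    """Write rows as an Orange-style .tab file: three header rows
    (attribute names, types, roles), then one tab-separated data row each."""
    names = list(rows[0])
    types = ["string", "continuous", "discrete"]  # per-column Orange types
    roles = ["meta", "", "class"]  # "" marks an ordinary attribute
    with open(path, "w", newline="") as f:
        w = csv.writer(f, delimiter="\t")
        w.writerow(names)
        w.writerow(types)
        w.writerow(roles)
        for r in rows:
            w.writerow([r[n] for n in names])

to_orange_tab(rows, "interviews.tab")
```

The design point is simply that the gathering side stays flexible – a new question becomes a new column – while the output lands in a form an EDA tool can load without further massaging.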

The Bottom Line for IT Buyers

To view Orange-type EDA as comparable to the other BI technologies/solutions I have discussed so far is to miss the point. EDA is much more like agile development – its main value lies in changing our analytics methodology, not in improving analytics itself. It helps the organization think not “outside the box” but “outside the organization” – to combine the viewpoint of the vendor with the viewpoint and reality of the customer, rather than forcing customer interactions into corporate fantasies of the way customers should think and act for maximum vendor profit. We have all seen the major public-relations disaster of Bank of America’s debit-card fees – one that, if we were honest, we would admit most other enterprises find it all too easy to stumble into. If EDA (or, better still, EDA plus open-ended, customer-driven, flexible data-gathering) prevents only one such misstep, it will have paid for itself ten times over, no matter what the numbers say. In a nutshell: EDA seems like it’s about competitive advantage; that’s true as far as it goes, but EDA is actually much more about business risk.

The Orange value proposition for such uses of EDA has been noted twice already; no need to repeat it a third time. For IT buyers, it simply means that any time you decide to do EDA, Orange is there as part of a rather short short list. So that leaves the IT buyer’s final question: what’s the hurry?

And, of course, since EDA is about competitive advantage (sarcasm), there is no hurry. Unless you consider the possibility that each non-EDA enterprise is a bit like a drunk staggering along a sidewalk who has just knocked over the fence bordering an abyss, and who if he then happens to stagger over the edge is busy blaming the owner of the fence (the CEO?) all the way to the bottom. That abyss is the risk of offending the customer. That inebriation is business as usual. EDA helps you sober up, fast.

I can’t say that you have to implement EDA now or you’ll fall. But do you really want to risk doing nothing?
