Thoughts From a Retired Software IT Analyst

Monday, January 30, 2012

The Other BI: Orange EDA and Statistical Analytics

This blog post highlights a software company and technology that I view as potentially useful to organizations investing in business intelligence (BI) and analytics in the next few years. Note that, in my opinion, this company and solution are not typically “top of the mind” when we talk about BI today.

The Importance of Orange-Type Statistical Analysis to Analytics

BI has taken a major step forward in maturity over the last few years, as statistical packages have become more associated with analytics. Granted, SAS has for years distinguished itself by its statistics-focused BI solution; but when IBM recently acquired SPSS, the grand-daddy of statistical packages, the importance of more rigorous analysis of company and customer data seemed both confirmed and more obvious. Moreover, over the years, data miners have begun to draw on the insights of university researchers about things like “data mining bias” and Bayesian statistics – and the most in-depth, competitive-advantage-determining analyses have benefited as a result. And so, it would seem that we are on a nice technology glide path, as statistics completes the flexibility of analytics by covering one extreme of certainty and analytical complexity, while traditional analytics tools cover the rest of the spectrum up from situations where shallow and imprecise analysis is appropriate, and as statistical techniques filter down by technology evolution to the “unwashed masses” of end users. Or are we?

You see, there is a glaring gap in this picture of increasing knowledge of what’s going on – or at least a gap that should be glaring. This gap might be summed up as Alice in Wonderland’s “verdict first, then the trial”, or business’ “when you have a hammer, everything looks like a nail.” Both the business and the researcher start with their own narrow picture of what the customer or research subject should look like, and the analytics and statistics that start with such hypotheses are designed to narrow in on a solution rather than expand due to unexpected data, and so the business/researcher is very likely to miss key customer insights, psychological and otherwise. Pile on top of this the “not invented here” syndrome characteristic of most enterprises, and the “confirmation bias” that recent research has shown to be prevalent among individuals and organizations, and you have a real analytical problem on your hands.

This is not a purely theoretical problem, if you will excuse the bad joke. In the psychological statistics area, the recent popularity of “qualitative methods” has exposed, to those who are willing to see, the enormous amount of insights that traditional statistics fails to capture about customer psychology, sociology, and behavior. Both approaches, of course, would seem to suffer from the deficit that Richard Feynman pointed out – the lack of control groups that renders any conclusion suspect because a “placebo” or “Hawthorne” effect may be involved – but it should be noted that even when (as seems to be happening) this problem is compensated for, the “verdict first” problem remains, because the world of people is far less easy to pre-define than that of nuclear physics.

In the world of business, as I can personally attest, the same type of problem exists in data-gathering. For more than a decade, I have run TCO studies, particularly on SMB use of databases. I discovered early on that open-ended interviews of relatively few sysadmins were far more effective in capturing the real costs of databases than far wider-spread on-a-scale-from-1-to-5 inflexible surveys of CIOs. Moreover, if I just included the ability of the interviewee to tell a story from his or her point of view, the respondent would consistently come up with an insight of extraordinary value, such as the idea that SMBs didn’t care so much about technology that saved operational costs as much as technology that saved a local-office head time by requiring him or her to just press a button as he or she shut off the lights on Saturday night. The key to success for my “surveys” was that they were designed to be open-ended (able to go in a new direction during the interview, and leaving space for whatever the interviewer might have left out), interviewee-driven (they started by letting the interviewee tell a story as he or she saw it), and flexible in the kind of data collected (typically, an IT organization did not know the overall costs of database administration for their organization [and in a survey, they would have guessed – badly], but they almost invariably knew how many database instances per administrator).

As it turns out, there is a comparable statistical approach for the data-analysis side of things. It’s called Exploratory Data Analysis, or EDA.

As it has evolved in the decades since John Tukey first popularized it, EDA is about analyzing smaller amounts of data to generate as many plausible hypotheses (or “patterns in the data”) as possible, before winnowing them down with further data. To further clear the statistical researcher’s mind of bias, the technique creates abstract unlabeled visualizations (“data visualization”) of the patterns, such as the strangely-named box-and-whisker plot. The analysis is not deep – but it identifies far more hypotheses, and therefore quite a few more areas where in-depth analysis may reveal key insights. The automation of these techniques has made the application of EDA a minor blip in the average analyst’s process, and so effective use of EDA should yield a major improvement in analytics effectiveness “at the margin” (in the resulting in-depth analyses) for a very small time “overhead cost.” In fact, EDA has reached the point, as in the Orange open-source solution, where it is merged with a full-fledged data-mining tool.

And yet, I find that most in university research and in industry are barely aware that EDA exists, much less that it might have some significant use. For a while, SAS’ JMP product stood bravely and alone as a tool that could at least potentially be used by businesses – but I note that according to Wikipedia they have recently discontinued support for its use on Linux.

So let’s summarize: EDA is out there. It’s easy to use. Now that statistical analysis in general is creeping into greater use in analytics, users are ready for it. I fully anticipate that it would have major positive effects on in-depth analytics for enterprises from the very largest down at least to the larger medium-sized ones. IT shops will have to do some customization and integration themselves, because most if not all vendors have not yet fully integrated it as part of the analytics process in their BI suites; but with open-source and other “standard” EDA tools, that’s not inordinately difficult. The only thing lacking is for somebody, anybody, to wake up and pay attention.

The Relevance of Orange EDA to Statistical-Analysis-Type BI

Orange’s relevance may already be apparent from the above, but I’ll say it again anyway. Orange’s EDA solution includes integration with enterprise-type data-mining analytics, and supports a wide range of data visualization techniques, making it a leadership supplier in “fit to your enterprise’s analytics.” Orange is open source, which means it’s as cheap as you can get for quick-and-dirty, and also means it’s not going to go away. Most importantly, Orange lays down a solid, relatively standardized foundation that should be easy to incorporate or upgrade from, when someday the major vendors finally move into the area and provide fancier techniques and better integration with a full-fledged BI suite. That’s all; and that’s plenty.

Potential Uses of Orange-Type EDA in Analytics for IT

Since IT will need to do some of the initial legwork here, without the usual help from one’s BI supplier, the most effective initial use of Orange-type EDA is in support of the longer-term efforts of today’s business analysts, and not in IT-driven agile BI. However, IT should find these business analysts to be surprisingly receptive – or, at the least, as recent surveys suggest, amazed that IT isn’t being a “boat anchor” yet again. You see, EDA has a sheen of “innovation” about it, and so folks who are in some way associated with the business’ “innovation” efforts should like it a lot. The rest is simply a matter of its becoming part of these business analysts' steadily accumulating toolkit of rapid-query-generation and statistical-in-depth-insight-at-the-margin tools. EDA may not in the normal course of usage get the glory of notice as the source of a new competition-killer; but with a little assiduous use-case monitoring by IT, the business case can be made.

It is equally important for IT to note that EDA is twice as effective if it is joined at the front end by a data-gathering process that is to a much greater extent (to recap) open-ended, customer-driven, and flexible (in fact, agile) in the type of data gathered. Remember, there are ways of doing this – such as parallel in-depth customer interviews or Internet surveys that don’t just parrot SurveyMonkey – that add very little “overhead” to data-gathering. IT should seriously consider doing this as well, and preferably design the data-gathering process so as to feed the gathered data to Orange-type EDA tools where in-depth statistical analysis of that data will probably be appropriate as the next step. The overall effect will be like replacing a steadily narrowing view of the data with one that expands the potential analyses until the right balance between “data blindness” and “paralysis by analysis” risks is reached.

The Bottom Line for IT Buyers

To view Orange-type EDA as comparable to the other BI technologies/solutions I have discussed so far is to miss the point. EDA is much more like agile development – its main value lies in changing our analytics methodology, not in improving analytics itself. It helps the organization itself to think not “outside the box”, but “outside the organization” – to be able to combine the viewpoint of the vendor with the viewpoint and reality of the customer, rather than trying to force customer interactions into corporate fantasies of the way customers should think and act for maximum vendor profit. We have all seen the major public-relations disaster of Bank of America charges for debit cards – one that, if we were honest, we would admit most other enterprises find it all too easy to stumble into. If EDA (or, better still, EDA plus open-ended, customer-driven, flexible data-gathering) prevents only one such misstep, it will have paid for itself ten times over, no matter what the numbers say. In a nutshell: EDA seems like it’s about competitive advantage; that’s true as far as it goes, but EDA is actually much more about business risk.

The Orange value proposition for such uses of EDA has been noted twice already; no need to repeat it a third time. For IT buyers, it simply means that any time you decide to do EDA, Orange is there as part of a rather short short list. So that leaves the IT buyer’s final question: what’s the hurry?

And, of course, since EDA is about competitive advantage (sarcasm), there is no hurry. Unless you consider the possibility that each non-EDA enterprise is a bit like a drunk staggering along a sidewalk who has just knocked over the fence bordering an abyss, and who if he then happens to stagger over the edge is busy blaming the owner of the fence (the CEO?) all the way to the bottom. That abyss is the risk of offending the customer. That inebriation is business as usual. EDA helps you sober up, fast.

I can’t say that you have to implement EDA now or you’ll fall. But do you really want to risk doing nothing?

Thursday, January 26, 2012

The Other Agile Development: Thoughtworks and Continuous Delivery

This blog post highlights a software company and technology that I view as potentially useful to organizations investing in agile development, agile new product development, and business agility over the next few years. Note that, in my opinion, this company and solution are not typically “top of the mind” when we talk about agile development today.

The Importance of Continuous Delivery to Agile Development

One of the most enjoyable parts of writing about “the other” in general and “the other agile development” in particular is that it allows me to revisit and go more in-depth on cool new technologies. And this is a cool new technology if ever there was one.

Continuous Delivery, as Thoughtworks presents it, aims to develop, upgrade, and evolve software by constant, incremental bug fixes, changes, and addition of features (note: CD is not to be confused with Continuous Integration, which I hope to cover in a future blog post). The example cited is that of Flickr, the photo sharing site, which is using Continuous Delivery to change its production web site at the rate of ten or more changes per day. Continuous Delivery achieves this rate not only by overlapping development of these changes, but also by modularizing them in small chunks that still “add value” to the end user, as well as by shortening the process from idea to deployment to less than a day in many cases.

Continuous Delivery, therefore, is a logical end point of the whole idea of agile development – and, indeed, agile development processes are the way that Thoughtworks and Flickr choose to achieve this end point. Close, constant interaction with customers/end users is in there; so is the idea of changing directions rapidly, either within each feature’s development process or by a follow-on short process that modifies the original. Operations and development, as well as testing and development, are far more intertwined. The shortness of the process allows such efficiencies as “trunk-based development”, in which the process specifically forbids multi-person parallel development “branches” and thus avoids their inevitable communication and collaboration time, which in a short process turns out to be greater than the time saved by parallelization.

Now, let’s take a really broad view of Continuous Delivery. Unfortunately, blog posts are not great at handling graphs, so I’d like you to visualize in your head a graph with two axes, Features and Time. Over time, in each graph, the user’s need for features in a product and solution tends to go up at a fairly steady rate over time, as do consumer needs in general. What varies is how well the vendor(s) supply those needs.

The old model – as old as markets and technology – of what happened was this: somewhere between each version and the next, the disconnect between what the consumer wants and what the product delivers becomes too great, and at that point the vendor starts developing a new version, based on where the consumer is right now (plus a very minor projection into the future which we’ll ignore). For the most part, during this 6 month-2 year development process, the original spec does not change; so for 6 months to 2 years before another version comes out, no or few new features are added – but meanwhile, consumers start looking for new features on top of what they already wanted. The result is a stair-step progression, in which each new version takes the product only partway to meeting the consumer’s needs at that time, and the space between the user line and the vendor line represents lost sales and consumer frustration. However, since every other vendor is doing the same thing, no harm, no foul.

Now consider agile development. Agile development, remember, is about rapid delivery of incremental time-to-value, plus frequent changes to the spec based on end-user input. What that looks like in our graph, more or less, is a stairstep in which each step takes a shorter amount of time, and each “rise” is much closer to the level of user need at that time – but we’re still a little bit behind.

Conceptually, Continuous Delivery takes that idea almost as far as it can be taken. Now, our graph looks more like a squiggly product line overlapping the user need line. And here’s the key point: it actually goes above the user need line, just by a little, frequently.

How can that be, you say? Well, despite the way we disparage technology-driven products compared to need-driven ones, the fact remains that sometimes techies anticipate consumer needs. The typical way is that implicit in the design of the product are future features that the end user, with his or her tunnel vision on immediate frustrations, will never think of. It is the developer who suggests these to the user, not the other way around, or the developer who puts these in the product “for free”, understanding that since they are a logical technical evolution of the design, the user will see them as less strange and simpler to use. This may sound risky to the development manager, but in point of fact this is a minimal-risk kind of customer anticipation, with minimal impact on customer frustration even if it doesn’t pan out, and maximal impact on the consumer’s image of “the brand that anticipates my needs”.

One more variant on the graph: suppose we are talking about New Product Development (NPD) in general. Well, one thing about agile software development is that software is becoming an increasing part of competitive advantage in “hardware” and “services” across most industries. In other words, the development of a new “hardware” or “services” product now typically includes a fair amount of software, whose development is in-house, outsourced, or assigned to packaged-software vendors. In each of these cases, application of agile development processes produces a “mix” between the traditional-graph vendor line and the agile-development one. Visually, the “steps” between “rises” are no longer flat, but broken into little “mini-rises” and “ministeps” that take you a little closer to the user needs line. Continuous Delivery on software that is half of NPD effectively eliminates about half of the lost sales and customer frustration from the traditional approach.

Do you remember that I said: “CD takes the idea almost as far as it can be taken?” Well, the one thing agile development via CD doesn’t handle is a big jump in consumer needs because the consumer wants one of the features that is being developed over in the next county – “disruptive” technology. For instance, Apple really put a hurt on other user-interface vendors when it tweaked touch-screen technology for the iPhone. However, the software in other cell-phone and computer products at least allowed a more rapid partial response to the threat. So CD isn’t a cure-all – it just comes amazingly close to it. How to “go the final mile” is a discussion for a future blog post.

Let’s summarize: CD is an incredibly cool and incredibly useful technology, both to the vendor and to the consumer, because it results in a major increase in sales for the vendor and a major increase in satisfaction for the consumer. Moreover, because it’s also cheaper than traditional software development, both vendors and consumers see their costs decrease as the vendor’s use of CD in NPD rises (and for the typical business, that’s in addition to the cost savings and competitive advantage from more rapid development of better business-process software). Finally, because their needs are both satisfied and anticipated, customers become far more loyal, reducing business risk drastically.

Really minor nit: I note that CD is often applied strictly to the delivery stage of development. To my mind, extension to the entire process is appropriate, because delivery is the only major development stage (if we exclude operations after delivery) where the agile development methodology today is often applied minimally. In other words, “delivery” can mean one stage or the whole process, and in the real world if you make that one stage agile you are usually making the whole process agile – so it’s a good idea to emphasize the point of the whole exercise by equating continuous completion and continuous delivery.

The Relevance of Thoughtworks to Continuous Delivery

Thoughtworks is one of those smaller consultancies that took the Agile Manifesto’s ideas and ran with them while large development-tool vendors focused on other, ultimately far less effective, “best practices.” I have had my differences with Thoughtworks in the past, but always from a position of respect for an organization that clearly has agile development in its DNA to a greater extent than most. In the case of CD, a quick scan of the first page of a Google search on Continuous Delivery reveals no one else visibly applying it to the extent that Thoughtworks claims to be doing.

Does that mean that Thoughtworks is a stable vendor for the long term? One of the fascinating things about the agile-development market is that the question matters to a much lesser extent than in all previous markets. Look, in the early years some tried SCRUM and some tried extreme programming and some tried half-way solutions like slightly modified spiral programming, but it didn’t matter in the long run: the Wipros of the world have still done almost universally better than the startups focusing on Java or even folks like Cambridge Technology Partners. And that’s because agile development firms are, well, agile. It doesn’t just rub off on developers; some of it rubs off on managers, and strategists, and even, Ghu help us, on CEOs. They focus on changes; so, on average, they evolve more effectively. That’s as true of Thoughtworks as anyone else.

What isn’t true of most other agile development vendors, right now, is that Thoughtworks appears to have a significant edge in experience in the CD extension of agile development. That matters because, as I’ve said, something like Thoughtworks-type CD is the logical endpoint of agile development. So, if you want to get to maximal agile-development benefits sooner rather than later, it certainly seems as if Thoughtworks should be on your short list.

One point here requires elaboration: there is sometimes a misconception that outside providers are selling you agile-development services. That’s at least partially wrong. They are – or should be – fundamentally selling you agile-development training. They will make their money from being ahead of you in experience, and constantly selling you the improvements in agile development that their experience teaches them. Think of them as more like business-strategy consultants, always looking ahead to the new strategic idea and delivering that to you. Believe me, that’s just as valuable as running your data center – and is often more valuable than that.

Thus, Thoughtworks’ advantage in experience is not to be sneezed at. How well it will hold up over time, we will see. However, considering that the bulk of software development, even if it has adopted a modified version of SCRUM en masse, is still typically not very successful in embedding frequent user feedback into the typical project, I would say that Thoughtworks’ edge should last for at least a couple of years – an eon in the timeframe of the agile business.

Potential Uses of Thoughtworks-Type Agile Development for IT

An IT organization must crawl before it can walk; but it should also learn about walking before it tries to do so. That means that if IT is still at the early stages of adopting agile development, it should still apply CD to a “skunkworks” project, and if it doesn’t have such a project, it should create one.

Otherwise, this is not a “targets of opportunity” situation, but rather a “learn and merge” one. IT should bring CD on board as each project is ready for it, no faster, no slower. In my opinion, “ready” means that a project’s development process has (a) adequate process management tools specifically tuned to support agile development, thus allowing it to scale, (b) an adequate “store” of reusable infrastructure software to build on, so that moving to the next incremental feature is not too great a leap, and (c) an attitude from everyone involved that the first thing you do when you get something new is that you make it agile. That’s all. Print and ship.

Well, but today’s CD isn’t adapted to the peculiar needs of my development. Excuse me? You did say your process was agile, didn’t you? Believe me, CD is not only flexible but agile, and if you don’t know the difference, then you need far more help in achieving agility than you realize. What you’re really saying is that you don’t have an agile development process at all, because otherwise it would be straightforward to adapt your methodology to move steadily towards CD – using a vendor just speeds up the process change.

The Bottom Line for IT Buyers

I am always wary of being too enthusiastic about new technologies, because I remember a story in Gore Vidal’s Lincoln. A rich man has a tendency to exaggerate, and hires someone to nudge him at table when he does so. “How was your trip to Egypt?” “Amazing! Why, they have things called pyramids made of pure gold!” Nudge. “And they go a mile high!” Hard stamp on his foot, just as someone asks, “How wide?” He answers, in agony, “About a foot.” I worry that while any given technology’s value may seem a mile high, its real-world application will be about a foot wide.

That said, I really do see Continuous Delivery as a cool new technology that will have an impact, eventually, that will be a mile high and globe-wide. Here’s what I said in a previous post:

[[The real-world success of Continuous Delivery, I assert, signals a Third Age, in which software development is not only fast in aggregate, but also fast in unitary terms – so fast as to make the process of upgrade of a unitary application by feature additions and changes seem “continuous”. Because of the Second Age, software is now pervasive in products and services. Add the new capabilities, and all software-infused products/services -- all products/services – start changing constantly, to the point where we start viewing continuous product change as natural. Products and services that are fundamentally dynamic, not successions of static versions, are a fundamental, massive change to the global economy.

But it goes even further. These Continuous-Delivery product changes also more closely track changes in end user needs. They also increase the chances of success of introductions of the “new, new thing” in technology that are vital to a thriving, growing global economy, because those introductions are based on an understanding of end user needs at this precise moment in time, not two years ago. According to my definition of agility – rapid, effective reactive and proactive changes – they make products and services truly agile. The new world of Continuous Delivery is not just an almost completely dynamic world. It is an almost Agile World. The only un-agile parts are the rest of the company processes besides software development that continue, behind the scenes of rapidly changing products, to patch up fundamentally un-agile approaches in the same old ways.]]

But you don’t need to know about that. What you need to know is that for every IT organization that appreciates agile development, kicking the tires or adopting CD is a good idea, right now. I don’t even have to say it’s necessary, because that’s not the way an agile organization operates.

As for Thoughtworks, here’s what I think IT’s attitude should be. I have often trashed Sun for an ad that said “We built the Internet. Let us build your Internet.” I knew, based on personal experience, that this claim was, to say the least, exaggerated. Well, if Thoughtworks came to your door with a pitch that said “We built Continuous Delivery. Let us build your Continuous Delivery,” I would not only not trash them, I would encourage you to believe them, and consider doing as they request. The pyramid is that high. The materials with which it is built are that valuable. And my foot is still intact.

Wednesday, January 25, 2012

The Other BI: Oracle TimesTen and In-Memory-Database Streaming BI

This blog post highlights a software company and technology that I view as potentially useful to organizations investing in business intelligence (BI) and analytics in the next few years. Note that, in my opinion, this company and solution are not typically “top of the mind” when we talk about BI today.

The Importance of TimesTen-Type In-Memory Database Technology to BI

All right, now I’m really stretching the definition of “other”. Let’s face it, Oracle is “top of the mind” when we talk about BI, and they recently announced a TimesTen appliance, so TimesTen is not an invisible product, either. And finally, the hoopla about SAP HANA means that in-memory database technology itself is probably presently pretty close to the center of IT’s radar screen.

So why do I think Oracle’s TimesTen is in some sense not “top of the mind”? Answer: because there are potential applications of in-memory databases in BI for which the technology itself, much less any vendor’s in-memory database solution, is not a visible presence. In particular, I am talking about in-memory streaming databases.

To understand the relevance of in-memory databases to complex event processing and BI, let’s review the present use cases of in-memory databases. Originally, in-memory technology was just the thing for analyzing medium-scale amounts of financial-market information in real time, information such as constantly changing stock prices. Lately, in-memory databases have added two more BI duties: (a) serving as a “cache” database for enterprise databases, to speed up massive BI where smaller chunks of data could be localized, and (b) serving as a really-high-performance platform for mission-critical small-to-medium-scale BI applications that require less scaling year-to-year, such as some SMB reporting. These new tasks have arrived because rapid growth in main-memory storage has inevitably allowed in-memory databases to tackle a greater share of existing IT data-processing needs. To put it another way, when you have an application that is always going to require 100 GB of storage, sooner or later it makes sense to use an in-memory database and drop the old disk-based one, because in-memory database performance will typically be up to 10-100 times faster.

Now let’s consider event-processing or “streaming” databases. Their main constraint today in many cases is how much historical context they can access in real-time in order to deepen their analysis of incoming data before they have to make a routing or alerting decision. If that data can be accessed in main memory instead of disk, effectively up to 10-100 times the amount of “context” information can be brought to bear in the analysis in the same amount of time.

In other words, for streaming BI, IT potentially has two choices – a traditional event-processing database that is often entirely separate from a back-end disk-based database, or (2) a traditional main-memory database already pre-optimized for in-depth main-memory analysis and usually pre-integrated with a disk-based database (as TimesTen is with Oracle Database) as a “cache database” in cases where disk must be accessed. How to choose between the two? Well, if you don’t need much historical context for analysis, the event-processing database probably has the edge – but if you’re looking to upgrade your streaming BI, that’s not likely to be the case. In other cases, such as those where the processing is “routing-light” and “analysis-heavy”, an in-memory database not yet optimized for routing but far more optimized for in-depth analytics performance would seem to make more sense.

Thus, one way of looking at the use case of in-memory database event processing is to distinguish between in-enterprise and extra-enterprise data streams (more or less). Big Data is an example of an extra-enterprise stream, and can involve a fire hose of “sensor-driven Web” (GPS) and social media data that needs routing and alerting as much as it needs analytics. Business-critical-application-destined and embedded-analytics data streams are an example of in-enterprise data, even if admixed with a little extra-enterprise data; they require heavier-duty cross-analysis of smaller data streams. For these, the in-memory database’s deeper analysis before a split-second decision is made is probably worth its weight in gold, as it is in the traditional financial in-memory-database use case.

Won’t having two databases carrying out the general task of handling streaming data complicate the enterprise architecture? Not really. Past experience shows us that using multiple databases for finer-grained performance optimization actually decreases administrative costs, since the second database, at least, is typically much more “near-lights-out,” while switching between databases doesn’t affect users at all, because a database is infrastructure software that presents the same standard SQL-derivative interfaces no matter what the variant. And, of course, the boundary between event-processing database use cases and in-memory ones is flexible, allowing new ways of evolving performance optimization as user needs change.

The Relevance of Oracle TimesTen to Streaming BI

In many ways, TimesTen is the granddaddy of in-memory databases, a solution that I have been following for fifteen years. It therefore has leadership status in in-memory database use-case experience, and especially in the financial-industry stock-market-data applications that resemble my streaming-BI use case as described above. What Oracle has added since the acquisition is database-cache implementation and experience, especially integrated with Oracle Database. At the same time, TimesTen remains separable at need from other Oracle database products, as in the new TimesTen Appliance.

These characteristics make TimesTen a prime contender for the potential in-memory streaming BI market. Where SAP HANA is a work in progress, and approaches like Volt are perhaps less well integrated with enterprise databases, TimeTen and IBM’s solidDB stand out as combining both in-memory original design and database-cache experience – and of these two, TimesTen has the longer in-memory-database pedigree.

It may seem odd of me to say nice things about Oracle TimesTen, after recent events have raised questions in my mind about Oracle BI pricing, long-term hardware growth path, and possible over-reliance on appliances. However, inherently an in-memory database is much less expensive than an enterprise database. Thus, users appear to have full flexibility to use TimesTen separately from other Oracle solutions, free from worries about possible long-term effects of vendor lock-in.

Potential Uses of TimesTen-Type In-Memory Streaming BI for IT

As noted above, the obvious IT use cases for TimesTen-type streaming BI lie in driving deeper analysis in in-enterprise streaming applications. In particular, in the embedded-analytics area, in-memory performance speedups can allow consideration of a wider array of systems-management data in fine-tuning downtime-threat and performance-slowdown detection. In the real-time analytics area, an in-memory database might be of particular use in avoiding “over-steering”, as when predictable variations in inventories cause overstocking because of lack of historical context. In the Big Data area, an in-memory database might apply where the data has been pre-winnowed to certain customers, and a deeper analysis of those customers fine-tunes an ad campaign. For example, within a half-hour of the end of the game, Dick’s Sporting Goods had sent me an offer of a Patriots’ AFC Championship T-shirt, complete with visualization of the actual T-shirt – a reasonably well-targeted email. That’s something that’s far easier to do with an in-memory database.

IT should also consider the likely evolution of both event-processing and in-memory databases over the next few years, as their capabilities will likely become more similar. Here, the point is that event-processing databases often started out not with data-management tools, but with file-management ones – making them significantly less optimized “from the ground up” for analysis of data in main memory. Still, event-processing databases such as Progress Apama may retain their event-handling, routing, and alerting advantages, and thus the situation in which in-memory is better for in-enterprise and event-processing is better for extra-enterprise is likely to continue. In the meanwhile, increasing use of in-memory databases for the older use cases cited above means that in-memory streaming-BI databases offer an excellent way of gaining experience in their use, before they become ubiquitous. That, in turn, means that narrow initial “targets of opportunity” in one of the situations cited in the previous paragraph are a good idea, whatever the scope of one’s overall in-memory database commitment right now.

The Bottom Line for IT Buyers

In some ways, this is the least urgent and most speculative of the “other BI” solutions I have discussed so far. We are, after all, discussing additional performance and deeper analytics in a particular subset of IT’s needs, and in an area where the technology of in-memory databases and their event-processor alternatives is moving ahead rapidly. In a sense, this is really an opportunity for those IT shops that specialize in applying a little extra effort and “designing smarter” across multiple new technologies to provide a nice ongoing competitive advantage. For the rest, if the shoe can easily be made to fit, why not wear it?

My suggestion for most IT buyers, therefore, is therefore to have a “back-pocket” in-memory-database-for-streaming-BI short list that can be whipped out at the appropriate time. Imho, Oracle TimesTen right now should be on that list.

I hate to close without noting the overall long-term BI potential of in-memory databases. The future of in-memory databases is not, in my firm opinion, to supersede the IBM DB2s, Oracle Databases, and Microsoft SQL Servers of the world, at any time in the next four years. The hardware technologies to enable such a thing are not yet clear, much less competitive. Rather, the value of in-memory databases is to allow us to optimize our querying for both main-memory and disk storage – which are two very different things, and which will both apply appropriately to many key customer needs over the next few years. Overall, the effect will be another major ongoing jump in data-processing performance. As we enter this new database-technology era, those who initially kick the tires in a wider variety of BI projects will find themselves with a significant “experience” advantage over the rest, especially because the key to outstanding success will be determining the appropriate boundary between disk-based and in-memory database usage. Don’t force in-memory streaming BI into the organization. Do keep checking to see if it will fit your immediate needs. Sooner or later, it probably will.

Friday, January 20, 2012

The Other Agile Development: AccuRev SCM and Code Refactoring

Thursday, January 19, 2012

Methane Update: Less Worried, Still Very Worried

I just saw an interview by Skeptical Science with Ms. Sharapova, a Russian scientist, on the Russian findings that sparked my recent methane worries. Her responses clarified the answers to the main questions I had about Arctic sea methane clathrates. Her key information, imho, was the following:

1. The Russians recently discovered that methane clathrates could form not just in the 200-1000 m depth range, but also in the 20-200 m range.
2. They also found that clathrates there did not just melt from the top down; they also melted in pockets below the surface melt.
3. The 2011 survey, for the first time, looked at the 20-200 m Siberian coastal shelf, rather than the 200-2000 m deeper waters.

Let’s look at the implications for my methane analysis. In the first place, this explains why methane was able to bubble to the surface, instead of popping or being eaten by methane-munching bacteria: it was too close to the surface, and especially if, as appears to have been the case, it was being released in larger chunks/bubbles.

In the second place, this appears to indicate that the ramp-up in methane emissions at any particular point is less than I feared. There are several possible reasons to anticipate that at greater depths, methane release from the sediment would ramp up more slowly than a 100-fold increase in one year. Likewise, there are several possible reasons to anticipate that initial methane releases from the shallow continental shelves would be greater than that from deeper areas, if there were methane clathrates there in the first place.

However, in the third place, this newly discovered source of methane clathrates appears to be a much bigger source of emissions, both in terms of melting more rapidly and of having more methane stored to begin with. Because sea shelves slope more rapidly the deeper they get (to a point beyond 1000 m), the sea-surface area of 20-200 m deep shelves is comparable to the sea-surface area of 200-1000 m ones.

Under the surface, the methane clathrates can be stored much deeper before earth heating and pressure melt them. Take these two things together, and the amount of methane in Arctic clathrates may be 2-4 times the amount previously estimated. Meanwhile, this 20-200 m range lies almost entirely in the “shallow ocean” range where warming currents from the south plus warming of newly exposed surface waters by the summer sun create hotter water next to the sediment – and thus melt things faster. These points are confirmed by the observations of rapid bubble generation and much larger funnels from which methane flows in the 20-200 m range.

In the fourth place, the ability of 20-200 m methane bubbles to rise to the surface means that we probably grossly underestimate the percentage of emitted methane that will rise into the atmosphere as methane rather than carbon dioxide. Frankly, this is probably good news, since that means less of it will eventually stay in the atmosphere as carbon dioxide – but there’s still a chance it may stay as methane for a long time – and be far worse for global warming. This would happen if there’s too much methane up there and the OH in the atmosphere that removes a lot of that methane runs out, a possibility some scientists have raised.

In the fifth place, the existence of pockets indicates that methane emissions may be bursty, as surface melt “burns through” to those pockets that are themselves melted, but are still trapped by frozen clathrates above. Those bursts should be frequent enough to keep methane emissions at a higher yearly rate.

Implications

Overall, this makes me a little more hopeful about overall methane emissions and their effect on global warming. While there is perhaps 2-4 times the amount of methane clathrates to emit than I thought, there may be 30-70% lower emission rates than I anticipated, and that’s the key to methane’s overall effect in the next 160 years, when it matters. To put it another way, the net methane emission rates per year should be lower than I expected, and the amount in the atmosphere as methane in the next 160 years should be lower than my worst-case all-methane scenario (assuming there’s enough OH). Hopefully, the amount of carbon dioxide should be lower as well, because of the decreased yearly emissions amount and increased percentage arriving as methane (only half of that turns into carbon dioxide). However, this isn’t sure, because if the OH runs out, the effect of yet more “steady state” methane over, say, 600 years on global warming will be worse than I had anticipated.

All in all, I would now tend to put the likely overall new natural-source methane emission effects (also including permafrost and wetlands) in the 3-6 degree C range over the next over the next 200 years, and in the 2-5 degree Celsius range in the 400 years after that – overall, perhaps a 25% boost to global warming rather than a 50% one, protracted over more years. High water may not be delayed, but hell may be a little less hellish in temperature, and the end of life on earth ever so slightly less likely, than I feared.

Unless, of course, the OH in the atmosphere runs out … the worries never end, do they?

Saturday, January 14, 2012

A Few Books That "Rocked My World"

I recently saw a list of “100 books that rocked my world” from a blogger. It turned out to be a list of “books that are really cool” and not “books that made me think differently in fundamental ways.” So I thought I’d look back and ask myself, years later, what books changed me fundamentally?

1. Godel’s Proof, Nagel and Freeman. As Shaw says in “Man and Superman”, it made me want to “think more, so that I would be more.” The idea of there being some things that I will never know or prove, is something that I am still wrestling with.

2. Spark, Ratey. The idea that we fluctuate chemically between addiction to pessimism and addiction to optimism based on whether we are channeling our hunter ancestors seeking prey by exercising in company, between learning by moderate physical stress and forgetting based on sedentary habits, between inoculating against disorders by moderately poisoning ourselves with food and moderately stressing ourselves with exercise and opening ourselves to disease and death by eating unchallenging foods and avoiding challenging exercise, seems to apply to and alter every aspect of my life.

3. The Age of Diminished Expectations, Krugman. It began to give me an ability not only to understand my sense of disconnect between conventional economics and what was happening to me in the real world, but to apply new tools to understand and improve the broad scope of my life in terms of money and generational cycles – something I’d been looking for over 20 years of college and work. Of course, I needed several additional books and articles to understand the full scope of Krugman’s approach.

4. The Fellowship of the Ring, Tolkien. I wasn’t expecting what I got when I scrounged my Dad’s library for yet more books at age 12. Suddenly, I was able to see the non-human world around me as a separate, interconnected, wonderful, alive thing. And the idea that you could frame a book or part of a life as the necessary preface leading to the beginning of a journey – like Frodo’s, stripped of his teachers, into Mordor – made me see that my life could be seen that way – and that has always given me hope.

5. Falconer, Cheever. This one is painful. I had to ask myself, after it was over, am I, like the protagonist, seeing my relationships with women too much in terms of my own needs, or can I finally grow up? I don’t know how much I’ve changed in my behavior after reading it; but I know I can never think the way I used to about relationships without far greater discomfort and dissatisfaction with myself.

6. How to Win Friends and Influence People, Dale Carnegie. There are many, many things wrong with this book, as I have come to realize. But it gave me the humility of understanding that my artsy and intellectual achievements were of very little value to others, and showed me that I really liked people, if I just listened to them. It also gave me a basis for understanding people’s social viewpoint that has allowed me to connect, slowly, over 30 years.

7. A Connecticut Yankee at King Arthur’s Court, Twain. I don’t think I recognized this at the time, but it was my introduction to what I might call the “science fiction viewpoint”: the idea that by facing scientific facts and leveraging technology, you could make a fundamental change for good in the world – the true meaning of “progress”. I can never quite shake that idea, and it has led me in quite a different direction from the rest of my family, into math and computers, and away from Great Literature that had few solutions to offer. Of course, I never quite accepted Twain’s other idea: that this progress could vanish like the mist from history, resisted by willfully ignorant humans frightened of change, unless you were lucky.

Friday, January 13, 2012

The Other BI: HP Vertica and Columnar Databases

This blog post highlights a software company and technology that I view as potentially useful to organizations investing in business intelligence (BI) and analytics in the next few years. Note that, in my opinion, this company and solution are not typically “top of the mind” when we talk about BI today.

The Importance of Vertica-Type Columnar Database Technology to BI

Last year, I wrote a blog post saying that it was likely that HP would underestimate the columnar database technology in Vertica, and if so they were missing a major opportunity. In the last year, HP has been pretty quiet about Vertica, but I have partially changed my mind, to the point where I want to call attention to Vertica as a less visible candidate for IT buyers to get the full benefits of columnar database technology over the next 2-3 years.

Let’s start with columnar technology. Here, I want to go more in-depth into Vertica’s core technology than usual, because it’s an excellent way to begin to see the benefits of columnar beyond traditional row-oriented databases.

The original idea of Vertica was to recast the relational database to focus on the (data warehousing) case where there are few if any updates. The redesign started with the idea that the data should be stored in "columns" rather than rows; the details of this are that the columns themselves (because they don't have to follow relational dogma) can be stored in a highly compressed format, with lots of compression techniques like inverted list, bit-mapped indexing, and hashing, as appropriate. Thus, (a) the database can use the column format to zero in faster on the data that the query is gathering, (b) because the data is compressed an average of 10 times (according to Vertica), more data can be crammed into main memory for faster processing. Result: a claimed 10-100 times speedup in performance, comparable to in-memory databases but far more scalable. It also means the database can handle at least 10 times more data (say, 100 terabytes instead of 5) with the same performance for a given query; or that the data center can use an order of magnitude less storage.

Now, all this does not come without a cost, and the typical cost would at first seem to be speed of updating. That is, the column storage format requires more revision of the data stored on disk when an update arrives, so update is slower. But this is counteracted by the ability to load more of localized data at once into main memory in a compressed form, for faster in-memory updating. Only at update frequencies typical of old-style operational online transaction processing (OLTP) does the row-oriented relational database have a clear edge.

The elaboration of the design in Vertica is that the basic data is also stored as "projections" (aka materialized views). That is, a set of columns in a tuple is stored one (relational) way; each column also shows up in a projection, but the projection is cross-tuple (one from tuple A, one from tuple B, etc.). This accomplishes two things: one, it gives an alternative way of querying which may be faster than basic storage, and two, it gives redundancy and therefore robustness, in a similar way to RAID 5 (projections can be "striped" across disks).

Now, here's where things get really interesting. Practically speaking, today, in data-warehousing-type databases, updates via "load windows" are becoming more and more frequent, to the point where data is pretty up-to-date and updates are a bigger part of data warehousing. To keep "write locks" from gumming up performance (especially with column update being slower), Vertica splits the storage into a write-optimized column store (WOS; effectively, a cache) and a Read-optimized Column Store (ROS). Periodically, the WOS becomes the ROS. So the write locks for the updates only interfere with reads when there’s a mass update. At the same time, such a mass update can re-store whole chunks of the ROS for optimum storage efficiency. Moreover, to gain currency, the query can be carried out across the ROS and WOS. And, because there is all this redundancy, there is no need for logs—another performance improvement. Note that because of its redundancy, Vertica doesn't need to do roll-back/roll-forward nor backup/restore.

The net of all this for IT buyers is that columnar databases in general, and Vertica in particular, should be able to deliver on average much better performance than traditional relational databases in the majority of not-highly-update-intensive cases, due mostly to its compression abilities, and that addition of other technologies like in-memory technology to both alternatives will not alter this superiority.

The Relevance of HP Vertica to BI

This kind of approach cries out for integration with or development of sophisticated admin tools, expansion beyond data warehousing and analytics to “mixed” transactions in competition with the noSQL fad, better programming tools to build up a war chest of business/industry customized solutions, and using a relational database as an OLTP complement. The resulting data-management platform would be a solid alternative for all sizes of enterprise to the “relational fits all” or “let the thousand flowers bloom” strategies of most organizations.

Once this platform is in place, it needs to become the keystone of enterprise architectures, not just an analytics or business intelligence “super-scaling” engine. That means adding integration with semi-structured and unstructured data. It also means adding major functionality for handling content, and integration with storage software for additional performance optimization. And so, anticipating that HP would not do this, I criticized the HP acquisition of Vertica last year.

Well, two things happened: HP did more than I thought it would, and competitors did less. HP bought a company called Autonomy, which added semi-structured/unstructured data support. Necessarily, this takes Vertica beyond pure data-warehousing-style analytics into a more update-intensive world, and HP’s redirection of Mercury Interactive towards agile ALM (application lifecycle management) associated Vertica with better programming tools. Meanwhile, SAP took its eye off Sybase IQ with its focus on HANA, IBM at least temporarily walked away from its Netezza semi-columnar database technology, and Oracle’s columnar-optional appliance ran into questions about its long-term hardware growth path. In other words, the result of half a loaf from HP and less than half a loaf from everyone else is that Vertica is moving towards leadership status in delivering columnar database technology to all scales of BI and analytics.

Meanwhile, of course, only the deluded think that HP will suddenly vanish, while database technology and the rest of the new software embed themselves ever deeper in HP’s DNA. HP Vertica is going to be around for quite a while; and it will be an attractive option for quite a while.

Potential Uses of Vertica-Type Columnar-Based BI for IT

The use cases of a columnar database IT is straightforward. IT should use a columnar database in new projects as an alternative or complement to a traditional relational database, unless the operations are update-intensive, in which case row-oriented relational is preferred. As a complement, columnar databases operate on a “switching” basis, in which an overall engine decides which queries should be allocated to row-oriented, which to columnar, usually on the basis of whether two or more of the “fields” involved in an operation can be compressed highly by using a columnar format. Oracle (and, until recently, IBM Netezza) takes this approach; but IT can also do its own switching mechanism.

And that’s it. Over the next 2-3 years, if not already, columnar can scale as high as querying, can integrate with as many data types and upper-level tools and applications, and can evolve to greater performance/scalability just as rapidly as the traditional row-oriented database. In the long run, in a lot of use cases, and sometimes in the short run, that favors Vertica-type columnar.

However, right now, columnar requires in some cases to “grow into” its assigned role in a new project, by adding administrative tools for particular cases. Therefore, in most applications where 24x7 operation and an adequate level of customer response time is business-critical, relational row-oriented should still be preferred. That should leave plenty of analytical and other BI uses for which Vertica-type columnar database software will deliver an important performance advantage.

The Bottom Line for IT Buyers

Over the next few years, IT buyers can take one of two views: the author of this blog post is prescient, columnar will replace row-oriented in the majority of new applications in BI and other areas, and we should include columnar in all our short lists from now on; or, the author of this blog post is wrong about the future, but columnar is useful for some things right now, and trying to standardize on one database is a fool’s game that we no longer bother to try to play. If IT buyers hold the second view, then they should be focused on applying columnar to analysis of huge amounts of structured data with “sparse” fields where high compression is achievable – like five-field customer names (Mr. John Taylor Jakes, Jr.) and product codes. Spend the resulting improvements on increased performance, lowered storage costs, or both.

Again, this is not a matter of a pre-short list, unless you have a “gray area” BI project involving somewhat update-intensive or somewhat business-critical little-downtime apps, in which case you want to wait for columnar to evolve a little. In all other cases, HP Vertica should go on the short list along with the obvious others, like Sybase IQ. Right now, Vertica appears to be ahead both in some of the needed features to adapt to new analytics needs and in speed of evolution. One never knows – but over the next year, that leadership role may continue.

Above all, IT buyers should not listen to any FUD from traditional relational vendors suggesting that this is yet another new technology, like object databases, that will eventually fall to earth with a thud. Columnar database technology proved its superiority in many situations long ago in the non-relational world, with CCA’s Model 204, and has found uses continuously since then, like bit-mapped indexing. Most times there’s a fair BI matchup, as with some of the TPC benchmarks of the last seven years, columnar comes out well ahead. Under whatever name, columnar database technology is not going away. Therefore, its markets will continue to grow relative to row-oriented relational. For IT buyers, acquiring columnar BI solutions like HP’s Vertica is simply being smart and getting a little ahead of the curve.