Wednesday, March 28, 2012

Oracle, EMC, IBM, and Big Data: Avoiding The One-Legged Marathon

Note: this is a repost of an article published in Nov. 2011 in another venue, and the vendors profiled here have all upgraded their Big Data stories significantly since then. In my opinion, it remains useful as a starting point for assessing their Big Data strategies, and for deciding how to implement a long-term IT Big Data strategy oneself.

In recent weeks, Oracle, EMC, and IBM have issued announcements that begin to flesh out in solutions their vision of Big Data, its opportunities, and its best practices. Each of these vendor solutions has significant advantages for particular users, and all three are works in progress.

However, as presented so far, Oracle’s and EMC’s approaches appear to have significant limitations compared to IBM’s. If the pair continues to follow the same Big Data strategies, I believe that many of their customers will find themselves significantly hampered in dealing with certain types of Big Data analysis over the long run – an experience somewhat like voluntarily restricting yourself to one leg while running a marathon.

Let’s start by reviewing the promise and architectural details of Big Data, then take a look at each vendor’s strategy in turn.

Big Data, Big Challenges
As I noted in an earlier piece, some of Big Data is a relabeling of the incessant scaling of existing corporate queries and their extension to internal semi-structured (e.g., corporate documents) and unstructured (e.g., video, audio, graphics) data. The part that matters the most to today’s enterprise, however, is the typically unstructured data that is an integral part of customers’ social-media channel – including Facebook, instant messaging, Twitter, and blogging. This is global, enormous, very fast-growing, and increasingly integral (according to IBM’s recent CMO survey) to the vital corporate task of engaging with a "customer of one" throughout a long-term relationship.

However, technically, handling this kind of Big Data is very different from handling a traditional data warehouse. Access mechanisms such as Hadoop/MapReduce combine open-source software, large numbers of small, PC-class servers, and a loosening of consistency constraints on distributed transactions (an approach called eventual consistency). The basic idea is to apply Big Data analytics to queries where it doesn’t matter if some users get “old” rather than the latest data, or if some users get an answer while others don’t. As a practical matter, this type of analytics is also prone to unexpected unavailability of data sources.
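To make the access pattern concrete, here is a minimal sketch of a MapReduce-style word count written in the Hadoop Streaming style (small scripts that read lines from stdin and emit tab-separated key/value pairs). The script and file names are invented for illustration, and nothing here is tied to any particular vendor's distribution.

#!/usr/bin/env python
# Minimal MapReduce-style word count, Hadoop Streaming style (illustrative only).
# Try it locally:  cat tweets.txt | python mr_count.py map | sort | python mr_count.py reduce
# On a cluster, the map phase would run in parallel on many small servers, with the
# framework sorting and shuffling the key/value pairs before the reduce phase.
import sys

def map_phase(lines):
    # Emit (token, 1) for every word in every incoming line.
    for line in lines:
        for token in line.strip().lower().split():
            print(f"{token}\t1")

def reduce_phase(lines):
    # Sum counts per token; assumes the input is sorted by token.
    current, total = None, 0
    for line in lines:
        token, count = line.rstrip("\n").split("\t")
        if token != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = token, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    phase = sys.argv[1] if len(sys.argv) > 1 else "map"
    (map_phase if phase == "map" else reduce_phase)(sys.stdin)

Note that the eventual-consistency caveat lives outside code like this: the files being counted may lag the live feed, which is exactly the kind of staleness the enterprise has to plan around.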

The enterprise cannot treat this data as just another BI data source. It differs fundamentally in that the enterprise can be far less sure that the data is current – or even available at all times. So, scheduled reporting or business-critical computing based on Big Data is much more difficult to pull off. On the other hand, this is data that would otherwise be unavailable for BI or analytics processes – and because of the approach to building solutions, should be exceptionally low-cost to access.

However, pointing the raw data at existing BI tools would be like pointing a fire hose at your mouth, with similarly painful results. Instead, the savvy IT organization will have plans in place to filter Big Data before it begins to access that data.
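As a rough illustration of what filtering before access might look like, the hypothetical sketch below slims a raw stream of social-media posts down to recent, brand-relevant records carrying only the fields a BI tool would need. The field names and keywords are invented for the example.

# Hypothetical pre-filter for a raw social-media stream (all names are illustrative).
from datetime import datetime, timedelta, timezone

KEYWORDS = {"acme", "acmecorp", "#acme"}                  # invented brand terms
CUTOFF = datetime.now(timezone.utc) - timedelta(days=7)   # keep only the last week

def filter_posts(raw_posts):
    # raw_posts: an iterable of dicts such as {"user_id": ..., "timestamp": datetime, "text": str}
    for post in raw_posts:
        text = post.get("text", "").lower()
        ts = post.get("timestamp")
        if ts is None or ts < CUTOFF:
            continue                                      # drop stale or undated items
        if not any(k in text for k in KEYWORDS):
            continue                                      # drop off-topic items
        yield {"user": post.get("user_id"), "when": ts, "text": text}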

Filtering is not the only difficulty. For many or most organizations, Big Data is of such size that simply moving it from its place in the cloud into an internal data store can take far longer than the mass downloads that traditionally lock up a data warehouse for hours. In many cases, it makes more sense to query on site and then pass the much smaller result set back to the end user. And as the world of cloud computing keeps evolving, the boundary between "query on site" and "download and query" keeps changing.
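The trade-off can be sketched as two strategies. The objects and methods below are placeholders for whatever access layer is actually in place; the point is only where the bulk of the data travels.

# Two hypothetical strategies for combining cloud-resident Big Data with enterprise BI.

def download_then_query(cloud_store, local_warehouse, query):
    # Mass download: move the raw data first, then query locally.
    # The cost is dominated by moving a potentially huge raw data set.
    raw = cloud_store.export_all()
    local_warehouse.load(raw)            # the classic warehouse-locking bulk load
    return local_warehouse.run(query)

def query_on_site(cloud_store, query):
    # Push the query to where the data lives; only the small result set travels back.
    return cloud_store.run(query)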

So how are Oracle, EMC, and IBM dealing with these challenges?

Oracle: We Control Your Vertical
Reading the press releases from Oracle OpenWorld about Oracle’s approach to Big Data reminds me a bit of the old TV show The Outer Limits, which typically began with a paranoia-inducing voice intoning "We control your horizontal, we control your vertical …" as the screen began flickering.

Oracle’s announcements included specific mechanisms for mass downloads to Oracle Database data stores in Oracle appliances (Oracle Loader for Hadoop), so that Oracle Database could query the data side-by-side with existing enterprise data, complete with data-warehouse data-cleansing mechanisms.

The focus is on Oracle Exalytics BI Machine, which combines Oracle Database 11g and Oracle’s TimesTen in-memory database for additional BI scalability. In addition, there is a "NoSQL" database that claims to provide "bounded latency" (i.e., it limits the "eventual" in "eventual consistency"), although how it should combine with Oracle’s appliances was not clearly stated.

The obvious advantage of this approach is integration, which should deliver additional scalability on top of Oracle Database’s already-strong scalability. Whether that will be enough to make the huge leap to handling hundred-petabyte data stores that change frequently remains to be seen.

At the same time, these announcements implicitly suggest that Big Data should be downloaded to Oracle databases in the enterprise, or users should access Big Data via Oracle databases running in the cloud, but provide no apparent way to link cloud and enterprise data stores or BI. To put it another way, Oracle is presenting a vision of Big Data used by Oracle apps, accessed by Oracle databases using Oracle infrastructure software and running on Oracle hardware with no third party needed. We control your vertical, indeed.

What also concerns me about the company’s approach is that there is no obvious mechanism either for dealing with the lateness, unavailability, or poor quality of Big Data, or for choosing the optimal mix of cloud and in-enterprise data location. There are only hints: the bounded latency of Oracle NoSQL Database, or the claim that Oracle Data Integrator with Application Adapter for Hadoop can combine Big Data with Oracle Database data – in Oracle Database format and in Oracle Database data stores. We control your horizontal, too. But how well are we controlling it? We’ll get back to you on that.

EMC: Competing with the Big Data Boys
The recent EMC Forum in Boston in many ways delivered what I regard as very good news for the company and its customers. In the case of Big Data, its acquisition of Greenplum, with its BI capabilities, led the way. Greenplum finally appears to have provided EMC with the data management smarts it has always needed to be a credible global information management solutions vendor. In particular, Greenplum is apparently placing analytical intelligence in EMC hardware and software, giving EMC a great boost in areas such as querying within the storage device and monitoring distributed systems (such as VCE’s VBlocks) for administrative purposes. These are clearly leading-edge, valuable features.

EMC’s Greenplum showed itself to be a savvy supporter of Big Data. It supports the usual third-party suspects for BI: SQL, MapReduce, and SAS among others. Querying is "software shared-nothing", running in virtual machines on commodity VBlocks and other scale-out/grid x86 hardware. Greenplum has focused on the fast-deploy necessities of the cloud, claiming a 25-minute data-model change – something that has certainly proved difficult in large-scale data warehouses in the past.

Like Oracle, Greenplum offers "mixed" columnar and row-based relational technology; unlike Oracle, it is tuned automatically, rather than leaving it up to the customer to mix and match the two. However, its answer for combining Big Data and enterprise data is also tactically similar to Oracle’s: download into the Greenplum data store.

Of our three vendors, EMC via Greenplum has been the most concrete about what one can do with Big Data, offering specific support for combining social graphing, customer-of-one tracking of Twitter/Facebook posts, and mining of enterprise customer data. The actual demo had an unfortunate "1984" flavor, however, with a customer’s casual chat about fast cars being used to help justify doubling his car insurance rate.

The bottom line with Greenplum appears to be that its ability to scale is impressive, even with Big Data included, and it is highly likely to provide benefits out of the box to the savvy social-media analytics implementer. Still, it avoids, rather than solves, the problems of massive querying across Big Data and relational technology – it assumes that massive downloads are possible and that the data is "low latency" and "clean", when in many cases this will not be so. EMC Greenplum is not as "one-vendor" a solution as Oracle’s, but it does not have Oracle’s scalability and robustness track record, either.

IBM: Avoiding Architectural Lock-In
At first glance, IBM appears to have a range of solutions for Big Data similar to Oracle and EMC – but more of them. Thus, it has the Netezza appliance; it has InfoSphere BigInsights for querying against Hadoop; it has the ability to download data into both its Informix/in-memory technology and DB2 databases for in-enterprise data warehousing; and it offers various Smart Analytics System solutions as central BI facilities.

Along with these, it provides Master Data Management (InfoSphere MDM), data-quality features (InfoSphere DataStage), InfoSphere Streams for querying against streaming Web sensor data (like mobile GPS), and the usual quick-deployment models and packaged hardware/software solutions on its scale-out and scale-up platforms. And, of course, everyone has heard of Watson – although, intriguingly, its use cases are not yet clearly pervasive in Big Data implementations.

To my mind, however, the most significant difference in IBM’s approach to Big Data is that it offers explicit support for a wide range of ways to combine Big-Data-in-place and enterprise data in queries. For example, IBM’s MDM solution allows multiple locations for customer data and supports synchronization and replication of linked data. Effectively used, this allows users to run alerts against Big Data in remote clouds or dynamically shift customer querying between private and public clouds to maximize performance. And, of course, the remote facility need not involve an IBM database, because of the MDM solution’s cross-vendor "data virtualization" capabilities.

Even IBM’s traditional data warehousing solutions are joining the fun. The IBM IOD conference introduced the idea of a "logical warehouse", which moves away from the idea of a single system or cluster that contains the enterprise’s "one version of the truth", and towards the idea of a "truth veneer" that looks like a data warehouse from the point of view of the analytics engine but is actually multiple operational, data-warehouse, and cloud data stores. And, of course, IBM’s Smart Analytics Systems run on System x (x86), Power (RISC), and System z (mainframe) hardware.

On the other hand, there are no clear IBM guidelines for optimizing an architecture that combines traditional enterprise BI with Big Data. It gives one the strong impression that IBM is providing customers with a wide range of solutions, but little guidance as to how to use them. That IBM does not move the customer towards an architecture that may prevent effective handling of certain types of Big-Data queries is good news; that IBM does not yet convey clearly how these queries should be handled, not so much.

Composite Software and the Missing Big-Data Link
One complementary technology that users might consider is data virtualization (DV), as provided by vendors such as Composite Software and Denodo. In these solutions, subqueries on data of disparate types (such as Big Data and traditional) are optimized flexibly and dynamically, with due attention to "dirty" or unavailable data. DV solution vendors’ accrued wisdom can be simply summed up: usually, querying on-site instead of doing a mass download is better.
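A hypothetical sketch of how a DV layer treats such a federated request: push the selective subqueries to each source, keep the heavy lifting where the data lives, and join only the small result sets, while surfacing the fact that one side may be stale. The schema, objects, and flags here are invented and do not describe any particular product.

def federated_customer_view(warehouse, bigdata_store, customer_id):
    # Subquery 1: precise, current enterprise data stays in the warehouse.
    orders = warehouse.run(
        "SELECT order_id, total FROM orders WHERE customer_id = ?", (customer_id,))

    # Subquery 2: only an already-filtered slice of Big Data crosses the wire,
    # flagged so downstream logic knows it is only eventually consistent.
    posts = bigdata_store.run(
        "SELECT posted_at, sentiment FROM social_posts WHERE customer_id = ?", (customer_id,))

    return {
        "orders": orders,              # authoritative
        "social": posts,               # possibly stale or incomplete
        "social_may_be_stale": True,   # surfaced to the analyst rather than hidden
    }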

How to deal with the "temporal gap" between "eventually consistent" Big Data and "hot off the press" enterprise sales data is a matter for customer experimentation and fine-tuning, but that customer can always decide what to do with full assurance that subqueries have been optimized and data cleansed in the right way.

The Big Data Bottom Line
To me, the bottom line of all of this Big Data hype is that there is indeed immediate business value in there, and specifically in being able to go beyond the immediate customer interaction to understand the customer as a whole and over time – and thereby to establish truly win-win long-term customer relationships. Simply by looking at the social habits of key consumer "ultimate customers," as the Oracle, EMC, and IBM Big Data tools already allow you to do, enterprises of all sizes can fine-tune their interactions with the immediate customer (B2B or B2C) to be far more cost-effective.

However, with such powerful analytical insights, it is exceptionally easy to shoot oneself in the foot. Even skipping quickly over recent reports, I can see anecdotes of "data overload" that paralyze employees and projects, trigger-happy real-time inventory management that actually increases costs, unintentional breaches of privacy regulations, punitive use of newly public consumer behavior that damages the enterprise’s brand or perceived "character", and "information overload" on the part of the corporate strategist.

The common theme running through these user stories is a lack of vendor-supplied context that would allow the enterprise to understand how to use the masses of new Big Data properly, and especially an understanding of the limitations of the new data.

Thus, in the long run, the best performers will seek a Big-Data analytics architecture that is flexible, handles the limitations of Big Data as the enterprise needs them handled, and allows a highly scalable combination of Big Data and traditional data-warehouse data. So far, Oracle and EMC seem to be urging customers somewhat in the wrong direction, while IBM is providing a "have it your way" solution; but all of their solutions could benefit strongly from giving better optimization and analysis guidance to IT.

In the short run, users will do fine with a Big-Data architecture that does not provide an infrastructure support "leg" for a use case that they do not need to consider. In the long run, the lack of that "leg" may be a crippling handicap in the analytics marathon. IT buyers should carefully consider the long-run architectural plans of each vendor as they develop.

Tuesday, March 20, 2012

Agility and Sustainability: Complement or Clash?

It seems to me that both the concept of “agility” and the concept of “sustainability” have reached the point of real-world use where they will begin to intersect, especially in businesses, and where we will need to think about whether one should be subordinated to the other – because sustainability projects may detract from efforts towards agility, or agility efforts might detract from moves toward sustainability. Or, it may be that, fundamentally, the two complement each other, and can act in synergy. Since agility, at least, has proven its long-term worth in bottom-line results, and sustainability may turn out to be as necessary as the air we wish to breathe, the answer matters.

What follows is very far removed from daily reality. If business agility is fully achieved in a business – and we are so far from such a business that it is hard to say if it will ever exist – and if sustainability really takes over as a global business strategy and economic system – and we still do not know how such a system can function in the long term – then the relationship between the two will be a major issue. Still, some sort of answer would be helpful right now, so that as these two business and societal cultures lurch slowly into possible ubiquity, we can smooth the way for either/both. All we can do is paint with a broad brush; well, let’s get to it.

Setting the Stage: Similarities and Differences

Let’s start with business agility in its extreme form. Such a business (or society) is not focused on success as measured in money or power, but rather on changing rapidly and effectively. The reason this works is that all our institutions and thoughts bring us back to considering success – what we need from agility is a reward for constantly moving on from success. What grounds the agile business is constant negotiation with its customers and the environment as a whole, in which first one, then the other takes the lead in saying what to do next.

The mind shift described here is not as great as it sounds. In any given business, people want to learn, as long as it’s “safe”: it’s accepted by the people around you, people value you for being able to change as well as for specific accomplishments, you have some idea what to change to (i.e., the customer or end user gives you lots of suggestions), and the success rate of this type of thinking is high.

Such an approach has not only enormous positive side-effects on traditional but real measures of success – revenue, cost reduction, profit, customer satisfaction, power – but also great long-term psychic rewards – constant learning with constant positive feedback that what you are doing matters to other people as well as you, which can in some cases only end with the end of your life. In other words, once introduced, business agility can feed on itself – as long as it can get past the initial major mind and culture shift, which involves new metrics, new daily work routines, and baffling encounters with other parts of the company who are doing the same agility thing in a different way. But it all works out, because the work is more fun and because the new agile you starts treating those other parts of the company as customers.

Now let’s switch to the extreme form of sustainability. Wikipedia’s definition continually circles back to: preservation of a system’s ability to endure. More practically, this means: don’t put a system into such a state that it will break down. Usually, breakdown comes from scaling beyond the system’s ability to be patched, or from destroying enough of the system’s infrastructure that it can no longer function. So, in my own definition, sustainability means creating a system that will not scale beyond a certain limit, forever, and will not destroy necessary infrastructure. All that we do, all the ways that we grow and change, are forever bounded by those limits – so a sustainable system is one in which not only are limits enforced, but the system itself encourages no violation of those limits. And yet, how do you do that without eventually banning all important change?

Well, strictly speaking, you can’t. And yet, you can get so close that we might well defer the problem until the end of the universe or thereabouts. The answer is equal parts “flight to the huge” and “flight to the virtual.” By that I mean: there are enough physical resources of certain types in the universe (like sunlight) that they are beyond our capacity to exhaust for a very long time, and there is enough computer capacity even now that we can shift the things we do from physical to virtual for a very long time; and when we combine the two, we see that most other limits we are approaching can be handled for a huge amount of time by shifting them into the huge and virtual buckets as fast as we ramp up our systems.

The obvious example is putting carbon in the atmosphere. In a fairy tale world, we figure out how to capture and store sunlight with maximum effectiveness and level out its use, shifting from fossil fuels to handle our expanding use of energy as we increase in numbers and prosper. However, that’s not enough; we also need to wipe out that increase of energy usage. And so, we shift from physical to virtual, changing the content of our physical machines to include greater and greater amounts of software, avoiding the need for physical machines by accomplishing much of their tasks by software. This is not about making the physical worker useless before an omnipotent machine; this is about 6 billion people with 6 billion “virtual machines” doing more than what 5 billion people with 1 billion “virtual machines” were doing before, but with the same number of real machines and the same energy usage, ad infinitum. That’s “close enough” sustainability: you are not trapped in a pond, circling the same shores forever, but always flowing down a river that appears to be forever widening and changing, with enormous shores on either side and a riverbed that never deepens.

And now we can see the similarities and the differences between this type of sustainability and this type of agility. Here are the similarities:

• Both are all-encompassing mind sets, cultures, and processes that are clearly quite different from the ones we have now.

• Both are likely to be more effective – help us to survive more, help us to be better off physically and psychologically – than what we are doing today, in the long term.

• Both allow for growth and change – the growth and change that has brought us this far.

But, here are the differences:

• Agility is a more positive message: there’s always something to learn, something better to do; while sustainability is a more negative message: there are limits, and life is about staying within those limits.

• Agility is open and more uncertain: you assume that there will always be new things to do, but you have no idea what they are. Sustainability is closed and more certain, as the things that matter are avoiding breakdowns, which are typically bounded problems that are physical, definable, and measurable.

• Fundamentally, agility focuses on changing and ignores survivability, and so improves your chance to survive; sustainability limits change in the name of survival, and who knows what that does to your ability to grow and change?

In other words, agility and sustainability can be seen as rivals for the better business, culture, and society of the future, which can all too easily either undercut the other in their competition for being the One True Way, or fight so fiercely that neither is achieved. And, as I noted in the introduction, either kind of “clash” outcome will eventually matter a lot.

Initial Thoughts on Synergy

It seems to me that there are two approaches that could ensure that, despite this, both agility and sustainability can thrive, and perhaps even reinforce each other. I call them the tactical and the strategic.

The tactical approach tries to manage combining agility and sustainability on a case-by-case basis. This means, for example, setting the boundaries of sustainability very clearly, so that they are intuitive for all, for the business as well as the society, and enforcing them strongly when clearly violated, but allowing endless agility within those boundaries. Carbon pricing is an example of the tactical approach: if effectively administered, it constantly sets clear monetary boundaries on carbon use, within a sustainable carbon yearly limit. Do that, says the tactical approach, and then let the business be as agile as it wishes.

The strategic approach, by contrast, seeks to create an integrated mindset combining agility and sustainability. One way to do this that I view as feasible is to get people used to the idea that trying to do the impossible is not the most effective way of doing things. As I have previously written, “working smarter, not harder” comes down to “instead of trying the impossible, be smart and focus your (in this case, change ideas) on things that are possible.” Knowing short-term and long-term limits becomes an essential part of the really agile business, because it allows people to increase the number of new ideas that are really good ones. But then, the strategic approach involves yet a third mind shift, to trying to work smarter instead of harder and seeing limits as positive idea-producers; how many mind shifts can we do, anyway?

Summing Up

My overall conclusion is brief: Tentatively, I lean towards taking a strategic approach to melding agility and sustainability, and applying it wherever frictions between the two crop up; but my main conclusion is that we had better start thinking more about tactical and strategic approaches, and how to do them, right now. It may be that the successes of agility will lead to a mindset that treats sustainability as yet another source of new ideas, and so we muddle through to survival; but I wouldn’t count on it. Much better to be proactive even when agility doesn’t call for it, and think about how to combine carbon metrics and agility metrics. In the meanwhile, I’ll continue preaching agility – because it’s fun – and keeping a wary eye on sustainability – because we may need it sooner than we think.

That’s as far as I can get. Ideas, anyone?

Friday, March 16, 2012

Information Agility and Data Virtualization: A Thought Experiment

In an agile world, what is the value of data virtualization?

For a decade, I have been touting the real-world value-add of what is now called data virtualization – the ability to see disparate data stores as one, and to perform real-time data processing, reporting, and analytics on the “global” data store – in the traditional world of IT and un-agile enterprises. However, in recent studies of agile methodologies within software development and in other areas of the enterprise, I have found that technologies and solutions well-adapted to deliver risk management and new product development in the traditional short-run-cost/revenue enterprises of the past can actually inhibit the agility and therefore the performance of an enterprise seeking the added advantage that true agility brings. In other words: your six-sigma quality-above-all approach to new product development may seem as if it’s doing good things to margins and the bottom line, but it’s preventing you from implementing an agile process that delivers far more.

In a fully agile firm – and most organizations today are far from that – just about everything seems mostly the same in name and function, but the CEO or CIO is seeing it through different eyes. The relative values of people, processes, and functions change, depending no longer on their value in delivering added profitability (that’s just the result), but rather on the speed and effectiveness with which they evolve (“change”) in a delicate dance with the evolution of customer needs.

What follows is a conceptualization of how one example of such a technology – data virtualization – might play a role in the agile firm’s IT. It is not that theoretical – it is based, among other things, on findings (which I have been allowed by my employer at the time, Aberdeen Group, to cite) about the agility of the information-handling of the typical large-scale organization. I find that, if anything, data virtualization – and here I must emphasize, if it is properly employed – plays an even more vital role in the agile organization than in the traditional one.

Information Agility Matters

One of the subtle ways that the ascendance of software in the typical organization has played out is that, wittingly or unwittingly, the typical firm is selling not just applications (solutions) but also information. By this, I don’t just mean selling data about customers, the revenues on which the Facebooks of the world fatten themselves. I also mean using information to attract, sell, and follow-on sell customers, as well as selling information itself (product comparisons) to customers.

My usual example of this is Amazon’s ability to note customer preferences for certain books/music/authors/performers, and to use that information to pre-inform customers of upcoming releases, adding sales and increasing customer loyalty. The key characteristic of this type of information-based sale is knowledge of the customer that no one else can have, because it involves their sales interactions with you. And that, in turn, means that the information-led sell is on the increase, because the shelf life of application comparative advantage is getting shorter and shorter – but comparative advantage via proprietary information, properly nurtured, is practically unassailable. In the long run, applications keep you level with your competitors; information lets you carve out niches in which you are permanently ahead.

This, however, is the traditional world. What about the agile world? Well, to start with, the studies I cited earlier show that in almost any organization, information handling is not only almost comically un-agile but also almost comically ineffective. Think of information handling as an assembly line (yes, I know, but here it fits). More or less, the steps are:

1. Inhale the data into the organization (input)

2. Relate it to other data, including historical data, so the information in it is better understood (contextualize)

3. Send it into data stores so it is potentially available across the organization if needed (globalize)

4. Identify the appropriate people to whom this pre-contextualized information is potentially valuable, now and in the future, and send it as appropriate (target)

5. Display the information to the user in the proper context, including both corporate imperatives and his/her particular needs (customize)

6. Support ad-hoc additional end-user analysis, including non-information-system considerations (plan)

The agile information-handling process, at the least, needs to add one more task:

7. Constantly seek out new types of data outside the organization and use those new data types to drive changes in the information-handling process – as well, of course, as in the information products and the agile new-information-product-development processes (evolve).

The understandable but sad fact is that in traditional IT and business information-handling, information “leaks” and is lost at every stage of the process – not seriously if we only consider any one of the stages, but (by self-reports that seem reasonable) yielding loss of more than 2/3 of actionable information by the end of steps one through six. And “fixing” any one stage completely will only improve matters by about 11% – after which, as new types of data arrive, that stage typically begins leaking again.

Now add agile task 7, and the situation is far worse. By self-reports three years ago (my sense is that things have only marginally improved), users inside an organization typically see important new types of data on average ½ year or more after they surface on the Web. Considering the demonstrated importance of things like social-media data to the average organization, as far as I’m concerned, that’s about the same as saying that one-half the information remaining after steps one to six lacks the up-to-date context to be truly useful. And so, the information leading to effective action that is supplied to the user is now about 17% of the information that should have been given to that user.
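For what it is worth, the arithmetic behind those percentages is easy to check. The sketch below simply assumes the leakage is spread roughly evenly across the six traditional stages; the figures themselves come from the self-reports cited above.

# Back-of-the-envelope reading of the self-reported figures.
loss_through_six_stages = 2 / 3                   # "more than 2/3 ... lost" by the end of step six
surviving = 1 - loss_through_six_stages           # roughly a third reaches the user

per_stage_gain_if_fixed = loss_through_six_stages / 6
print(f"fixing one stage completely buys back about {per_stage_gain_if_fixed:.0%}")  # ~11%

stale_fraction = 0.5                              # half of what survives lacks current context
useful = surviving * (1 - stale_fraction)
print(f"actionable, current information delivered: about {useful:.0%}")              # ~17%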

In other words, task 7 is the most important task of all. Stay agilely on top of the Web information about your customers or key changes in your regulatory, economic, and market environment, and it will be almost inevitable that you will find ways to improve your ability to get that information to the right users in the right way – as today’s agile ad-hoc analytics users (so-called business analysts) are suggesting is at least possible. Fail to deliver on task 7, and the “information gap” between you and your customers will remain, while your information-handling process falls further and further out of sync with the new mix of data, only periodically catching up again, and meanwhile doubling down on the wrong infrastructure. Ever wonder why it’s so hard to move your data to a public cloud? It’s not just the security concerns; it’s all that sunk cost in the corporate data warehouse.

In other words, in information agility, the truly agile get richer, and the not-quite-agile get poorer. Yes, information agility matters.

And So, Data Virtualization Matters

It is one of the oddities of the agile world that “evolution tools” such as more agile software development tools are not less valuable in a methodology that stresses “people over processes”, but more valuable. The reason why is encapsulated in an old saying: “lead, follow, or get out of the way.” In a world where tools should never lead, and tools that get out of the way are useless, evolution tools that follow the lead of the person doing the agile developing are priceless. That is, short of waving a magic wand, they are the fastest, easiest way for the developer to get from point A to defined point B to unexpected point C to newly discovered point D, etc. So one should never regard a tool such as Data Virtualization as “agile”, strictly speaking, but rather as the best way to support an agile process. Of course, if most of the time a tool is used that way, I’m fine with calling it “agile.”

Now let’s consider data virtualization as part of an agile “new information product development” process. In this process, the corporation’s traditional six-step information-handling process is the substructure (at least for now) of a business process aimed at fostering not just better reaction to information but also better creation and fine-tuning of combined app/information “products.”

Data virtualization has three key time-tested capabilities that can help in this process. First is auto-discovery. From the beginning, one of the nice side benefits of data virtualization has been that it will crawl your existing data stores – or, if instructed, the Web – and find new data sources and data types, fit them into an overall spectrum of data types (from unstructured to structured), and represent them abstractly in a metadata repository. In other words, data virtualization is a good place to start proactively searching out key new information as it sprouts outside “organizational boundaries”, because DV vendors have the longest experience and the best “best practices” at doing just that.
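A toy sketch of the auto-discovery idea: crawl a set of sources, introspect what each exposes, and register anything new in a simple metadata repository. Real DV products do far more (type inference, lineage, scheduling), and every name below is invented for illustration.

def discover(sources, repository):
    # sources: iterable of objects exposing .name, .kind, and .describe()
    #          (describe() returning something like {"posted_at": "timestamp"})
    # repository: a plain dict standing in for the metadata repository
    for src in sources:
        entry = repository.setdefault(src.name, {"kind": src.kind, "fields": {}})
        for field, dtype in src.describe().items():
            if field not in entry["fields"]:
                entry["fields"][field] = dtype        # a newly discovered field/data type
                print(f"discovered {src.name}.{field}: {dtype}")
    return repository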

Data virtualization’s second key agile-tool capability is global contextualization. That is, data virtualization is not a cure-all for understanding all the relationships between an arriving piece of data and the data already in those data stores; but it does provide the most feasible way of pre-contextualizing today’s data. A metadata repository has been proven time and again to be the best real-world compromise between over-abstraction and zero context. A global metadata repository can handle any data type thrown at it. For global contextualization, a global metadata repository that has in it an abstraction of all your present-day key information is your most agile tool. It does have some remaining problems with evolving existing data types; but that’s a discussion for another day, not important until our information agility reaches a certain stage that is still far ahead.

Data virtualization’s third key agile-tool capability is the “database veneer.” This means that it allows end users, applications, and (although this part has not been used much) administrators to act as if all enterprise data stores were one gigantic data store, with near-real-time data access to any piece of data. It is a truism in agile development that the more high-level the covers-all-cases agile software on which you build, the more rapidly and effectively you can deliver added value. The database veneer that covers all data types, including ones that are added over time, means more agile development of both information and application products on top of the veneer. Again, as with the other two characteristics, data virtualization is not the only tool out there to do this; it’s just the one with a lot of experience and best practices to start with.
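As a deliberately small, runnable analogy for the veneer, the sketch below uses SQLite's ATTACH to make two separate stores answer a single join as if they were one database. No DV product works this way internally, and the file names and schema are invented; the point is only what the veneer looks like to the query writer.

import sqlite3

# Build two separate "stores" (illustrative file names and schema).
sales = sqlite3.connect("sales.db")
sales.execute("CREATE TABLE IF NOT EXISTS orders (customer_id INT, total REAL)")
sales.execute("INSERT INTO orders VALUES (1, 99.50)")
sales.commit(); sales.close()

social = sqlite3.connect("social.db")
social.execute("CREATE TABLE IF NOT EXISTS posts (customer_id INT, sentiment TEXT)")
social.execute("INSERT INTO posts VALUES (1, 'positive')")
social.commit(); social.close()

# The "veneer": both stores now look like one database to the query.
veneer = sqlite3.connect("sales.db")
veneer.execute("ATTACH DATABASE 'social.db' AS social")
rows = veneer.execute("""
    SELECT o.customer_id, o.total, p.sentiment
    FROM orders o JOIN social.posts p ON p.customer_id = o.customer_id
""").fetchall()
print(rows)   # e.g. [(1, 99.5, 'positive')]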

And, for some readers, that will bring up the subject of master data management, or MDM. In some agile areas, MDM is the logical extension of data virtualization, because it begins to tackle the problems of evolving old data types and integrating new ones in a particular use case (typically, customer data) – which is why it offers effectively identical global contextualization and database-veneer capabilities. However, because it does not typically have auto-discovery and is usually focused on a particular use case, it isn’t as good an agile-tool starting point as data virtualization. In agile information-product evolution, the less agile tool should follow the lead of the more agile one – or, to put it another way, a simple agile tool that covers all cases trumps an equivalently simple agile tool that only covers some.

So, let’s wrap it up: You can use data virtualization, today, in a more agile fashion than other available tools, as the basis of handling task 7, i.e., use it to auto-discover new data types out on the Web and integrate them continuously with the traditional information-handling process via global contextualization and the database veneer. This, in turn, will create pressure on the traditional process to become more agile, as IT tries to figure out how to “get ahead of the game” rather than drowning in a continuously-piling-up backlog of new-data-type requests. Of course, this approach doesn’t handle steps 4, 5, and 6 (target, customize, and plan); for those, you need to look to other types of tools – such as (if it ever grows up!) so-called agile BI. But data virtualization and its successors operate directly on task 7, directly following the lead of an agile information-product-evolution process. And so, in the long run, since information agility matters a lot, so does data virtualization.

Some IT-Buyer Considerations About Data Virtualization Vendors

To bring this discussion briefly down to a more practical level, if firms are really serious about business agility, then they should consider the present merits of various data virtualization vendors. I won’t favor one over another, except to say that I think that today’s smaller vendors – Composite Software, Denodo, and possibly Informatica – should be considered first, for one fundamental reason: they have survived over the last nine years in the face of competition from IBM among others.

The only way they could have done that, as far as I can see, is to constantly stay one step ahead in the kinds of data types that customers wanted them to support effectively. In other words, they were more agile – not necessarily very agile, but more agile than an IBM or an Oracle or a Sybase in this particular area. Maybe large vendors’ hype about improved agility will prove to be more than hype; but, as far as I’m concerned, they have to prove it to IT first, and not just in the features of their products, but also in the agility of their data-virtualization product development.

The Agile Buyer’s Not So Bottom Line

Sorry for the confusing heading, but it’s another opportunity to remind readers that business agility applies to everything, not just software development. Wherever agility goes, it is likely that speed and effectiveness of customer-coordinated change are the goals, and improving top and bottom lines the happy side-effect. And so with agile buyers of agile data-virtualization tools for information agility – look at the tool’s agility first.

To recap: the average organization badly needs information agility. Information agility, as in other areas, needs tools that “follow” the lead of the person(s) driving change. Data virtualization tools are best suited for that purpose as of now, both because they are best able to form the core of a lead-following information-product-evolution toolset, and because they integrate effectively with traditional information-handling and therefore speed its transformation into part of the agile process. Today’s smaller firms, such as Composite Software and Denodo, appear to be farther along the path to lead-following than other alternatives.

Above all: just because we have been focused on software development and related organizational functions, that doesn’t mean that information agility isn’t equally important. Isn’t it about time that you did some thought experiments of your own about information agility? And about data virtualization’s role in information agility?