Monday, June 30, 2008

Data Virtualization: It Grows on You

When I first heard the term “data virtualization” from Composite Software, I admit, I was skeptical. Let’s face it, virtualization is one of the all-inclusive marketing terms of our time; it seems like everything is being labeled as virtualized these days. But when I sat down and tried to make sense of all of the “virtual” technologies I knew, data virtualization fit in neatly. In fact, it’s a logical extension of virtualization technology. And therefore, like other virtualization technologies, it has definite and distinct benefits to the user.

Let’s start with my definition. A “virtual” technology is one that detaches the logical scope of a system from its physical scope. More concretely, virtual memory, the virtual machine, virtual storage, etc. all make a subset or a superset of a physical system or resource, such as a computer, appear to be one system or resource. Virtualization is the process of changing from the old physical way of seeing and handling a system to the “virtual” way.

Now note that all virtual technologies can either split up a single system or combine multiple systems, and that they can involve data or processes. For example, the virtual-machine meaning of virtualization (now its most common meaning) splits up the processes on a single machine into multiple virtual machines. The virtual-memory meaning of virtualization pretends that a small main memory (data) is really a large storage device. The storage virtualization meaning treats multiple systems (really data systems attached to multiple computer systems) as if they were one huge storage device. In fact, clustering and network file systems (really another form of virtualization) allow you to treat multiple systems (processes) as if they were one computer.

Here’s a neat little table that captures how all the forms of virtualization that are being talked about these days fit into a nice neat framework (all right, it’s not really that neat in real life, but it’s a very good starting point).

           Split Single System                                      Combine Multiple Systems
Data       Virtual memory, virtual disk                             Storage virtualization, data virtualization
Process    Virtual machine, app/desktop/profile virtualization      Clustering, network file systems

Note that application, desktop, and profile virtualization may involve putting the virtual environment on another computer; but it’s still representing a part of a computer, not the whole thing.

Now all forms of virtualization share a common benefit of flexibility to users, but each category has specific benefits of its own. The benefit of single-system virtualization is a very good tradeoff of performance for cost: for much less cost and a small decrease in performance, you can substitute disk for main memory (virtual memory, typically operating at 90% of the performance of an equivalent all-main-memory machine), or consolidate all your applications on one machine (virtual-machine virtualization).

The benefits of multiple-system virtualization still tend to be performance vs. cost, in a more complicated fashion. Storage virtualization makes administration easier, and administrators cost money. Clustering simplifies the programmer’s and system administrator’s life, as well as allowing failover that makes systems more robust; the tradeoff is less scalable performance (generally somewhere between 60% and 80% increase in single-system-equivalent performance per system added, vs. up to 100% per processor added in symmetric multiprocessing for up to 30 or more processors). Meanwhile, virtualization of processes, single-system or multiple-system, delivers flexibility to the system administrator to allocate resources and manage workloads; while virtualization of data makes the programmer’s job easier as well as easing the storage or database administrator’s burden.
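To make the scaling tradeoff above concrete, here is a minimal arithmetic sketch. It assumes a simple linear model in which the first system counts as 1.0 and every system added contributes a fixed fraction of a single system's performance; the function name `cluster_capacity` and the model itself are illustrative assumptions, not measurements. The 100% case is included only as the upper bound quoted for symmetric multiprocessing, which is counted per processor rather than per system.

```python
def cluster_capacity(nodes: int, gain_per_added_node: float) -> float:
    """Capacity in single-system equivalents under an assumed linear model.

    The first node counts as 1.0; each additional node contributes
    `gain_per_added_node` (for example 0.60 for a 60% gain per system added).
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return 1.0 + (nodes - 1) * gain_per_added_node


for nodes in (2, 4, 8):
    low = cluster_capacity(nodes, 0.60)    # pessimistic clustering figure
    high = cluster_capacity(nodes, 0.80)   # optimistic clustering figure
    ideal = cluster_capacity(nodes, 1.00)  # upper bound quoted for SMP
    print(f"{nodes} systems: cluster {low:.1f}x to {high:.1f}x, SMP up to {ideal:.1f}x")
```

Under these assumptions, an eight-system cluster lands somewhere between roughly 5.2 and 6.6 single-system equivalents, against an upper bound of 8 for perfectly linear scaling.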

Viewed in this way, the definition of data virtualization is straightforward: it is the implementation of a multiple-system data-source veneer that lets programmers, administrators, and users all view multiple data sources as one. And, likewise, the benefits of data virtualization flow from its being virtualization and its being multiple-system/data-oriented. That is, there may be a decrease in query performance (but maybe not; some IBM studies show that a tuned Enterprise Information Integration [EII] engine can actually deliver better performance on cross-database queries, by load balancing effectively). However, you will also give the administrator a simpler view of the “data archipelagoes” typical within an enterprise; you will give the programmer “one transactional interface to choke”, just as SOAs give the programmer “one application interface”; and you will give users, ultimately, a broader access to data. Ultimately, that adds up to more flexibility and less cost.
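For readers who want to see what a “veneer” over multiple data sources looks like in miniature, here is a hedged sketch using Python's built-in `sqlite3` module. Two separate in-memory SQLite databases stand in for two enterprise data sources, and a temporary view presents them as one. The schema names (`crm`, `billing`), tables, and sample rows are all invented for illustration; a real EII engine federates heterogeneous sources and is far more involved than this.

```python
import sqlite3

# Two independent in-memory databases stand in for separate data sources.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)
conn.executemany(
    "INSERT INTO billing.invoices VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 75.0)],
)

# The "veneer": a temporary view that spans both sources, so callers
# query one logical table without knowing where each column lives.
conn.execute(
    """
    CREATE TEMP VIEW customer_totals AS
    SELECT c.name AS name, SUM(i.amount) AS total
    FROM crm.customers AS c
    JOIN billing.invoices AS i ON i.customer_id = c.id
    GROUP BY c.name
    """
)

rows = conn.execute(
    "SELECT name, total FROM customer_totals ORDER BY name"
).fetchall()
print(rows)
```

The programmer here issues one query against `customer_totals`; which physical database holds customers and which holds invoices is a detail hidden behind the view.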

Like other forms of virtualization, data virtualization is not a pipe dream; but it’s not easy to get there, either. The basics are an enterprise metadata repository and a “veneer” that makes multiple data sources look like one system. In various forms, EII, the data warehouse, master data management (MDM), and repository products give you these basics; but it is extraordinarily hard for a solution to get its hands around all the data in the system, its copies, and its relationships, much less to adjust semi-automatically to all the new types of data that are coming in and to the important data outside the enterprise on the semantic Web.

However, just as the other forms of virtualization are useful even though they are not implemented company-wide, so is data virtualization. Within a data center, within a line of business, or just applied to the company’s most important data (as in master data management), data virtualization can deliver its benefits in a targeted fashion – as EII and MDM have already shown.

So yes, data virtualization may be marketing hype, but it also represents an important new body of products with important benefits to IT and to the business as a whole. Later, I hope to get down to the details.

Friday, June 20, 2008

Where are the EII Tools of Yesteryear?

All in all, it seems bizarre to me to realize that the old “pure” EII (Enterprise Information Integration) vendors are no longer thick on the ground. It was only 6 years ago that I first discovered EII tools and issued my first report – an extremely short half-life for a technology. And yet, of the existing or nascent EII tools then, Metamatrix has gone to Red Hat, Avaki to Sybase, Venetica to IBM, and another whose name I no longer recall was folded into BEA as Liquid Data and is now being re-folded into Oracle. Meanwhile, IBM has renamed Information Integrator as Federation Server, and rarely mentions EII. Of the oldies, only Composite Software and Attunity remain proudly independent.

And yet, the picture is not as bleak as it appears at first glance. Composite Software and Attunity are still small (under $100M in revenues), but continue to survive and even thrive. Ipedo and Denodo, newer entrants with interesting differentiators (Ipedo with its XML database, Denodo with its Web-search capabilities), are likewise out there. In fact, Stylus Studio with its EII support for programmers and Silver Creek Systems with its “missing link for EII” appear to have entered the fray recently.

Just as important, many of the infrastructure-software companies that have acquired EII vendors have come to realize that EII is a positive thing for customers and should be kept as a distinct product, not treated as another component to fold into a product or hide under the umbrella of an overall strategy. Sybase, especially, has been unusually active in positioning its EII solution as a “key component” of its main data-integration product suite.

However, it is also fair to say that EII’s marketing is not all it should be. A Google search turned up very few of these tools; in the advertising on the right-hand side, several of the companies were not in EII at all (SAS? I don’t think so); and Sybase’s webcast on the subject was no longer available.

What’s going on? I would suggest that one of the key reasons for EII’s mixed picture of success is that vendors have begun doing what they should have done three years ago: leverage EII tools’ metadata repositories as separate products/value-add. Metadata management is now In, to the point that even Microsoft and EMC are now talking about enterprise-wide metadata repositories (and realizing just how tough the job is). In other words, some of the revenue and visibility that are derived from EII products are now showing up under the heading of master data management and composite apps (along the lines of Denodo’s “data mashups”) instead of being credited to EII.

This is not to say that EII is the only source of an enterprise metadata repository; EAI (Enterprise Application Integration) tools like IBM’s Ascential, data-search tools like EMC’s Tablus, and “pure” repository plays like Syspedia are other tools that can be used as the foundation of an enterprise-metadata-repository product. Still, I would argue that EII tools have virtues that none of the others share, or at least not to the same extent. Specifically, creating a good repository is key to the success of a good EII implementation; such repositories are often used by querying apps, so EII tools have to optimize query performance; and EII tools support both automatic discovery of a wide variety of automated data sources and hand-definition of data relationships and of data with different names but the same meaning (“customer ID” or “buyer name”). Contrast this with EAI, which is usually restricted to a narrower range of structured data, is often handed the metadata rather than discovering it, does not have to consider real-time performance in queries, and often doesn’t provide support for the kind of data reconciliation involved in master data management. I was reminded of this recently, when I talked to Composite Software about a new Composite Discovery product whose foundation is an innovative way of discovering “probable” relationships between data items/records across data sources.
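The hand-definition of “different names, same meaning” can be pictured with a very small sketch. What follows is not how any EII product stores its metadata; the `CANONICAL_FIELDS` mapping, the source names `crm` and `orders`, and the `to_canonical` helper are all hypothetical, and only the two field names come from the example above.

```python
from typing import Dict

# Hypothetical hand-defined metadata: canonical field -> local name per source.
CANONICAL_FIELDS: Dict[str, Dict[str, str]] = {
    "customer": {"crm": "customer ID", "orders": "buyer name"},
}


def to_canonical(source: str, record: Dict[str, object]) -> Dict[str, object]:
    """Rename a source record's fields to their canonical names.

    Fields the repository does not know about pass through unchanged.
    """
    local_to_canonical = {
        names[source]: canonical
        for canonical, names in CANONICAL_FIELDS.items()
        if source in names
    }
    return {local_to_canonical.get(key, key): value for key, value in record.items()}


crm_row = to_canonical("crm", {"customer ID": "C-17", "region": "EMEA"})
order_row = to_canonical("orders", {"buyer name": "C-17", "total": 250})
print(crm_row)
print(order_row)
```

Once both records carry the same canonical field, a query or a master-data-management process can match them without caring what each source originally called the column.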

So the EII market is doing quite well, thank you, even if many of its pioneers have been folded into larger companies. However, EII remains under-utilized by large infrastructure-software vendors, not because their users don’t appreciate their EII implementations, but because vendors aren’t giving visibility to their EII technology’s usefulness in larger projects such as master data management and composite-application development. Where are the EII tools of yesteryear? Doing well; but not visibly enough.