When I first heard the term “data virtualization” from Composite Software, I admit, I was skeptical. Let’s face it, virtualization is one of the all-inclusive marketing terms of our time; it seems like everything is being labeled as virtualized these days. But when I sat down and tried to make sense of all of the “virtual” technologies I knew, data virtualization fit in neatly. In fact, it’s a logical extension of virtualization technology. And therefore, like other virtualization technologies, it has definite and distinct benefits to the user.
Let’s start with my definition. A “virtual” technology is one that detaches the logical scope of a system from its physical scope. More concretely, virtual memory, the virtual machine, virtual storage, and the like all make a subset or superset of a physical system or resource, such as a computer, appear as a single system or resource. Virtualization is the process of changing from the old physical way of seeing and handling a system to the “virtual” way.
Now note that all virtual technologies can either split up a single system or combine multiple systems, and that they can involve data or processes. For example, the virtual-machine meaning of virtualization (now its most common meaning) splits up the processes on a single machine into multiple virtual machines. The virtual-memory meaning of virtualization makes a small main memory (data), backed by disk, appear to be a much larger memory. The storage-virtualization meaning treats multiple systems (really data systems attached to multiple computer systems) as if they were one huge storage device. In fact, clustering and network file systems (really another form of virtualization) allow you to treat multiple systems (processes) as if they were one computer.
Here’s a neat little table that captures how all the forms of virtualization being talked about these days fit into a single framework (all right, it’s not really that neat in real life, but it’s a very good starting point).
            Split Single System                      Combine Multiple Systems
Data:       Virtual memory, virtual disk             Storage virtualization, data virtualization
Process:    Virtual machine; app/desktop/profile v.  Clustering, network file systems
Note that application, desktop, and profile virtualization may involve putting the virtual environment on another computer; but it’s still representing a part of a computer, not the whole thing.
Now all forms of virtualization share a common benefit of flexibility to users, but each category has specific benefits of its own. The benefit of single-system virtualization is a very good tradeoff of performance for cost: for much less cost and a small decrease in performance, you can substitute disk for main memory (virtual memory, typically operating at 90% of the performance of an equivalent all-main-memory machine), or consolidate all your applications on one machine (virtual-machine virtualization).
The benefits of multiple-system virtualization still tend to be performance vs. cost, in a more complicated fashion. Storage virtualization makes administration easier, and administrators cost money. Clustering simplifies the programmer’s and system administrator’s life, as well as allowing failover that makes systems more robust; the tradeoff is less scalable performance (generally somewhere between 60% and 80% increase in single-system-equivalent performance per system added, vs. up to 100% per processor added in symmetric multiprocessing for up to 30 or more processors). Meanwhile, virtualization of processes, single-system or multiple-system, delivers flexibility to the system administrator to allocate resources and manage workloads; while virtualization of data makes the programmer’s job easier as well as easing the storage or database administrator’s burden.
Viewed in this way, the definition of data virtualization is straightforward: it is the implementation of a multiple-system data-source veneer that lets programmers, administrators, and users all view multiple data sources as one. And, likewise, the benefits of data virtualization flow from its being virtualization and its being multiple-system/data-oriented. That is, there may be a decrease in query performance (but maybe not; some IBM studies show that a tuned Enterprise Information Integration [EII] engine can actually deliver better performance on cross-database queries, by load balancing effectively). However, you will also give the administrator a simpler view of the “data archipelagoes” typical within an enterprise; you will give the programmer “one transactional interface to choke”, just as SOAs give the programmer “one application interface”; and you will give users, ultimately, a broader access to data. Ultimately, that adds up to more flexibility and less cost.
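The “veneer” idea can be made concrete with a small sketch. The class name, table layout, and fan-out-and-merge strategy below are hypothetical illustrations of federated querying, not the API of any particular EII product; real engines also push down predicates, load-balance, and reconcile schemas, which this sketch omits.

```python
# A minimal sketch of a data-virtualization "veneer": one query interface
# over multiple physical data sources. Names here are illustrative only.
import sqlite3


class DataVeneer:
    """Presents several independent databases as one logical data source."""

    def __init__(self, connections):
        self.connections = connections  # the underlying physical sources

    def query(self, sql):
        # Fan the same query out to every source and merge the rows,
        # hiding the physical split from the caller.
        rows = []
        for conn in self.connections:
            rows.extend(conn.execute(sql).fetchall())
        return rows


# Two separate physical "systems" -- think regional customer databases.
east = sqlite3.connect(":memory:")
west = sqlite3.connect(":memory:")
for db, names in ((east, ["Alice", "Bob"]), (west, ["Carol"])):
    db.execute("CREATE TABLE customers (name TEXT)")
    db.executemany("INSERT INTO customers VALUES (?)", [(n,) for n in names])
    db.commit()

veneer = DataVeneer([east, west])
print(sorted(r[0] for r in veneer.query("SELECT name FROM customers")))
```

To the caller there is one `customers` source, even though the rows live in two physically separate databases; that single point of access is what gives the programmer “one interface to choke” and the user broader reach.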
Like other forms of virtualization, data virtualization is not a pipe dream; but it’s not easy to get there, either. The basics are an enterprise metadata repository and a “veneer” that makes multiple data sources look like one system. In various forms, EII, the data warehouse, master data management (MDM), and repository products give you these basics; but it is extraordinarily hard for a solution to get its hands around all the data in the system, its copies, and its relationships, much less to adjust semi-automatically to all the new types of data coming in and to the important data outside the enterprise on the semantic Web.
However, just as the other forms of virtualization are useful even though they are not implemented company-wide, so is data virtualization. Within a data center, within a line of business, or just applied to the company’s most important data (as in master data management), data virtualization can deliver its benefits in a targeted fashion – as EII and MDM have already shown.
So yes, data virtualization may be marketing hype, but it also represents an important new body of products with important benefits to IT and to the business as a whole. Later, I hope to get down to the details.