Tuesday, July 22, 2008

Data Virtualization in the Real World

In a previous piece, I noted that data virtualization (a) is a logical extension of the idea of virtualization, (b) has definite benefits to IT and to the enterprise, and (c) is implementable, but not easily. The basics, I noted, are an enterprise metadata repository and a “veneer” that makes multiple data sources look like one system. The end product is infrastructure software that makes data, presented as information (that is, in an understandable and usable non-physical format, such as a database), appear as if it were in one cross-system data store. So what is out there today to create a data virtualization solution, and how does it need to be improved?
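To make the “veneer” idea concrete, here is a minimal sketch in Java of what such a layer looks like to an application. Every name in it is invented for illustration; no vendor exposes exactly this interface.

    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch: one query interface fronting many physical
    // data sources. The metadata repository, consulted behind the
    // scenes, maps each virtual table to the sources that hold it.
    public interface VirtualDataLayer {
        // Clients write one query; the layer decides which physical
        // sources to touch and how to merge the results.
        List<Map<String, Object>> query(String virtualSql);

        // Exposes the metadata repository's view of where a virtual
        // table's data actually lives.
        List<String> sourcesFor(String virtualTable);
    }

The point is that the application sees one logical data store; everything below the interface is the virtualization infrastructure’s problem.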

Today’s key data-virtualization technologies are Enterprise Information Integration (EII), the data warehouse, master data management (MDM), and enterprise metadata management (EMM). Each now has solutions creating value inside real-world enterprises, and each has its pros and cons. Let’s examine each in turn.

EII products provide a SQL/XQuery veneer for real-time database transactions that cross multiple data sources, including relational data, content, and the Web. To do this, they typically auto-discover data sources of these types and store the results in a metadata repository, supplemented by the enterprise’s knowledge of data semantics, copies, and relationships. The pros of using EII as the basis for a data virtualization solution are its strength in gathering metadata, its support for real-time transactions, and its flexibility in handling a wide range of data types. The cons are its typical use for specific projects (which means added work extending it to enterprise scope) and its weaker enforcement of corporate standards than MDM provides. Useful EII tools for data virtualization include IBM’s WebSphere Federation Server, Oracle/BEA’s Liquid Data for WebLogic, Sybase’s Avaki tool, Composite Software’s and Attunity’s products, Ipedo XIP (which also includes a database), and Red Hat/MetaMatrix.
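To see what the EII veneer buys you, consider this Java/JDBC sketch of a single SQL statement that joins data living in two different back-end systems. It assumes a federation server has already been configured to expose the remote tables under the names CRM.CUSTOMERS and ERP.ORDERS; the JDBC URL, credentials, and table names are all hypothetical.

    import java.sql.*;

    public class FederatedQuery {
        public static void main(String[] args) throws SQLException {
            // Hypothetical connection to a federation server's database.
            try (Connection con = DriverManager.getConnection(
                     "jdbc:db2://fedserver:50000/FEDDB", "user", "pass");
                 Statement st = con.createStatement();
                 // One statement joins two back ends; the federation
                 // server decomposes it and pushes the pieces down to
                 // the underlying CRM and ERP systems.
                 ResultSet rs = st.executeQuery(
                     "SELECT c.name, SUM(o.amount) " +
                     "FROM CRM.CUSTOMERS c JOIN ERP.ORDERS o " +
                     "ON c.id = o.customer_id " +
                     "GROUP BY c.name")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + ": " + rs.getDouble(2));
                }
            }
        }
    }

The application code is ordinary JDBC; all the cross-source work happens below the SQL veneer, which is exactly the point.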

Originally, data warehouses were receptacles for all corporate data, replicated periodically (with data cleansing and merging along the way) from multiple OLTP (online transaction processing) databases, with business-intelligence queries run against the days-old merged data. Recently, the lines have blurred: data warehouses can now query operational and content databases as well as the warehouse itself, and can refresh in such a way that queries run against “near real-time” data. The pros of using today’s data warehouse for data virtualization are unmatched data quality and a wealth of available applications that can use the virtualized data. The cons are that the data warehouse is getting farther and farther away from storing all of an organization’s data, and therefore may not fit some data-virtualization needs; that the sheer size of the data warehouse may slow both queries and update-type transactions against the virtualized data; and that a data warehouse typically does no auto-discovery of metadata. Useful data warehouses for data virtualization include the IBM, HP, Teradata, Microsoft, Oracle, and Sybase databases.
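For contrast, here is a bare-bones Java sketch of the classic refresh cycle described above: extract recently changed rows from an OLTP source, apply a (trivial) cleansing step, and load them into a warehouse fact table. The table names, connection details, and date predicate (which is dialect-specific) are all hypothetical; real ETL tools add staging, change-data capture, and error handling.

    import java.sql.*;

    public class NightlyRefresh {
        // Extract the last day's changed rows, cleanse them, and load
        // them into the warehouse fact table.
        public static void refresh(Connection oltp, Connection dw)
                throws SQLException {
            String extract =
                "SELECT order_id, customer_name, amount FROM orders " +
                "WHERE updated_at >= CURRENT_DATE - 1 DAY";  // dialect-specific
            String load =
                "INSERT INTO fact_orders (order_id, customer_name, amount) " +
                "VALUES (?, ?, ?)";
            try (Statement src = oltp.createStatement();
                 ResultSet rs = src.executeQuery(extract);
                 PreparedStatement ins = dw.prepareStatement(load)) {
                while (rs.next()) {
                    ins.setLong(1, rs.getLong(1));
                    // Cleansing in miniature: normalize whitespace and case.
                    ins.setString(2, rs.getString(2).trim().toUpperCase());
                    ins.setDouble(3, rs.getDouble(3));
                    ins.addBatch();
                }
                ins.executeBatch();
            }
        }
    }

The gap between this batch cycle and true real time is precisely the refresh latency that today’s “near real-time” warehouses are attacking.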

The MDM field has gone from a gleam in vendors’ eyes to a focus of enterprise attention in about two years, but it is still not a mature technology. The idea is to select key enterprise data and provide a “single view” of it via a transactional veneer and cross-database synchronization, typically involving a cross-database metadata repository. MDM accommodates a range of implementation architectures, from EII-like (no data moves around) to data-warehouse-like (all the data is replicated to a central repository). Moreover, MDM typically builds on top of EII, EAI (Enterprise Application Integration), and data-warehouse ETL (extract-transform-load) technologies. The pros of MDM are that it delivers virtualization of the data that really matters to the organization; that its architecture is flexible and can therefore be fine-tuned for performance; and that it improves data quality in a way that gets the enterprise to buy into IT’s data-virtualization effort. The cons are that it typically does not handle some types of data virtualization (such as local data in different databases, or content-heavy data); that its auto-discovery capabilities may not be as extensive as EII’s (although it is improving in that area); and that, as a relatively immature technology, it will typically incur high service costs for vendor hand-holding. MDM tools now include IBM MDM Server, Sun’s MDM Suite, Microsoft’s “Bulldog” MDM project, Kalido MDM, Software AG’s webMethods Master Data Manager, and Oracle Master Data Management Suite.
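The “single view” plus cross-database synchronization idea can be sketched in a few lines of Java. The hub below holds the golden record and pushes an accepted change back out to every participating system; all the names are invented, and a real MDM hub adds matching, survivorship rules, and history that this toy omits.

    import java.util.List;

    public class CustomerHub {
        // The hub's "golden record" for a customer: the single view.
        public record GoldenCustomer(String masterId, String name, String address) {}

        // Each participating database adapts its local copy to the hub's view.
        public interface SourceSystem {
            void applyUpdate(GoldenCustomer golden);
        }

        private final List<SourceSystem> sources;

        public CustomerHub(List<SourceSystem> sources) {
            this.sources = sources;
        }

        // A change accepted at the hub becomes the system of record;
        // cross-database synchronization brings every source into line.
        public void updateAddress(GoldenCustomer c, String newAddress) {
            GoldenCustomer updated =
                new GoldenCustomer(c.masterId(), c.name(), newAddress);
            for (SourceSystem s : sources) {
                s.applyUpdate(updated);
            }
        }
    }

Whether each source keeps a full local copy (warehouse-like) or the hub merely keeps pointers to data left in place (EII-like) is exactly the architectural dial described above.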

Finally, enterprise metadata management (EMM) tools are the most immature of all the data-virtualization technologies. Here, the focus is on creating a repository of metadata for all (or most) enterprise data, with less emphasis on supplying tools for using that repository. The pros of EMM tools appear to be their strength in auto-discovery and their savvy in representing metadata at the business level. The cons are the immaturity of the technology and the lack of tools for transactional use of the repository. While a few smaller-vendor products, from the likes of Adaptive, have been out for a while, the larger vendors’ products (e.g., from IBM and CA, and Oracle/BEA’s AquaLogic Enterprise Repository) are pretty new to the market.
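Auto-discovery, the EMM strong suit, is easy to illustrate because JDBC exposes catalog metadata directly. The sketch below walks one source’s tables and columns via the standard java.sql.DatabaseMetaData calls and just prints what it finds; a real EMM tool would write to a repository and layer business-level names and semantics on top.

    import java.sql.*;

    public class MetadataCrawler {
        // Discover every table and column visible through one connection.
        public static void crawl(Connection source) throws SQLException {
            DatabaseMetaData md = source.getMetaData();
            try (ResultSet tables =
                     md.getTables(null, null, "%", new String[] {"TABLE"})) {
                while (tables.next()) {
                    String table = tables.getString("TABLE_NAME");
                    System.out.println("table: " + table);
                    try (ResultSet cols = md.getColumns(null, null, table, "%")) {
                        while (cols.next()) {
                            System.out.println("  " + cols.getString("COLUMN_NAME")
                                + " : " + cols.getString("TYPE_NAME"));
                        }
                    }
                }
            }
        }
    }

Run this against each data source in turn and you have the raw material of an enterprise metadata repository; the hard part, as noted, is the business-level representation on top.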

The point of this survey is that while it is hard to cite a product on the market that is specifically aimed at data virtualization, there is plenty of choice for users who want to repurpose an EII, MDM, data warehouse, or EMM product for data virtualization, and plenty of related real-world experience to draw on. Product availability, in other words, is not the key barrier to data-virtualization success; rethinking the enterprise’s information architecture in terms of virtualization is. Just as storage virtualization has let users rethink their storage in terms of what can be grouped together, and thus has saved storage-administration costs, data virtualization is best used to rethink how data stores can be grouped together. Then, by picking the product that is easiest and cheapest to repurpose for data virtualization, users can deliver its benefits immediately and cost-effectively.