Monday, October 8, 2012

A Naive Idea About the Future of Data Virtualization Technology


On a weekend, bleak and dreary, as I pondered, weak and weary, the present state and future of data virtualization technology, I was struck by a sudden software design thought.  I have no idea whether it’s worth anything, but I put it out there for discussion.

The immediate cause was an assertion that one future use for data virtualization servers would be as an appliance – in the present meaning of the term, hardware on which pretty much everything is designed for maximum performance of a particular piece of software, such as an application, a database, or, in this case, a data virtualization solution.  That I question:  by its very nature, most of the keys to data virtualization performance lie on the servers of the databases and file management tools that data virtualization servers invoke, and it seems likely to me that dedicated data-virtualization hardware would complicate the architecture (thereby adding administrative and other costs) while achieving only a minimal gain in overall performance.  However, it did lead to my thought, and I call it the “Olympic database.”

The Olympic Database

Terrible name, you say.  This guy will never be a marketer, you say.  That’s true. In fact, when I was asked, as a programmer, to name a piece of Prime Computer graphical-user-interface software, I suggested Primal Screens.  For some reason, no one’s asked me to name anything since.

Anyway, the idea runs as follows.  Assemble the usual array of databases (and Hadoop, yada).  Each will specialize in handling particular types of data.  One can imagine splitting relational data between what suits a columnar layout and what doesn’t, then applying a columnar database to the one and a traditional relational database to the other, as Oracle Exadata appears to do. But here’s the twist:  each database will also contain a relatively small subset of the data in at least one other database – maybe of a different type.  In other words, up to 10%, say, of each database will be a duplicate of another database – typically, the data that cross-database queries will most often want, or the data that a database used to incorporate just to save the time of switching between databases.  In effect, each database will have a cache of data in which it does not specialize, with its own interface to it, SQL or otherwise.
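
To make the overlap concrete, here’s a minimal sketch in Python – all store names, table names, and sizes are invented for illustration, not taken from any real product:

```python
from dataclasses import dataclass, field

# A rough sketch of the overlap idea (all names here are made up).
# Each store specializes in one kind of data but also carries a small
# replicated slice -- the cross-store "cache" -- of a neighboring store.

@dataclass
class DataStore:
    name: str
    specialty: str                               # e.g. "columnar", "row", "hadoop"
    tables: set = field(default_factory=set)     # data this store owns
    cache: dict = field(default_factory=dict)    # neighbor name -> replicated tables

    def holds(self, table: str) -> bool:
        """True if the table is local, whether owned or cached."""
        return table in self.tables or any(
            table in slice_ for slice_ in self.cache.values())

# Three "rings", each intersecting another in a small way.
columnar = DataStore("columnar", "columnar", {"sales_facts", "clicks"})
rowstore = DataStore("rowstore", "row",      {"customers", "orders"})
hadoop   = DataStore("hadoop",   "hadoop",   {"raw_logs"})

# The ~10% overlaps: hot cross-query data duplicated into a neighbor.
columnar.cache["rowstore"] = {"customers"}   # customers join to sales_facts
rowstore.cache["hadoop"]   = {"raw_logs"}    # a recent slice of the logs
```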

On top of that, we place a data virtualization server.  Only this server’s primary purpose is not necessarily to handle data of varying types that any particular database can’t handle.  Rather, its purpose is to carry out load balancing and query optimization across the entire set of databases.  It does this by choosing the correct database for a particular type of data – any multiplexer can do that – but also by picking the right database from two or more options when all the data is found in more than one database, and the right combination of databases when no single database has all the data needed.  It is, in effect, a very flexible way of sacrificing some disk space to duplicate data for the sake of query optimization – just as relational data warehouses sacrificed pure 3NF and found that duplicating data in a star or snowflake schema yielded major performance improvements in large-scale querying.
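
Continuing the sketch above, the routing choice reduces to “who holds everything, and at what cost?”  The cost weights here are invented stand-ins for the far richer optimizer statistics a real data virtualization server would keep:

```python
from itertools import combinations

# The virtualization layer routes each query to the cheapest store --
# or combination of stores -- that can cover it.
STORES = [columnar, rowstore, hadoop]

# Invented per-store cost weights, standing in for optimizer statistics.
COST = {"columnar": 1.0, "rowstore": 1.5, "hadoop": 4.0}

def route(tables_needed):
    """Pick the store(s) that should answer a query over `tables_needed`."""
    # Best case: one store (owned data plus cache) covers the whole query.
    singles = [s for s in STORES if all(s.holds(t) for t in tables_needed)]
    if singles:
        return [min(singles, key=lambda s: COST[s.name])]
    # Otherwise: the cheapest combination of stores that covers the set.
    for size in range(2, len(STORES) + 1):
        covering = [c for c in combinations(STORES, size)
                    if all(any(s.holds(t) for s in c) for t in tables_needed)]
        if covering:
            return list(min(covering, key=lambda c: sum(COST[s.name] for s in c)))
    raise LookupError("no combination of stores covers the query")

# Thanks to the cached customers slice, this join never leaves columnar:
print([s.name for s in route({"sales_facts", "customers"})])  # ['columnar']
```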

Now picture this architecture in your mind, with the data stores as rings.  Each ring intersects, in a small way, with at least one other data-store ring.  Kind of like the Olympic rings.  Even if it’s a terrible name, there’s a reason I called it the Olympic database.

It seems to me that such an Olympic database would have three advantages over anything out there:  specialization in multiple-data-type processing, in an era when that’s becoming more and more common; a big jump in performance from the increased ability to load balance and optimize across databases; and a big jump in the ability to change the caches, and hence rebalance the load, dynamically – not just whenever the database vendor adds a new data type.
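
That last advantage is easy to see in the toy example above: shifting the load balance amounts to reassigning a cache slice at runtime (again, every name here is invented):

```python
# Changing a cache is just reassigning a slice, so the load balance can
# shift on the fly -- no vendor release required.
hadoop.cache["columnar"] = {"clicks"}                    # clicks now join locally
print([s.name for s in route({"raw_logs", "clicks"})])   # ['hadoop']
```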

Why Use a Data Virtualization Server?

Well, because most of the technology is already there – and that’s certainly not true of other databases or file management systems.  To optimize queries, the “super-database” has to know just which combination of specialization and non-specialization will yield better performance – say, columnar versus Hadoop “delayed consistency”.  That is precisely what a data virtualization solution and supplier knows, in general, and no one else does. We can argue forever about whether incorporating XML data in a relational database beats using two specialized databases – but the real answer is, it depends; and only data virtualization servers know just how it depends.
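
Here’s that “it depends” in miniature – a toy cost comparison in which every number is invented for the sake of the example: specialized stores scan faster, but pay a per-row network cost for joins that cross between them.

```python
# One hybrid store scans everything more slowly; two specialists scan
# fast but pay a network cost for every row joined across them.
# All coefficients below are made up purely for illustration.

def pick_plan(xml_rows, rel_rows, join_rows):
    hybrid_cost = 1.0 * (xml_rows + rel_rows)
    split_cost  = 0.4 * xml_rows + 0.4 * rel_rows + 5.0 * join_rows
    return "one hybrid store" if hybrid_cost <= split_cost else "two specialists"

print(pick_plan(1_000_000, 1_000_000, join_rows=10_000))    # two specialists
print(pick_plan(1_000_000, 1_000_000, join_rows=500_000))   # one hybrid store
```

Which plan wins flips entirely on the cross-type join volume – and that is exactly the kind of knowledge a data virtualization supplier accumulates.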

The price of such a use of a data virtualization server would be that data virtualization would need to go pretty much whole hog in being a “database veneer”:  full admin tools and the rest, just like a regular database. But here’s the thing:  we wouldn’t get rid of the old data virtualization server. It’s just as useful as it ever was, for the endless new cases of new data types that no database has yet combined with its own specialization.  All the use cases of the old data virtualization server will still be there.  And this evolution of the data virtualization server would accept a fixed set of databases to support, with a fixed set of data types, in exchange for doing better than any of the old databases could under those conditions.

Summary

Ta da! The Olympic database! So what do you think?