On a weekend, bleak and dreary, as I was pondering, weak and
weary, on the present state and future of data virtualization technology, I was
struck by a sudden software design thought.
I have no idea whether it's of worth, but I put it out there for discussion.
The immediate cause was an assertion that one future use for
data virtualization servers was as an appliance, in the present meaning of the term:
hardware on which pretty much everything is designed for maximum performance
of a particular piece of software, such as an application, a database, or, in
this case, a data virtualization solution.
That I question: by its very
nature, most of the keys to data virtualization performance lie on the servers of the
databases and file management tools that data virtualization servers invoke, and it
seems to me likely that dedicated data-virtualization hardware would complicate
the architecture (adding administrative and other costs) while yielding only a
minimal gain in overall performance. However, the assertion did lead to my thought, and I
call it the "Olympic database."
The Olympic Database
Terrible name, you say.
This guy will never be a marketer, you say. That's true. In fact, when I was asked, as a
programmer, to name a Prime Computer graphical-user-interface product, I
suggested Primal Screens.
For some reason, no one’s asked me to name something since then.
Anyway, the idea runs as follows. Assemble the usual array of databases (and
Hadoop, yada). Each will specialize in
handling particular types of data. One
can imagine splitting relational data between what is suited to columnar storage
and what is not, and then applying a columnar database to the one and a
traditional relational database to the other, as Oracle Exadata appears to do. But
here's the twist: each database will
also contain a relatively small subset of the data in at least one other
database, possibly of a different type. In
other words, up to 10%, say, of each database will be a duplicate of another
database: typically the data that cross-database queries most often want,
or the data that a database would once have incorporated
just to save time switching between databases.
In effect, each database will have a cache of data in which it does not specialize, with its own interface to that cache, SQL
or otherwise.
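To make the layout concrete, here is a minimal sketch in Python. Every name in it (SpecializedStore, overlap_cache, the three example stores) is invented for illustration; no particular product is implied.

from dataclasses import dataclass, field

@dataclass
class SpecializedStore:
    # One "ring": a store specializing in some data, plus a small overlap
    # cache duplicating a slice (~10%) of a neighboring store's data.
    name: str
    primary: set = field(default_factory=set)        # datasets it specializes in
    overlap_cache: set = field(default_factory=set)  # duplicated foreign slice

    def holds(self, dataset):
        # True if this store can answer queries on the dataset locally.
        return dataset in self.primary or dataset in self.overlap_cache

# Three rings, each intersecting another via its overlap cache.
columnar = SpecializedStore("columnar", {"sales_facts"}, {"customer_dim"})
rowstore = SpecializedStore("row_relational", {"customer_dim"}, {"clickstream"})
hadoop = SpecializedStore("hadoop", {"clickstream"}, {"sales_facts"})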
On top of that, we place a data virtualization server. Only this server’s primary purpose is not
necessarily to handle data of varying types that a particular database can’t
handle. Rather, the server’s purpose is
to carry out load balancing and query optimization across the entire set of
databases. It does this by choosing the
correct database for a particular type of data (any multiplexer can do that),
but also by picking the right database among two or several options when all
the data is found in more than one database, and the right combination of
databases when no single database has all the data needed. It is, in effect, a very flexible way of
sacrificing some disk space on duplicate data for the sake of query
optimization, just as relational databases once sacrificed pure third normal form (3NF)
and found that duplicating data in a star or snowflake schema yielded major performance
improvements in large-scale querying.
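A rough sketch of that choice, continuing the hypothetical example above; the load figures and the greedy fallback are stand-ins for what a real optimizer would do with proper cost models.

def route(needed, stores, load):
    # Prefer the least-loaded single store that holds every needed dataset.
    candidates = [s for s in stores if all(s.holds(d) for d in needed)]
    if candidates:
        # Duplicate caches create options; use them to balance load.
        return [min(candidates, key=lambda s: load[s.name])]
    # No single store covers the query: greedily assemble a combination.
    plan, remaining = [], set(needed)
    for s in sorted(stores, key=lambda s: load[s.name]):
        covered = {d for d in remaining if s.holds(d)}
        if covered:
            plan.append(s)
            remaining -= covered
    return plan if not remaining else None  # None: data unavailable anywhere

# Both rowstore and hadoop hold "clickstream"; the less loaded one wins.
plan = route({"clickstream"}, [columnar, rowstore, hadoop],
             {"columnar": 0.9, "row_relational": 0.4, "hadoop": 0.7})
print([s.name for s in plan])  # ['row_relational']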
Now picture this architecture in your mind, with the data
stores as rings. Each ring will
intersect in a small way with at least one other data-store “ring” of
data. Kind of like the Olympic
rings. Even if it’s a terrible name,
there’s a reason I called it the Olympic database.
It seems to me that such an Olympic database would have three
advantages over anything out there:
specialization in multiple-data-type processing, in an era in which that's
becoming more and more common; a big jump in performance from the increased ability
to load balance and optimize across databases; and a big jump in the ability to
change the caches, and hence the load balancing, dynamically, not just every
time the database vendor adds a new data type.
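That third advantage deserves a sketch of its own: the virtualization layer could watch cross-database query traffic and repopulate the overlap caches on the fly. Again, the names and the crude hottest-first policy are purely illustrative.

from collections import Counter

def rebalance(stores, cross_db_hits, cache_slots=2):
    # Refill each store's overlap cache with the hottest foreign datasets,
    # so tomorrow's cross-database queries can be answered in one place.
    hot = [d for d, _ in cross_db_hits.most_common()]
    for store in stores:
        foreign = [d for d in hot if d not in store.primary]
        store.overlap_cache = set(foreign[:cache_slots])  # cold entries drop out

# After a burst of queries joining sales_facts with customer_dim, the two
# rings grow intersections containing each other's hot data.
rebalance([columnar, rowstore], Counter({"sales_facts": 40, "customer_dim": 25}))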
Why Use a Data Virtualization Server?
Well, because most of the technology is already there, and that's
certainly not true of other databases or file management systems. To optimize queries, the "super-database" has
to know just which combination of specialization and non-specialization will
yield better performance: say, columnar versus Hadoop "delayed consistency." That is something a data
virtualization solution and supplier know in general, and no one else does. We
can argue forever about whether incorporating XML data in a relational database
beats using two specialized databases, but the answer really is, it depends;
and only data virtualization servers know just how it depends.
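To illustrate the "it depends": a toy comparison, with entirely invented cost curves, of the kind of tradeoff only the virtualization layer is positioned to evaluate.

def pick_engine(query_profile, cost_models):
    # Return the engine whose (hypothetical) cost model scores lowest.
    return min(cost_models, key=lambda engine: cost_models[engine](query_profile))

# Made-up curves: XML-in-relational wins when XML is a small share of the
# workload; a specialized XML store wins once XML dominates.
cost_models = {
    "relational_with_xml": lambda p: 1.0 + 4.0 * p["xml_fraction"],
    "specialized_xml_store": lambda p: 2.5 + 0.5 * p["xml_fraction"],
}
print(pick_engine({"xml_fraction": 0.1}, cost_models))  # relational_with_xml
print(pick_engine({"xml_fraction": 0.9}, cost_models))  # specialized_xml_store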
The price for such a use of a data virtualization server
would be that data virtualization would need to go pretty much whole hog in
being a “database veneer”: full admin
tools, etc., just like a regular database. But here’s the thing: we wouldn’t get rid of the old data
virtualization server. It’s just as useful as it ever was, for the endless new
cases of new data types that no database has yet combined with its own
specialization. All the use cases of the
old data virtualization server will still be there. An evolution of the data virtualization
server will simply accept a fixed set of databases to support, with a fixed set of data
types, in exchange for doing better under those conditions than any of the old
databases could.
Summary
Ta da! The Olympic database! So what do you think?