And yet, I would assert that three factors difficult to
discern in the heat of implementation may make data virtualization actually better-performing than the system as it
was pre-implementation. These are:
· Querying optimization built into the data virtualization server
· The increasingly prevalent option of cross-data-source queries
· Data virtualization’s ability to coordinate across multiple instances of the same database.
Let’s take these one at a time.
DV Per-Database Optimization
I remember that shortly after IBM released their DV product, around 2005, they did a study in which they asked a group of their expert programmers to write a set of queries against an instance of IBM DB2 (if I recall correctly), and then compared their product’s performance against these programmers’. Astonishingly, their product won, even though, seemingly, the deck was entirely stacked against it: this was new programmer code optimized for the latest release of DB2, based on in-depth experience, and the DV product had that extra layer of software. What happened?
According to IBM, it was simply the fact that no single
programmer could put together the kind of sophisticated optimization that was
in the DV product. Among other things, this DV optimization considered not only
the needs of the individual querying program, but also its context as part of a
set of programs accessing the same database. Now consider that in the typical
implementation, the deck is not as stacked against DV: the programs being
superseded may have been optimized for a previous release and never adequately
upgraded, or the programmers who wrote them or kept them current with the
latest release may have been inexperienced.
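To make this concrete, here is a deliberately simple sketch, in Python with a made-up schema, of one kind of workload-level optimization a DV layer can apply and an individual program cannot: noticing that several programs are asking the same question and answering the repeats from one shared result. This is purely illustrative, not IBM's actual optimizer.

import sqlite3
from functools import lru_cache

# Toy illustration (assumed schema and names): a shared layer in front of one
# database that notices when different programs issue the same query and
# serves repeats from one shared result, the kind of workload-level knowledge
# no single program, optimizing only its own queries, can exploit.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 75.0)])

@lru_cache(maxsize=128)
def shared_query(sql):
    # Run the query once; identical later requests reuse the cached rows.
    return tuple(conn.execute(sql).fetchall())

# Three "programs" asking for the same aggregate hit the database only once.
for _ in range(3):
    print(shared_query("SELECT region, SUM(amount) FROM sales GROUP BY region"))
print(shared_query.cache_info())  # CacheInfo(hits=2, misses=1, ...)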
All in all, there is a significant chance (I wouldn't be surprised if it
was better than 50%) that DV will perform better than the status quo for
existing single-database-using apps “out of the box.”
Moreover, that chance increases steadily over time, and so an apparent performance hit on initial DV implementation will inevitably turn into a performance advantage 1-2 years down the line. Not only does the percentage of “older” SQL-involving code increase over time, but periodic database upgrades (upgrading DB2 every 2 years should pay off in spades, according to my recent analyses) mean that these DV performance advantages widen. And, if emulation is any guide, the performance cost from the extra layer never gets worse than 10-20%.
The Cross-Data-Source Querying Option
Suppose you had to merge two data warehouses or customer-facing apps as part of a takeover or merger. If you used DV to do so, you might see (as in the previous section) the initial queries to either app or data warehouse run slower. However, it seems to me that’s not the appropriate comparison. You have to merge the two somehow. The alternative is to physically merge the data stores, and perhaps the databases accessing those data stores. If so, the comparison is with a merged data store for which neither set of querying code is optimized, and a database for which at least one of the two sets of querying code is not optimized. In that case, DV should have an actual performance advantage, since it can tap into the optimizations of both databases instead of sub-optimizing one or both.
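To picture the difference, here is a minimal sketch in Python, with invented schemas and names (not how any particular DV product is implemented), of the federated alternative: each source database runs the query its own optimizer understands, and only the final combination happens in the virtualization layer.

import sqlite3

# Hypothetical sources standing in for the two merged companies' systems.
def make_source(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, total_spend REAL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    return conn

def push_down(conn, sql, params=()):
    # Each source runs the query with its own optimization and indexes intact.
    return conn.execute(sql, params).fetchall()

source_a = make_source([(1, "Acme Corp", 12000.0), (2, "Bitworks", 300.0)])
source_b = make_source([(7, "Cobalt Ltd", 4500.0)])

# The "virtual" merged customer view: no physical merge of the data stores.
sql = "SELECT id, name, total_spend FROM customers WHERE total_spend >= ?"
merged = push_down(source_a, sql, (1000.0,)) + push_down(source_b, sql, (1000.0,))
print(merged)  # rows from both systems, each retrieved by its own database

Contrast this with a physical merge, where at least one set of querying code would have to run against a database it was never tuned for.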
And we haven’t even considered the physical effort and time of merging two data stores and two databases (and very possibly many more than that, if the operational databases are included).
DV has always sold itself on its major advantages in rapid implementation of merging, and it has consistently proved its case. It is no exaggeration to say that a year saved in merger time is a year of database performance improvement gained.
Again, as noted above, this is not obvious in the first DV
implementation. However, for those who
care to look, it is definitely a real performance advantage a year down the
line.
But the key point about this performance advantage of DV solutions is that this type of coordination of multiple databases and data stores, rather than combining them into one or even feeding copies into one central data warehouse, is becoming a major use case and strategic direction in large-enterprise shops. It was clear from DV Day that major IT shops have finally accepted that not all data can be funneled into a data warehouse, and that the trend is indeed in the opposite direction. Thus, an increasing proportion (I would venture to say, in many cases approaching 50%) of corporate in-house data is going to involve cross-data-source querying, as in the merger case. And there, as we have seen, the performance advantages are probably on the DV side, compared to physical merging.
DV Multiple-Instance Optimization
This is perhaps a consideration more relevant abroad and to medium-sized businesses, where per-state or regional databases and/or data marts must be coordinated. However, it may well be a future direction for data warehouse and app performance optimization; see my thoughts on the Olympic database in a previous post. The idea is that these databases have multiple distributed copies of data. These copies have “grown like Topsy,” on an ad-hoc, as-needed basis. There is no overall mechanism for deciding how many copies to create in which instances, or how to load-balance across copies.
That’s what a data virtualization server can provide. It automagically decides how to optimize given today’s distribution of copies, and it ensures in a distributed environment that the processing is “pushed down” to the right database instance. In other words, data virtualization provides a central processing software layer rather than purely local ones (so no performance hit in most cases), plus load balancing and visibility into the distribution of copies, which allows database administrators to achieve further optimization by changing that copy distribution. And this means that DV, effectively implemented, should deliver better performance than existing solutions in most if not all cases.
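As a rough sketch of what that routing looks like (invented instance names and a crude load metric, purely illustrative rather than any product’s actual cost model), the virtualization layer only needs a map of which instances hold which copies, plus some measure of current load, to push each query down to the best copy:

# Toy routing layer: route each query to the least-loaded instance that
# holds a copy of the table it needs.
COPY_MAP = {
    "orders":    ["dc_east", "dc_west"],           # two copies of orders
    "customers": ["dc_east", "dc_west", "dc_eu"],  # three copies of customers
}

CURRENT_LOAD = {"dc_east": 0.7, "dc_west": 0.2, "dc_eu": 0.4}  # e.g., CPU utilization

def route(table):
    # Pick the least-loaded instance that actually holds a copy of the table.
    copies = COPY_MAP[table]
    return min(copies, key=lambda instance: CURRENT_LOAD[instance])

print(route("orders"))     # dc_west
print(route("customers"))  # dc_west

A DBA who can see this copy map and the load numbers can also decide to add or move copies, which is exactly the further optimization mentioned above.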
Where databases are used both operationally (including for
master data management) and for a data warehouse, the same considerations may
apply – even though we are now in cases where the types of operation (e.g., updates
vs. querying) and the types of data (e.g., customer vs. financial) may be
somewhat different. One-way replication with its attendant ETL-style data
cleansing is only one way to coordinate the overall performance of multiple
instances, not to mention queries spanning them. DV’s added flexibility gives users the
ability to optimize better in many cases across the entire set of use cases.
Again, this advantage may not have been perceived, (a) because not many implementers are focused on the multiple-copy case, and (b) because DV implementation is probably compared against performance on each individual instance instead of, or as well as, against the multiple-instance database as a whole. Nevertheless, at least theoretically, this performance advantage should appear, especially because, in this case, the “extra software layer” should not typically add any DV performance cost.
The User Bottom Line: Where’s The Pain?
It seems that, theoretically at least, we might expect to
see actual performance gains over the next 1-2 years over “business as usual”
from DV implementation in the majority of use cases, and that this proportion
should increase, both over time after implementation and as corporations’
information architectures continue to elaborate. The key to detecting these
advantages now, if the IT shop is willing to do it, is more sophisticated
metrics about just what constitutes a performance hit or a performance
improvement, as described above.
So maybe there isn’t such a performance tradeoff for all the
undoubted benefits of DV, after all. Or
maybe there is. After all, there is a
direct analogy here with agile software development, which seems to lose by
traditional metrics of cost efficiency and quality attention, and yet winds up
lower-cost and higher-quality after all.
The key secret ingredient in both is the ability to react or proact rapidly
in response to a changing environment, and better metrics reveal that overall
advantage. But the “tradeoff” for both DV and agile practices may well be the
pain of embracing change instead of reducing risk. Except that practitioners of agile software
development report that embracing change is actually a lot more fun. Could it be that DV offers a kind of support
for organizational “information agility” that has the same eventual effect:
gain without pain?
Impossible. Gain without pain. Perish the very thought. How will we know we are being
organizationally virtuous without the pains that accompany that virtue? How could we possibly improve without
sacrifice?
Well, I don’t know the answer to that one. However, I do suggest that maybe, rather than
the onus being on DV to prove it won’t torch performance, the onus should perhaps
be on those advocating the status quo, to prove it will. Because it seems to me that there are
plausible reasons to anticipate improvements, not decreases, in real-world performance
from DV.