Monday, October 15, 2012

Does Data Virtualization Cause a Performance Hit? Maybe The Opposite

In the interest of “truthiness”, consultants at Composite Software’s Data Virtualization Day last Wednesday said that the one likely tradeoff for all the virtues of a data virtualization (DV) server was a decrease in performance.  It was inevitable, they said, because DV inserts an extra layer of software between a SQL query and the engine performing that query, and that extra layer’s tasks necessarily increase response time. And, after all, these are consultants who have seen the effects of data-virtualization implementation in the real world. Moreover, it has always been one of the tricks of my analyst trade to note that emulation (in many ways, a software technique similar to DV) must inevitably involve a performance hit, due to the additional layer of software.

And yet, I would assert that three factors, difficult to discern in the heat of implementation, may make data virtualization actually perform better than the system it replaces.  These are:
·         Query optimization built into the data virtualization server
·         The increasingly prevalent option of cross-data-source queries
·         Data virtualization’s ability to coordinate across multiple instances of the same database

Let’s take these one at a time.

DV Per-Database Optimization

I remember that shortly after IBM released their DV product, around 2005, they ran a study in which they asked a group of their expert programmers to write a set of queries against an instance of IBM DB2 (if I recall correctly), and then compared the DV product’s performance against the programmers’ hand-tuned code.  Astonishingly, the product won – and yet, seemingly, the deck was entirely stacked against it.  This was new code optimized for the latest release of DB2, written with in-depth experience, while the DV product carried that extra layer of software. What happened?
According to IBM, the reason was simply that no single programmer could put together the kind of sophisticated optimization built into the DV product. Among other things, this optimization considered not only the needs of the individual querying program, but also its context as part of a set of programs accessing the same database.  Now consider that in the typical implementation, the deck is not so stacked against DV: the programs being superseded may have been optimized for a previous release and never adequately upgraded, or the programmers who wrote them or kept them current may have been inexperienced.  All in all, there is a significant chance (I wouldn't be surprised if it were better than 50%) that DV will perform better than the status quo for existing single-database apps “out of the box.”
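To make this concrete, here is a toy Python sketch (my own illustration, not IBM’s or any vendor’s actual optimizer) of one workload-level optimization a DV layer can apply that no single hand-written program would: caching the result of a read-only query shared by several client programs, with sqlite3 standing in for the underlying engine.

```python
import sqlite3

class DVLayer:
    """Toy data-virtualization layer: caches results of identical
    read-only queries so repeated requests from different client
    programs skip the underlying engine entirely."""
    def __init__(self, conn):
        self.conn = conn
        self.cache = {}        # query text -> result rows
        self.engine_calls = 0  # how often we actually hit the database

    def query(self, sql):
        if sql not in self.cache:
            self.engine_calls += 1
            self.cache[sql] = self.conn.execute(sql).fetchall()
        return self.cache[sql]

# Two "programs" issue the same aggregate; only one engine call is made.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 50.0)])

dv = DVLayer(conn)
q = "SELECT region, SUM(amount) FROM orders GROUP BY region"
rows_a = dv.query(q)    # program A: hits the engine
rows_b = dv.query(q)    # program B: served from the shared cache
print(dv.engine_calls)  # 1
```

Each individual program here is as well-optimized as it can be in isolation; only a layer that sees the whole workload can notice that the second engine call is redundant.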
Moreover, that chance increases steadily over time – and so an apparent performance hit on initial DV implementation will inevitably turn into a performance advantage 1-2 years down the line. Not only does the proportion of “older” SQL-involving code increase over time, but periodic database upgrades (upgrading DB2 every 2 years should pay off in spades, according to my recent analyses) mean that these DV performance advantages widen – and, if emulation is any guide, the performance cost of the extra layer never gets worse than 10-20%.

The Cross-Data-Source Querying Option

Suppose you had to merge two data warehouses or customer-facing apps as part of a takeover or merger.  If you used DV to do so, you might see (as in the previous section) the initial queries to either app or data warehouse run more slowly.  However, it seems to me that’s not the appropriate comparison.  You have to merge the two somehow.  The alternative is to physically merge the data stores, and perhaps the databases accessing those data stores.  In that case, the comparison is with a merged data store for which neither set of querying code is optimized, and a database for which at least one of the two sets of querying code is not optimized. Here DV should have an actual performance advantage, since it can tap into the optimizations of both databases instead of sub-optimizing one or both.
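A minimal sketch of the federated alternative, under assumed names and toy data: two acquired companies’ customer stores stay on their own engines (two sqlite3 databases standing in for the two companies’ systems), the filter is pushed down to each engine, and only the partial results are merged in the virtualization layer.

```python
import sqlite3

# Company A's and company B's customer stores, kept as separate engines.
a = sqlite3.connect(":memory:")
a.execute("CREATE TABLE customers (email TEXT, spend REAL)")
a.executemany("INSERT INTO customers VALUES (?, ?)",
              [("ann@x.com", 120.0), ("bob@y.com", 80.0)])

b = sqlite3.connect(":memory:")
b.execute("CREATE TABLE clients (email TEXT, spend REAL)")
b.executemany("INSERT INTO clients VALUES (?, ?)",
              [("bob@y.com", 40.0), ("carol@z.com", 300.0)])

def federated_total_spend(min_spend=0.0):
    """Push the filter down to each engine, then merge the partial
    results in the virtualization layer -- no physical merge needed."""
    merged = {}
    for conn, table in ((a, "customers"), (b, "clients")):
        rows = conn.execute(
            f"SELECT email, spend FROM {table} WHERE spend >= ?",
            (min_spend,)).fetchall()
        for email, spend in rows:
            merged[email] = merged.get(email, 0.0) + spend
    return merged

# bob appears in both stores; his totals are combined in the DV layer.
print(federated_total_spend())
# {'ann@x.com': 120.0, 'bob@y.com': 120.0, 'carol@z.com': 300.0}
```

Each engine executes its filter with its own, already-tuned optimizer; neither set of querying code had to be rewritten for a physically merged store.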
And we haven’t even considered the physical effort and time of merging two data stores and two databases (and, counting the operational databases, very possibly many more than that).  DV has always sold itself on its major advantages in rapid implementation of merging – and has consistently proved its case.  It is no exaggeration to say that a year saved in merger time is a year of database performance improvement gained.
Again, as noted above, this is not obvious in the first DV implementation.  However, for those who care to look, it is definitely a real performance advantage a year down the line.
But the key point about this performance advantage of DV solutions is that coordinating multiple databases and data stores, instead of combining them into one or feeding copies into a central data warehouse, is becoming a major use case and strategic direction in large-enterprise shops. It was clear from DV Day that major IT shops have finally accepted that not all data can be funneled into a data warehouse, and that the trend is in the opposite direction.  Thus, an increasing proportion (I would venture to say, in many cases approaching 50%) of corporate in-house data is going to involve cross-data-source querying, as in the merger case. And there, as we have seen, the performance advantages are probably on the DV side, compared to physical merging.

DV Multiple-Instance Optimization

This is perhaps a consideration more relevant outside the US and to medium-sized businesses, where per-state or regional databases and/or data marts must be coordinated. However, it may well be a future direction for data warehouse and app performance optimization – see my thoughts on the Olympic database in a previous post. The idea is that these databases have multiple distributed copies of data. These copies have “grown like Topsy,” on an ad-hoc, as-needed basis.  There is no overall mechanism for deciding how many copies to create in which instances, or how to load-balance across copies.
That’s what a data virtualization server can provide.  It automagically decides how to optimize given today’s incidence of copies, and ensures in a distributed environment that processing is “pushed down” to the right database instance. In other words, data virtualization provides a central processing software layer rather than local ones – so no performance hit in most cases – plus load balancing and visibility into the distribution of copies, which lets database administrators achieve further optimization by changing that copy distribution. This means that DV, effectively implemented, should deliver better performance than existing solutions in most if not all cases.
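The copy-aware routing described above can be sketched in a few lines of Python. This is a deliberately simplified illustration of the idea, not any vendor’s actual algorithm: the router knows which instances hold a copy of each table (a placement map I have invented for the example) and sends each query to the least-loaded copy.

```python
class CopyRouter:
    """Toy DV router: knows which instances hold a copy of each table
    and sends each query to the least-loaded copy (simple load balancing,
    a stand-in for a real DV server's cost-based routing)."""
    def __init__(self, placement):
        self.placement = placement  # table name -> list of instance names
        self.load = {i: 0 for copies in placement.values() for i in copies}

    def route(self, table):
        copies = self.placement[table]
        target = min(copies, key=lambda i: self.load[i])  # least loaded
        self.load[target] += 1
        return target

router = CopyRouter({"sales": ["east-db", "west-db"], "hr": ["east-db"]})
print([router.route("sales") for _ in range(4)])
# ['east-db', 'west-db', 'east-db', 'west-db']
```

Because the placement map is visible in one place, an administrator can also see at a glance which tables are under-replicated and adjust the copy distribution, which is the further optimization mentioned above.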
Where databases are used both operationally (including for master data management) and for a data warehouse, the same considerations may apply – even though we are now in cases where the types of operation (e.g., updates vs. querying) and the types of data (e.g., customer vs. financial) may be somewhat different. One-way replication with its attendant ETL-style data cleansing is only one way to coordinate the overall performance of multiple instances, not to mention queries spanning them.  DV’s added flexibility gives users the ability to optimize better in many cases across the entire set of use cases.
Again, this advantage may not have been perceived, (a) because few implementers focus on the multiple-copy case, and (b) because DV implementations are probably compared against the performance of each individual instance, rather than against the multiple-instance database as a whole. Nevertheless, at least in theory, this performance advantage should appear – especially because, in this case, the “extra software layer” should typically add no performance cost.

The User Bottom Line:  Where’s The Pain?

It seems that, theoretically at least, we might expect to see actual performance gains over “business as usual” from DV implementation in the majority of use cases over the next 1-2 years, and that this proportion should increase, both over time after implementation and as corporations’ information architectures continue to grow more elaborate. The key to detecting these advantages now, if the IT shop is willing, is more sophisticated metrics for just what constitutes a performance hit or a performance improvement, as described above.
So maybe there isn’t such a performance tradeoff for all the undoubted benefits of DV, after all.  Or maybe there is.  After all, there is a direct analogy here with agile software development, which seems to lose by traditional metrics of cost efficiency and attention to quality, and yet winds up lower-cost and higher-quality after all.  The key secret ingredient in both is the ability to react, or proact, rapidly in response to a changing environment, and better metrics reveal that overall advantage. But the “tradeoff” for both DV and agile practices may well be the pain of embracing change instead of reducing risk.  Except that practitioners of agile software development report that embracing change is actually a lot more fun.  Could it be that DV offers a kind of support for organizational “information agility” that has the same eventual effect: gain without pain?
Impossible. Gain without pain. Perish the very thought.  How will we know we are being organizationally virtuous without the pains that accompany that virtue?  How could we possibly improve without sacrifice? 
Well, I don’t know the answer to that one.  However, I do suggest that maybe, rather than the onus being on DV to prove it won’t torch performance, the onus should perhaps be on those advocating the status quo, to prove it will.  Because it seems to me that there are plausible reasons to anticipate improvements, not decreases, in real-world performance from DV.
