Friday, January 13, 2012

The Other BI: HP Vertica and Columnar Databases

This blog post highlights a software company and technology that I view as potentially useful to organizations investing in business intelligence (BI) and analytics in the next few years. Note that, in my opinion, this company and solution are not typically “top of mind” when we talk about BI today.

The Importance of Vertica-Type Columnar Database Technology to BI

Last year, I wrote a blog post saying that HP would likely underestimate the columnar database technology in Vertica, and that if so it would be missing a major opportunity. Since then, HP has been pretty quiet about Vertica, but I have partially changed my mind, to the point where I want to call attention to Vertica as a less visible candidate for IT buyers to get the full benefits of columnar database technology over the next 2-3 years.

Let’s start with columnar technology. Here, I want to go more in-depth into Vertica’s core technology than usual, because it’s an excellent way to begin to see the benefits of columnar beyond traditional row-oriented databases.

The original idea of Vertica was to recast the relational database to focus on the (data warehousing) case where there are few if any updates. The redesign started with the idea that the data should be stored in "columns" rather than rows. Because the columns don't have to follow relational dogma, they can be stored in a highly compressed format, using techniques such as inverted lists, bit-mapped indexing, and hashing, as appropriate. Thus, (a) the database can use the column format to zero in faster on the data that the query is gathering, and (b) because the data is compressed by an average factor of 10 (according to Vertica), more data can be crammed into main memory for faster processing. Result: a claimed 10-100 times speedup in performance, comparable to in-memory databases but far more scalable. It also means the database can handle at least 10 times more data (say, 100 terabytes instead of 10) with the same performance for a given query; or that the data center can use an order of magnitude less storage.
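To make the column-plus-bitmap idea concrete, here is a minimal sketch in Python. It is not Vertica's implementation; the toy table, the function names, and the use of Python integers as bitmaps are all assumptions made for illustration. It pivots rows into columns, builds one bitmap per distinct value, and answers a two-condition query by ANDing bitmaps without touching the other columns.

```python
# Hypothetical toy table: (order_id, region, status). Not real data.
rows = [
    (1, "EAST", "SHIPPED"),
    (2, "EAST", "SHIPPED"),
    (3, "EAST", "OPEN"),
    (4, "WEST", "OPEN"),
    (5, "WEST", "OPEN"),
    (6, "WEST", "SHIPPED"),
]

def to_columns(rows, names):
    """Pivot a list of row tuples into a dict of column lists."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set if row i holds it."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

def matching_rows(bitmap):
    """Return the row positions whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

columns = to_columns(rows, ["order_id", "region", "status"])
region_index = build_bitmap_index(columns["region"])
status_index = build_bitmap_index(columns["status"])

# WHERE region = 'WEST' AND status = 'OPEN' becomes a single bitwise AND.
hits = matching_rows(region_index["WEST"] & status_index["OPEN"])
open_west_orders = [columns["order_id"][i] for i in hits]
print(open_west_orders)  # [4, 5]
```

With only two distinct regions, the whole region column collapses to two small bitmaps, which is the kind of saving that lets more data fit in main memory.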

Now, all this does not come without a cost, and the typical cost would at first seem to be speed of updating. That is, the column storage format requires more revision of the data stored on disk when an update arrives, so updates are slower. But this is counteracted by the ability to load more localized data at once into main memory in compressed form, for faster in-memory updating. Only at update frequencies typical of old-style operational online transaction processing (OLTP) does the row-oriented relational database have a clear edge.

The elaboration of the design in Vertica is that the basic data is also stored as "projections" (aka materialized views). That is, a set of columns in a tuple is stored one (relational) way; each column also shows up in a projection, but the projection is cross-tuple (one from tuple A, one from tuple B, etc.). This accomplishes two things: one, it gives an alternative way of querying which may be faster than basic storage, and two, it gives redundancy and therefore robustness, in a similar way to RAID 5 (projections can be "striped" across disks).
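The projection idea can be sketched roughly as follows, assuming a projection is simply a subset of columns stored column by column and sorted on one of them. The table, the function names, and the selection rule are hypothetical and far simpler than what a real optimizer does.

```python
from bisect import bisect_left, bisect_right

# Hypothetical toy table; values are illustrative only.
table = [
    {"order_id": 1, "region": "WEST", "amount": 40},
    {"order_id": 2, "region": "EAST", "amount": 15},
    {"order_id": 3, "region": "WEST", "amount": 25},
    {"order_id": 4, "region": "EAST", "amount": 60},
]

def build_projection(table, columns, sort_key):
    """Store a subset of columns, column by column, sorted on one of them."""
    ordered = sorted(table, key=lambda row: row[sort_key])
    return {
        "sort_key": sort_key,
        "columns": {name: [row[name] for row in ordered] for name in columns},
    }

def pick_projection(projections, needed, filter_column):
    """Prefer a projection that covers the query and is sorted on the filter."""
    covering = [p for p in projections if needed <= set(p["columns"])]
    for p in covering:
        if p["sort_key"] == filter_column:
            return p
    return covering[0] if covering else None

def sum_where_equal(projection, filter_column, value, sum_column):
    """Sum sum_column over rows where filter_column == value.

    Assumes the projection is sorted on filter_column, so the matching
    rows form one contiguous slice that binary search can locate.
    """
    keys = projection["columns"][filter_column]
    lo, hi = bisect_left(keys, value), bisect_right(keys, value)
    return sum(projection["columns"][sum_column][lo:hi])

projections = [
    build_projection(table, ["order_id", "region", "amount"], "order_id"),
    build_projection(table, ["region", "amount"], "region"),
]
chosen = pick_projection(projections, {"region", "amount"}, "region")
west_total = sum_where_equal(chosen, "region", "WEST", "amount")
print(chosen["sort_key"], west_total)  # region 65
```

Note that "region" and "amount" each live in both projections. In this toy model, that duplication is what gives both the faster query path and a second copy to fall back on if one projection is lost.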

Now, here's where things get really interesting. Practically speaking, today, in data-warehousing-type databases, updates via "load windows" are becoming more and more frequent, to the point where data is pretty up-to-date and updates are a bigger part of data warehousing. To keep "write locks" from gumming up performance (especially with column update being slower), Vertica splits the storage into a Write-optimized Column Store (WOS; effectively, a cache) and a Read-optimized Column Store (ROS). Periodically, the contents of the WOS are moved into the ROS. So the write locks for the updates only interfere with reads when there’s a mass update. At the same time, such a mass update can re-store whole chunks of the ROS for optimum storage efficiency. Moreover, to gain currency, a query can be carried out across both the ROS and the WOS. And, because there is all this redundancy, there is no need for logs, which is another performance improvement. Note that because of its redundancy, Vertica doesn't need to do roll-back/roll-forward or backup/restore.
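A toy model of the WOS/ROS split might look like the sketch below. The class name, the row-count threshold that triggers the move, and the two-column layout are assumptions for illustration; they are not taken from Vertica. New rows land in an unsorted buffer, a bulk step periodically merges them into the sorted column store, and queries read both so that recent rows are visible.

```python
class HybridColumnStore:
    """Toy model: an unsorted write buffer (WOS) in front of a sorted read store (ROS)."""

    def __init__(self, moveout_threshold=3):
        self.wos = []  # recent rows as (key, value), in arrival order
        self.ros = {"key": [], "value": []}  # columns, kept sorted by key
        self.moveout_threshold = moveout_threshold  # assumed trigger: row count

    def insert(self, key, value):
        """Appending to the buffer is cheap; the sorted columns are not touched."""
        self.wos.append((key, value))
        if len(self.wos) >= self.moveout_threshold:
            self.moveout()

    def moveout(self):
        """Merge buffered rows into the ROS and re-sort it in one bulk step."""
        merged = list(zip(self.ros["key"], self.ros["value"])) + self.wos
        merged.sort(key=lambda row: row[0])
        self.ros = {
            "key": [k for k, _ in merged],
            "value": [v for _, v in merged],
        }
        self.wos = []

    def total(self):
        """Query across both stores so that buffered rows are included."""
        return sum(self.ros["value"]) + sum(v for _, v in self.wos)


store = HybridColumnStore(moveout_threshold=3)
for key, value in [(5, 10), (1, 20), (3, 30), (2, 5)]:
    store.insert(key, value)

print(store.ros["key"])  # [1, 3, 5]
print(store.wos)         # [(2, 5)]
print(store.total())     # 65
```

The design point is that the expensive re-sorting of column data happens once per batch rather than once per row, while reads still see the newest data.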

The net of all this for IT buyers is that columnar databases in general, and Vertica in particular, should be able to deliver on average much better performance than traditional relational databases in the majority of cases that are not highly update-intensive, due mostly to their compression abilities. Adding other technologies such as in-memory processing to both alternatives will not alter this superiority.

The Relevance of HP Vertica to BI

This kind of approach cries out for integration with or development of sophisticated admin tools, expansion beyond data warehousing and analytics to “mixed” transactions in competition with the NoSQL fad, better programming tools to build up a war chest of business/industry customized solutions, and using a relational database as an OLTP complement. The resulting data-management platform would be a solid alternative for all sizes of enterprise to the “relational fits all” or “let the thousand flowers bloom” strategies of most organizations.

Once this platform is in place, it needs to become the keystone of enterprise architectures, not just an analytics or business intelligence “super-scaling” engine. That means adding integration with semi-structured and unstructured data. It also means adding major functionality for handling content, and integration with storage software for additional performance optimization. And so, anticipating that HP would not do this, I criticized the HP acquisition of Vertica last year.

Well, two things happened: HP did more than I thought it would, and competitors did less. HP bought a company called Autonomy, which added semi-structured/unstructured data support. Necessarily, this takes Vertica beyond pure data-warehousing-style analytics into a more update-intensive world, and HP’s redirection of Mercury Interactive towards agile ALM (application lifecycle management) associated Vertica with better programming tools. Meanwhile, SAP took its eye off Sybase IQ with its focus on HANA, IBM at least temporarily walked away from its Netezza semi-columnar database technology, and Oracle’s columnar-optional appliance ran into questions about its long-term hardware growth path. In other words, the result of half a loaf from HP and less than half a loaf from everyone else is that Vertica is moving towards leadership status in delivering columnar database technology to all scales of BI and analytics.

Meanwhile, of course, only the deluded think that HP will suddenly vanish, while database technology and the rest of the new software embed themselves ever deeper in HP’s DNA. HP Vertica is going to be around for quite a while; and it will be an attractive option for quite a while.

Potential Uses of Vertica-Type Columnar-Based BI for IT

The use case for a columnar database in IT is straightforward. IT should use a columnar database in new projects as an alternative or complement to a traditional relational database, unless the operations are update-intensive, in which case row-oriented relational is preferred. As a complement, columnar databases operate on a “switching” basis, in which an overall engine decides which queries should be allocated to row-oriented and which to columnar, usually on the basis of whether two or more of the “fields” involved in an operation can be compressed highly by using a columnar format. Oracle (and, until recently, IBM Netezza) takes this approach; but IT can also build its own switching mechanism.
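For IT shops considering their own switching mechanism, the routing rule described above could be sketched along these lines. Everything here is hypothetical: the per-field compression ratios, the cutoff for "compresses highly", and the simplification that every update goes to the row store are assumptions for illustration, not how Oracle or Netezza decide.

```python
from dataclasses import dataclass

# Hypothetical per-field compression ratios, e.g. estimated by sampling the data.
COMPRESSION_RATIO = {
    "customer_name": 12.0,
    "product_code": 9.0,
    "order_id": 1.2,
    "notes": 1.5,
}

HIGH_COMPRESSION = 5.0  # assumed cutoff for "can be compressed highly"


@dataclass
class Operation:
    fields: tuple
    is_update: bool = False


def route(op, ratios=COMPRESSION_RATIO, cutoff=HIGH_COMPRESSION):
    """Return "columnar" or "row" for one operation.

    Simplification: every update goes to the row store. Reads go to the
    columnar store when two or more of their fields compress highly.
    """
    if op.is_update:
        return "row"
    compressible = [f for f in op.fields if ratios.get(f, 1.0) >= cutoff]
    return "columnar" if len(compressible) >= 2 else "row"


print(route(Operation(("customer_name", "product_code"))))  # columnar
print(route(Operation(("order_id", "notes"))))              # row
print(route(Operation(("customer_name", "product_code"), is_update=True)))  # row
```

In practice the ratios would have to be measured on real data and refreshed as the data changes; the point of the sketch is only that the decision can be a small, explicit rule.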

And that’s it. Over the next 2-3 years, if not already, columnar can scale as high in querying, can integrate with as many data types and upper-level tools and applications, and can evolve to greater performance/scalability just as rapidly as the traditional row-oriented database. In the long run in a lot of use cases, and sometimes in the short run, that favors Vertica-type columnar.

However, right now, columnar in some cases needs to “grow into” its assigned role in a new project, by adding administrative tools for particular cases. Therefore, in most applications where 24x7 operation and an adequate level of customer response time are business-critical, relational row-oriented should still be preferred. That should leave plenty of analytical and other BI uses for which Vertica-type columnar database software will deliver an important performance advantage.

The Bottom Line for IT Buyers

Over the next few years, IT buyers can take one of two views: the author of this blog post is prescient, columnar will replace row-oriented in the majority of new applications in BI and other areas, and we should include columnar in all our short lists from now on; or, the author of this blog post is wrong about the future, but columnar is useful for some things right now, and trying to standardize on one database is a fool’s game that we no longer bother to try to play. If IT buyers hold the second view, then they should be focused on applying columnar to analysis of huge amounts of structured data with “sparse” fields where high compression is achievable – like five-field customer names (Mr. John Taylor Jakes, Jr.) and product codes. Spend the resulting improvements on increased performance, lowered storage costs, or both.
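To get a rough feel for why fields like product codes compress so well in a column, here is a small sketch using dictionary encoding. This is a stand-in technique chosen for simplicity, not one the discussion above attributes to Vertica, and the product codes and the byte-counting model are made up for illustration.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code plus a lookup table."""
    dictionary = {}
    codes = []
    for value in values:
        # A value seen for the first time gets the next unused code.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Rebuild the original column from the lookup table and the codes."""
    reverse = {code: value for value, code in dictionary.items()}
    return [reverse[code] for code in codes]


def estimated_ratio(values, dictionary, codes):
    """Crude size model: one byte per character, one byte per code.

    Only valid for up to 256 distinct values; real storage formats differ.
    """
    raw_bytes = sum(len(v) for v in values)
    encoded_bytes = sum(len(v) for v in dictionary) + len(codes)
    return raw_bytes / encoded_bytes


# Hypothetical column: 3,000 rows drawn from only two distinct product codes.
product_codes = ["PRD-000123-BLK", "PRD-000123-BLK", "PRD-000987-RED"] * 1000

dictionary, codes = dictionary_encode(product_codes)
ratio = estimated_ratio(product_codes, dictionary, codes)
print(len(dictionary), codes[:3])
print(f"estimated compression ratio: {ratio:.1f}x")
```

Under this crude model the column shrinks by more than a factor of 10, because the long repeated strings are stored once and each row keeps only a one-byte code. A column of mostly unique values would see little benefit, which is why the choice of fields matters.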

Again, this is not a matter of a pre-short list, unless you have a “gray area” BI project involving somewhat update-intensive or somewhat business-critical little-downtime apps, in which case you want to wait for columnar to evolve a little. In all other cases, HP Vertica should go on the short list along with the obvious others, like Sybase IQ. Right now, Vertica appears to be ahead both in some of the needed features to adapt to new analytics needs and in speed of evolution. One never knows – but over the next year, that leadership role may continue.

Above all, IT buyers should not listen to any FUD from traditional relational vendors suggesting that this is yet another new technology, like object databases, that will eventually fall to earth with a thud. Columnar database technology proved its superiority in many situations long ago in the non-relational world, with CCA’s Model 204, and has found uses continuously since then, like bit-mapped indexing. Most times that there’s a fair BI matchup, as with some of the TPC benchmarks of the last seven years, columnar comes out well ahead. Under whatever name, columnar database technology is not going away. Therefore, its markets will continue to grow relative to row-oriented relational. For IT buyers, acquiring columnar BI solutions like HP’s Vertica is simply being smart and getting a little ahead of the curve.

1 comment:

Business Intelligence said...

I attempted to show the impact of columnar database technology on the basic premise of business intelligence - the ability to have business users perform ad-hoc analytics and reporting tasks over as much data as possible.