Wednesday, February 10, 2010

Whither Hybrid Columnar?

Oracle’s announcement of its “hybrid columnar compression” option for its Exadata product last summer clearly relates to the renewed attention paid to columnar databases over the last year by columnists such as Merv Adrian and Curt Monash. This example of a “hybrid” between columnar and row-oriented technology makes life yet more complicated for the IT buyer of data warehousing and business intelligence (BI) solutions. However, there does seem to be some agreement between the advocates of columnar and row-oriented that sheds some light on the appropriate places for each – and for hybrid.

Daniel Abadi in an excellent post summarizes the spectrum. If I understand his taxonomy correctly, row-oriented excels in “random reads” where a single record with more than 2-3 fields is accessed (or for single inserts and updates); columnar excels for queries across a large number of records whose columns (or “fields”) contain some commonality that makes for effective compression. The hybrids attempt to achieve 80% (my figure, plucked from the air) of the performance advantages of both.

To follow this thought a bit further, Daniel divides the hybrid technologies into block-oriented or “PAX”, fractured mirror, and fine-grained. The PAX approach stores in column format for particular disk blocks; the fractured mirror operates in a real-world disaster recovery environment and treats one copy of the data store as row-oriented, the other as column-oriented, sending transactions to either as appropriate; the fine-grained hybrid is similar to PAX, but compresses the columns of particular fields of particular tables, not a whole disk block. Oracle appears to be an example of the PAX approach, while Vertica has some features that appear to implement the fine-grained approach.

I would argue that the future belongs more to pure columnar or fractured mirror than to row-oriented or the other two flavors of hybrid. Here is my reasoning: data stores continue to scale in size by 50% a year; the proportion of storage devoted to main memory and solid-state devices (SSDs) is likewise going up. The disadvantage of columnar in “random reads” is therefore decreasing, as databases are increasingly effective at ensuring that the data accessed is already in fast-response storage. In other words, it’s not just the number of I/Os, it’s what you are inputting from.

There is another factor: as database size goes up, the disadvantages of row-oriented queries increase. As database size increases, the number of “random reads” does not necessarily increase, but the amount of data that must be accessed in the average query does necessarily increase. Compression applied across all columns and indexes increases its advantage over selective compression and no compression at all in this case, because there is less data to upload. And the “query from hell” that scans all of a petabyte data warehouse is not only the extreme case of this increasing advantage; scaling the system’s ability to handle such queries is often a prime concern of IT.

I would also argue that the same trends make the pure or fractured-mirror columnar database of increasing interest for operational data stores that combine online transaction processing with lots of updates and continuous querying and reporting. For updates, the competition between columnar and row-oriented is getting closer, as many of these updates involve only 1-3 fields/columns of a row, while updates are most likely to affect the most recent data and therefore the data increasingly likely to be in main-memory cache. For inserts/deletes, updating in-memory indexes immediately along with “write to disk later” means that the need of columnar for multiple I/Os need not be a major disadvantage in many cases. And for online backup, the compressed data of the columnar approach wins hands down.

My takeaway with regard to Oracle Hybrid Columnar Compression, therefore, is that over time your mileage may vary. I would not be surprised if, someday, Oracle moved beyond “disk-block hybrid” to a fractured mirror approach, and that such an approach took over many of the high-end tasks for which vanilla row-oriented Oracle Database 11g r2 is now the database of choice.