Tuesday, May 18, 2010

EMC World: The Dedupe Revolution

What, to my mind, was the biggest news out of EMC World? The much-touted Private Cloud? Don’t think so. The message that, as one presenter put it, “Tape Sucks”? Sorry. FAST Cache, VPLEX, performance boosts, cost cuts? Not this time. No, what really caught my attention was a throw-away slide showing that nearly half of EMC customers have already adopted some form of deduplication technology, and that within the next couple of years a majority of all business storage users will probably have done so.
Why do I think this is a big deal? Because technology related to deduplication holds the potential of delivering benefits greater than cloud; and user adoption of deduplication indicates that computer markets are ready to implement that technology. Let me explain.
First of all, let’s be clear about what “dedupe”, as EMC calls it, and the related technology of compression mean to me. In its original, technical sense, deduplication means removing duplicates in data. Compression, technically, means removing “noise” -- in the information-theory sense of removing bits that aren’t necessary to convey the information. Thus, for example, storing only one occurrence of the word “the” in a document would be deduplication; using the Soundex algorithm to represent “the” in two bytes would be compression.
However, today’s popular “compression” products often use technical deduplication as well; for example, columnar databases compress data via techniques such as bit-mapped indexing, and also de-duplicate column values within a table. Likewise, data deduplication products may apply compression techniques to shrink the storage size of data blocks that have already been deduplicated. So when we refer to “dedupe”, it often includes compression, and when we refer to compressed data, it often has been deduplicated as well. To try to avoid confusion, I use “dedupe” and “compress” to refer to the products, and deduplication and compression to refer to the underlying techniques.
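To make the distinction concrete, here is a minimal sketch, in Python, of how a hypothetical combined “dedupe”-plus-“compress” pipeline might work: identical blocks are stored only once (deduplication), and the unique blocks that remain are then shrunk with a general-purpose compressor (compression). This is purely illustrative; it is not how EMC’s or any other vendor’s product is implemented, and real products use far more sophisticated variable-length chunking and encodings.

    import hashlib
    import zlib

    def dedupe_and_compress(data: bytes, block_size: int = 4096):
        # Deduplication: split into fixed-size blocks and keep each unique
        # block only once, identified by its SHA-256 fingerprint.
        # Compression: shrink each unique block that survives.
        store = {}    # fingerprint -> compressed unique block
        recipe = []   # ordered fingerprints needed to rebuild the original
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            fp = hashlib.sha256(block).hexdigest()
            if fp not in store:
                store[fp] = zlib.compress(block)
            recipe.append(fp)
        return store, recipe

    def redupe(store, recipe):
        # "Redupe": reassemble the original bytes from the stored blocks.
        return b"".join(zlib.decompress(store[fp]) for fp in recipe)

    # Highly redundant input: the same 4 KB block repeated 1,000 times.
    original = (b"the quick brown fox " * 205)[:4096] * 1000
    store, recipe = dedupe_and_compress(original)
    assert redupe(store, recipe) == original
    print(len(original), "bytes in;", sum(len(v) for v in store.values()),
          "bytes of unique, compressed blocks kept")

In this toy case almost all of the savings comes from deduplication; on less repetitive data, the compression step does more of the work.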
When I state that there is an upcoming “dedupe revolution”, I really mean that deduplication and compression combined can promise a new way to improve not only backup/restore speed, but also transaction processing performance. Because, up to now, “dedupe” tools have been applied across SANs (storage area networks), while “compress” tools are per-database, “dedupe” products simply offer a quicker path than “compress” tools to achieving these benefits globally, across an enterprise.
These upcoming “dedupe” products are perhaps best compared to a sponge compressor that squeezes all excess “water” out of all parts of a data set. That means not only removing duplicate files or data records, but also removing identical elements within the data, such as all frames from a loading-dock video camera that show nothing going on. Moreover, it means compressing the data that remains, such as documents and emails whose verbiage can be encoded in a more compact form. When you consider that semi-structured or unstructured data such as video, audio, graphics, and documents makes up 80-90% of corporate data, and that the most “soggy” data types such as video use up the most space, you can see why some organizations are reporting up to 97% storage-space savings (70-80%, more conservatively) where “dedupe” is applied. And that doesn’t include some of the advances in structured-data storage, such as the columnar databases referred to above that “dedupe” columns within tables.
So, what good is all this space saving? Note that the storage capacity users demand has been growing by 50-60% a year, consistently, for at least the last decade. Today’s “dedupe” may not be appropriate for all storage; but where it is, it is equivalent to setting back the clock 4-6 years. Canceling four years of storage acquisition is certainly a cost-saver. Likewise, backing up and restoring “deduped” data means sending far less over the network (and the acts of deduplicating and reduplicating during this process add back only a fraction of the time saved), so backup windows and overhead shrink, and recovery is faster. Still, those are not the real reasons that “dedupe” has major long-term potential.
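As a rough sanity check on that “setting back the clock” claim, here is a back-of-envelope calculation; the growth rates and savings figures are simply the ranges cited above, and the compounding model is my own simplification.

    import math

    def years_rolled_back(space_savings: float, annual_growth: float) -> float:
        # If storage demand compounds at annual_growth per year, how many
        # years of growth does a given fractional space savings offset?
        shrink_factor = 1.0 / (1.0 - space_savings)   # e.g. 80% savings -> 5x
        return math.log(shrink_factor) / math.log(1.0 + annual_growth)

    for savings in (0.70, 0.80, 0.97):
        for growth in (0.50, 0.60):
            print(f"{savings:.0%} savings at {growth:.0%}/yr growth: "
                  f"about {years_rolled_back(savings, growth):.1f} years")

By this simple model, the conservative 70-80% savings offsets roughly two and a half to four years of growth, and the 97% best cases seven or more, which is in the same ballpark as the 4-6 year figure.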
No, the real long-run reason that storage space saving matters is that it speeds retrieval of a given piece of data from disk/tape/memory, storage of it back to disk/tape/memory, and even processing of it. Here, the recent history of “compress” tools is instructive. Until a few years ago, the effort of compressing and uncompressing tended to mean that compressed data actually took longer to retrieve, process, and re-store; but, as relational and columnar database users have found out, certain types of “compress” tools now actually improve performance, sometimes by an order of magnitude. For example, vendors such as IBM have recently reported that relational databases such as DB2 see performance benefits from using “compressed” data. Columnar databases are showing that it is possible to operate on data-warehouse data in “compressed” form, except when it actually must be shown to the user, and thereby get major performance improvements.
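One concrete (and simplified) illustration of operating on “compressed” data: with run-length encoding, one of the techniques columnar databases apply to sorted, repetitive columns, a filter or count can be evaluated on the runs themselves, without ever materializing the raw rows. The sketch below is my own toy example, not any particular vendor’s engine.

    from itertools import groupby

    def rle_encode(values):
        # Run-length encode a column as (value, run_length) pairs.
        return [(v, len(list(g))) for v, g in groupby(values)]

    def count_where(runs, predicate):
        # Evaluate the filter once per run, no matter how many rows each
        # run stands for -- the data is never "reduped".
        return sum(length for value, length in runs if predicate(value))

    # A sorted "region" column: 1,000,000 rows collapse into 3 runs.
    region_column = ["EAST"] * 500_000 + ["WEST"] * 300_000 + ["NORTH"] * 200_000
    runs = rle_encode(region_column)
    print(count_where(runs, lambda r: r == "WEST"))   # 300000, from 3 comparisons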
So what is my vision of the future of “dedupe”? What sort of architecture are we talking about, 3-5 years from now? One in which the storage tiers below fast disk (and, someday, all the tiers, all the way to main memory) have “dedupe”-type technology added to them. In this context, it was significant that EMC chose at EMC World to trumpet “dedupe” as a replacement for Virtual Tape Libraries (VTL). Remember, VTL typically allows read/query access to older, rarely accessed data within a minute; so, clearly, deduped data on disk can be reduped and accessed at least as fast. Moreover, as databases and applications begin to develop the ability to operate on “deduped” data without the need for “redupe”, the average performance of a “deduped” tier will inevitably catch up with and surpass that of one which has no deduplication or compression technology.
Let’s be clear about the source of this performance speedup. Let us say that all data is deduplicated and compressed, taking up 1/5 as much space, and that all operations can be carried out on “deduped” data instead of its “raw” equivalents. Then retrieval from any tier will be 5 times as fast, and 5 times as much data can be stored in the next higher tier for even more performance gains. Processing this smaller data will take 1/2 to 1/5 as much time. Adding all three together and ignoring the costs of “dedupe”/“redupe”, a 50% speedup of an update and an 80% speedup of a large query seem conservative. Moreover, the system will need “dedupe”/“redupe” only rarely: “dedupe” when the data is first stored, and “redupe” whenever the data is displayed to a user in a report or query response. Because the task could also be offloaded to specialty “dedupe”/“redupe” processors, on average “dedupe”/“redupe” should add only minimal performance overhead to the system, and should subtract less than 10% from the speedup cited above. So, conservatively, I estimate the performance speedup from this future “dedupe” at 40-70%.
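To show where a 40-70% figure can come from, here is an Amdahl’s-law-style back-of-envelope model. The I/O and processing shares of each workload, and the flat 10% allowance for “dedupe”/“redupe” overhead, are my own illustrative assumptions, not measured numbers.

    def fraction_of_time_saved(io_share, cpu_share, io_speedup, cpu_speedup,
                               redupe_overhead=0.10):
        # Each component of the workload shrinks by its own factor;
        # a flat allowance is added back for dedupe/redupe work.
        new_time = io_share / io_speedup + cpu_share / cpu_speedup + redupe_overhead
        return 1.0 - new_time

    # Update-style work: half I/O (5x faster on 1/5-size data), half processing (2x).
    print(f"update-style: {fraction_of_time_saved(0.5, 0.5, 5.0, 2.0):.0%} saved")
    # Large query: mostly scan I/O, with processing also done on compressed data.
    print(f"query-style:  {fraction_of_time_saved(0.8, 0.2, 5.0, 5.0):.0%} saved")

With these made-up shares the model lands at roughly 55% and 70% savings; nudging the shares and the overhead around keeps the answer within the 40-70% band cited above.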
What effect is this going to have on IT, assuming that “the dedupe revolution” begins to arrive 1-2 years from now? First, it will mean that, 3-5 years out, the majority of storage, rather than a minority replacing some legacy backup, archive, or active-passive disaster-recovery storage, will benefit from turning the clock back 4-6 years via “dedupe.” Practically speaking, performance will improve dramatically and storage space per data item will shrink drastically, even as the amount of information stored continues its rapid climb; and the gains show up not just in data access, but also in networking. Data handled in deduped form everywhere within a system also has interesting effects on security: the compression within “dedupe” is a sort of quick-and-dirty encryption that can make data pilferage by those who are not expert in “redupe” fairly difficult. Storage costs per bit of information stored will drop sharply; storage upgrades can be deferred, and server and network upgrades slowed. When you add up all of those benefits, from my point of view, “the dedupe revolution” in many cases does potentially more for IT than the incremental benefits often cited for cloud.
Moreover, implementing “dedupe” is simply a matter of a software upgrade to any tier: memory, SSD, disk, or tape. So, getting to “the dedupe revolution” potentially requires much less IT effort than getting to cloud.
One more startling effect of dedupe: you can throw many of your comparative TCO studies right out the window. If I can use “dedupe” to store the same amount of data on 20% as much disk as my competitor, with 2-3 times the performance, the TCO winner will not be the one with the best processing efficiency or the greatest administrative ease of use, but the one with the best-squeezing “dedupe” technology.
What has been holding us back in the last couple of years from starting on the path to dedupe Nirvana, I believe, is customers’ wariness of a new technology. The EMC World slide definitively establishes that this concern is going away, and that there’s a huge market out there. Now, the ball is in the vendors’ court. This means that all vendors, not just EMC, will be challenged to merge storage “dedupe” and database “compress” technology to improve the average data “dry/wet” ratio, and “dedupify” applications and/or I/O to ensure more processing of data in its “deduped” state. (Whether EMC’s Data Domain acquisition complements its Avamar technology completely or not, the acquisition adds the ability to apply storage-style “dedupe” to a wider range of use cases; so EMC is clearly in the hunt). Likewise, IT will be challenged to identify new tiers/functions for “dedupe,” and implement the new “dedupe” technology as it arrives, as quickly as possible. Gentlemen, start your engines, and may the driest data win!