Note: this is a repost of an article published in Nov. 2011 in another venue, and the vendors profiled here have all upgraded their Big Data stories significantly since then. Imho, it remains useful as a starting point for assessing their Big Data strategies and for deciding how to implement one's own long-term IT Big Data strategy.
In recent weeks, Oracle, EMC, and IBM have issued announcements that begin to flesh out, in concrete solutions, their visions of Big Data, its opportunities, and its best practices. Each of these vendor solutions has significant advantages for particular users, and all three are works in progress.
However, as presented so far, Oracle’s and EMC’s approaches appear to have significant limitations compared to IBM’s. If the pair continues to follow the same Big Data strategies, I believe that many of their customers will find themselves significantly hampered in dealing with certain types of Big Data analysis over the long run – an experience somewhat like voluntarily restricting yourself to one leg while running a marathon.
Let’s start by reviewing the promise and architectural details of Big Data, then take a look at each vendor’s strategy in turn.
Big Data, Big Challenges
As I noted in an earlier piece, some of Big Data is a relabeling of the incessant scaling of existing corporate queries and their extension to internal semi-structured (e.g., corporate documents) and unstructured (e.g., video, audio, graphics) data. The part that matters the most to today’s enterprise, however, is the typically unstructured data that is an integral part of customers’ social-media channel – including Facebook, instant messaging, Twitter, and blogging. This is global, enormous, very fast-growing, and increasingly integral (according to IBM’s recent CMO survey) to the vital corporate task of engaging with a "customer of one" throughout a long-term relationship.
However, technically, handling this kind of Big Data is very different from handling a traditional data warehouse. Access mechanisms such as Hadoop/MapReduce combine open-source software, large numbers of small, PC-class servers, and a loosening of consistency constraints on distributed transactions (an approach called eventual consistency). The basic idea is to apply Big Data analytics to queries where it doesn’t matter if some users get "old" rather than the latest data, or if some users get an answer while others don’t. As a practical matter, this type of analytics is also prone to unexpected unavailability of data sources.
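To make the programming model concrete, here is a minimal sketch of the kind of job such a cluster runs: a Hadoop-Streaming-style mapper and reducer, written here in Python, that count brand mentions across a feed of social-media posts. The tab-separated record layout and the brand list are illustrative assumptions on my part, not any vendor's actual format.

```python
#!/usr/bin/env python
# mapper.py -- emit one (brand, 1) pair for every tracked brand mentioned in a post.
# Assumes each input line is a tab-separated record whose last field is the post
# text (an illustrative layout, not any particular product's schema).
import sys

BRANDS = {"acme", "examplecorp"}  # hypothetical brands to track

for line in sys.stdin:
    text = line.rstrip("\n").split("\t")[-1].lower()
    for word in text.split():
        token = word.strip(".,!?#@")
        if token in BRANDS:
            print("%s\t1" % token)
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts for each brand. Hadoop delivers mapper output
# sorted by key, so a running total per key is all that is needed.
import sys

current, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = key, 0
    total += int(value)
if current is not None:
    print("%s\t%d" % (current, total))
```

The point is less the code than the shape of the work: each mapper sees only its local slice of the data, the framework shuffles the partial results, and no step requires a globally consistent view – which is exactly why eventual consistency is tolerable here.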
The enterprise cannot treat this data as just another BI data source. It differs fundamentally in that the enterprise can be far less sure that the data is current – or even available at all times. So, scheduled reporting or business-critical computing based on Big Data is much more difficult to pull off. On the other hand, this is data that would otherwise be unavailable for BI or analytics processes – and because of the approach to building solutions, should be exceptionally low-cost to access.
However, pointing the raw data at existing BI tools would be like pointing a fire hose at your mouth, with similarly painful results. Instead, the savvy IT organization will have plans in place to filter Big Data before it begins to access it.
Filtering is not the only difficulty. For many or most organizations, Big Data is of such size that simply moving it from its place in the cloud into an internal data store can take far longer than the mass downloads that traditionally lock up a data warehouse for hours. In many cases, it makes more sense to query on site and then pass the much smaller result set back to the end user. And as the world of cloud computing keeps evolving, the boundary between "query on site" and "download and query" keeps changing.
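A toy contrast illustrates the trade-off. This is a generic sketch under my own assumptions, not any vendor's API: when the filter or aggregation runs where the data lives, only a small result set crosses the wire; when it does not, the entire raw feed does.

```python
# Toy illustration of "query on site" vs. "download and query".
# REMOTE_POSTS stands in for an enormous cloud-resident store; in real life
# the download step, not the arithmetic, is what dominates the cost.

REMOTE_POSTS = [
    {"user": "u1", "text": "loving my new acme phone"},
    {"user": "u2", "text": "the weather is dull today"},
    {"user": "u3", "text": "acme support was great"},
]

def download_and_query():
    """Pull every raw record across the network, then filter locally."""
    local_copy = list(REMOTE_POSTS)      # the expensive mass download
    return sum("acme" in p["text"] for p in local_copy)

def query_on_site():
    """Run the filter where the data lives; only one number crosses the wire."""
    return sum("acme" in p["text"] for p in REMOTE_POSTS)

print(download_and_query(), query_on_site())  # same answer, very different cost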
So how are Oracle, EMC, and IBM dealing with these challenges?
Oracle: We Control Your Vertical
Reading the press releases from Oracle OpenWorld about Oracle’s approach to Big Data reminds me a bit of the old TV show Outer Limits, which typically began with a paranoia-inducing voice intoning "We control your horizontal, we control your vertical …" as the screen began flickering.
Oracle’s announcements included specific mechanisms for mass downloads to Oracle Database data stores in Oracle appliances (Oracle Loader for Hadoop), so that Oracle Database could query the data side-by-side with existing enterprise data, complete with data-warehouse data-cleansing mechanisms.
The focus is on the Oracle Exalytics BI Machine, which combines Oracle Database 11g and Oracle’s TimesTen in-memory database for additional BI scalability. In addition, there is a "NoSQL" database that claims to provide "bounded latency" (i.e., it limits the "eventual" in "eventual consistency"), although how it is meant to combine with Oracle’s appliances was not clearly stated.
The obvious advantage of this approach is integration, which should deliver additional scalability on top of Oracle Database’s already-strong scalability. Whether that will be enough to make the huge leap to handling hundred-petabyte data stores that change frequently remains to be seen.
At the same time, these announcements implicitly suggest that Big Data should be downloaded to Oracle databases in the enterprise, or users should access Big Data via Oracle databases running in the cloud, but provide no apparent way to link cloud and enterprise data stores or BI. To put it another way, Oracle is presenting a vision of Big Data used by Oracle apps, accessed by Oracle databases using Oracle infrastructure software and running on Oracle hardware with no third party needed. We control your vertical, indeed.
What also concerns me about the company’s approach is that there is no obvious mechanism either for dealing with the lateness, unavailability, or poor quality of Big Data, or for choosing the optimal mix of cloud and in-enterprise data locations. There are only hints: the bounded latency of Oracle NoSQL Database, or the claim that Oracle Data Integrator with Application Adapter for Hadoop can combine Big Data with Oracle Database data – in Oracle Database format and in Oracle Database data stores. We control your horizontal, too. But how well are we controlling it? We’ll get back to you on that.
EMC: Competing with the Big Data Boys
The recent EMC Forum in Boston in many ways delivered what I regard as very good news for the company and its customers. In the case of Big Data, its acquisition of Greenplum, with its BI capabilities, led the way; Greenplum finally appears to have provided EMC with the data-management smarts it has always needed to be a credible global information management solutions vendor. In particular, Greenplum is apparently placing analytical intelligence in EMC hardware and software, giving EMC a great boost in areas such as querying within the storage device and monitoring distributed systems (such as VCE’s Vblocks) for administrative purposes. These are clearly leading-edge, valuable features.
EMC’s Greenplum showed itself to be a savvy supporter of Big Data. It supports the usual third-party suspects for BI: SQL, MapReduce, and SAS among others. Querying is "software shared-nothing", running in virtual machines on commodity VBlocks and other scale-out/grid x86 hardware. Greenplum has focused on the fast-deploy necessities of the cloud, claiming a 25-minute data-model change – something that has certainly proved difficult in large-scale data warehouses in the past.
Like Oracle, Greenplum offers "mixed" columnar and row-based relational technology; unlike Oracle, it is tuned automatically, rather than leaving it up to the customer how to mix and match the two. However, its answer for combining Big Data and enterprise data is also tactically similar to Oracle’s: download into the Greenplum data store.
Of our three vendors, EMC via Greenplum has been the most concrete about what one can do with Big Data, offering specific support for combining social graphing, customer-of-one tracking of Twitter/Facebook posts, and mining of enterprise customer data. The actual demo had an unfortunate "1984" flavor, however, with a customer’s casual chat about fast cars being used to help justify doubling his car insurance rate.
The bottom line with Greenplum appears to be that its ability to scale is impressive, even with Big Data included, and it is highly likely to provide benefits out of the box to the savvy social-media analytics implementer. Still, it avoids, rather than solves, the problems of massive querying across Big Data and relational technology – it assumes massive downloads are possible and that the data is "low latency" and "clean," when in many cases this will not be so. EMC Greenplum is not as "one-vendor" a solution as Oracle’s, but it does not have the scalability and robustness track record of Oracle, either.
IBM: Avoiding Architectural Lock-In
At first glance, IBM appears to have a range of solutions for Big Data similar to Oracle and EMC – but more of them. Thus, it has the Netezza appliance; it has InfoSphere BigInsights for querying against Hadoop; it has the ability to download data into both its Informix/in-memory technology and DB2 databases for in-enterprise data warehousing; and it offers various Smart Analytics System solutions as central BI facilities.
Along with these, it provides Master Data Management (InfoSphere MDM), data-quality features (InfoSphere DataStage), InfoSphere Streams for querying against streaming Web sensor data (like mobile GPS), and the usual quick-deployment models and packaged hardware/software solutions on its scale-out and scale-up platforms. And, of course, everyone has heard of Watson – although, intriguingly, its use cases are not yet clearly pervasive in Big Data implementations.
To my mind, however, the most significant difference in IBM’s approach to Big Data is that it offers explicit support for a wide range of ways to combine Big-Data-in-place and enterprise data in queries. For example, IBM’s MDM solution allows multiple locations for customer data and supports synchronization and replication of linked data. Effectively used, this allows users to run alerts against Big Data in remote clouds or dynamically shift customer querying between private and public clouds to maximize performance. And, of course, the remote facility need not involve an IBM database, because of the MDM solution’s cross-vendor "data virtualization" capabilities.
Even IBM’s traditional data warehousing solutions are joining the fun. The IBM IOD conference introduced the idea of a "logical warehouse", which moves away from the idea of a single system or cluster that contains the enterprise’s "one version of the truth" and towards the idea of a "truth veneer" that looks like a data warehouse from the point of view of the analytics engine but is actually multiple operational, data-warehouse, and cloud data stores. And, of course, IBM’s Smart Analytics Systems run on System x (x86), Power (RISC) and System z (mainframe) hardware.
On the other hand, there are no clear IBM guidelines for optimizing an architecture that combines traditional enterprise BI with Big Data. It gives one the strong impression that IBM is providing customers with a wide range of solutions, but little guidance as to how to use them. That IBM does not move the customer towards an architecture that may prevent effective handling of certain types of Big-Data queries is good news; that IBM does not yet convey clearly how these queries should be handled, not so much.
Composite Software and the Missing Big-Data Link
One complementary technology that users might consider is data virtualization (DV), as provided by vendors such as Composite Software and Denodo. In these solutions, subqueries on data of disparate types (such as Big Data and traditional) are optimized flexibly and dynamically, with due attention to "dirty" or unavailable data. DV solution vendors’ accrued wisdom can be simply summed up: usually, querying on-site instead of doing a mass download is better.
How to deal with the "temporal gap" between "eventually consistent" Big Data and "hot off the press" enterprise sales data is a matter for customer experimentation and fine-tuning, but that customer can always decide what to do with full assurance that subqueries have been optimized and data cleansed in the right way.
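To show the shape of the idea – a minimal sketch under my own assumptions, not Composite's or Denodo's actual API – a data-virtualization layer splits a federated query into subqueries against disparate sources and degrades gracefully when one of them is stale or unavailable:

```python
# Generic sketch of a federated "customer view" assembled by a DV layer.
# The source functions, field names, and customer ID are hypothetical.

def warehouse_revenue(customer_id):
    """Subquery against the 'hot off the press' enterprise data warehouse."""
    return {"revenue": 12000.0, "as_of": "today"}

def social_sentiment(customer_id):
    """Subquery against an eventually consistent Big Data store in the cloud."""
    raise RuntimeError("social store temporarily unavailable")  # simulated outage

def federated_customer_view(customer_id):
    view = {"customer": customer_id}
    for name, subquery in [("sales", warehouse_revenue),
                           ("sentiment", social_sentiment)]:
        try:
            view[name] = subquery(customer_id)
        except Exception as err:
            # Mark the gap rather than failing the whole query, so the analyst
            # can see that this part of the "truth" is stale or missing.
            view[name] = {"unavailable": str(err)}
    return view

print(federated_customer_view("C-1001"))
```

The design choice that matters is in the except branch: the query completes with the gaps labeled, rather than either failing outright or silently presenting stale social data as current.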
The Big Data Bottom Line
To me, the bottom line of all of this Big Data hype is that there is indeed immediate business value in there, and specifically in being able to go beyond the immediate customer interaction to understand the customer as a whole and over time – and thereby to establish truly win-win long-term customer relationships. Simply by looking at the social habits of key consumer "ultimate customers," as the Oracle, EMC, and IBM Big Data tools already allow you to do, enterprises of all sizes can fine-tune their interactions with the immediate customer (B2B or B2C) to be far more cost-effective.
However, with such powerful analytical insights, it is exceptionally easy to shoot oneself in the foot. Even skipping quickly over recent reports, I can see anecdotes of "data overload" that paralyzes employees and projects, trigger-happy real-time inventory management that actually increases costs, unintentional breaches of privacy regulations, punitive use of newly public consumer behavior that damages the enterprise’s brand or perceived "character", and "information overload" on the part of the corporate strategist.
The common theme running through these user stories is a lack of vendor-supplied context that would allow the enterprise to understand how to use the masses of new Big Data properly, and especially an understanding of the limitations of the new data.
Thus, in the long run, the best performers will seek a Big-Data analytics architecture that is flexible, handles the limitations of Big Data as the enterprise needs them handled, and allows a highly scalable combination of Big Data and traditional data-warehouse data. So far, Oracle and EMC seem to be urging customers somewhat in the wrong direction, while IBM is providing a "have it your way" solution; but all of their solutions could benefit strongly from giving better optimization and analysis guidance to IT.
In the short run, users will do fine with a Big-Data architecture that does not provide an infrastructure support "leg" for a use case that they do not need to consider. In the long run, the lack of that "leg" may be a crippling handicap in the analytics marathon. IT buyers should carefully consider the long-run architectural plans of each vendor as they develop.