Wednesday, June 29, 2011

Big Data, MapReduce, Hadoop, NoSQL: Pay No Attention to the Relational Technology Behind the Curtain

One of the more interesting features of vendors’ recent marketing push to sell BI and analytics is the emphasis on the notion of Big Data, often associated with NoSQL, Google MapReduce, and Apache Hadoop – without a clear explanation of what these are, or where they are useful. It is as if we were back in the days of “checklist marketing”, where the aim of a vendor like IBM or Oracle was to convince you that if competitors’ products didn’t support a long list of features, those competitors could not provide the cradle-to-grave support you needed to survive computing’s fast-moving technology. As it turned out, many of those features were unnecessary in the short run, and a waste of money in the long run; remember rules-based AI? Or so-called standard UNIX? The technology in those features was later used quite effectively in other, more valuable pieces of software, but the value-add of the features themselves turned out to be illusory.

As it turns out, we are not back in those days, and Big Data via Hadoop and NoSQL does indeed have a part to play in scaling Web data. However, I find that IT buyer misunderstandings of these concepts may indeed lead to much wasted money, not to mention serious downtime. These misunderstandings stem from a common source: marketing’s failure to explain how Big Data relates to the relational databases that have fueled almost all data analysis and data-management scaling for the last 25 years. It resembles the scene in The Wizard of Oz where a small man, trying to sell himself as a powerful wizard by manipulating stage machines from behind a curtain, becomes so wrapped up in the production that when someone notes “There’s a man behind the curtain”, he shouts “Pay no attention to the man behind the curtain!” In this case, marketers are shouting so loudly about the virtues of Big Data, new data management tools, and “NoSQL” that they fail to note the extent to which relational technology is complementary to, necessary to, or simply the basis of, the new features.

So here is my understanding of the present state of the art in Big Data, and the ways in which IT buyers should and should not seek to use it as an extension of their present (relational) BI and information management capabilities. As it turns out, when we understand both the relational technology behind the curtain and the ways it has been extended, we can do a much better job of applying Big Data to long-term IT tasks.

The best way to understand the place of Hadoop in the computing universe is to view the history of data processing as a constant battle between parallelism and concurrency. Think of the database as a data store plus a protective layer of software that is constantly being bombarded by transactions – and often, another transaction on a piece of data arrives before the first is finished. To handle all the transactions, databases have two choices at each stage in computation: parallelism, in which two transactions are literally being processed at the same time, and concurrency, in which a processor switches between the two rapidly in the middle of the transaction. Pure parallelism is obviously faster; but to avoid inconsistencies in the results of the transaction, you often need coordinating software, and that coordinating software is hard to operate in parallel, because it involves frequent communication between the parallel “threads” of the two transactions.
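The parallelism-versus-concurrency tension can be sketched in a few lines of Python. This is a toy illustration, not database internals: two “transactions” update the same piece of data, and a lock plays the role of the coordinating software that keeps the result consistent at the cost of serializing the critical section.

```python
import threading

# Two "transactions" update the same piece of data. Without coordination,
# interleaved read/write pairs can lose updates; the lock serializes the
# critical section, trading raw parallelism for a consistent result.

balance = 0
lock = threading.Lock()

def transaction(n_updates):
    global balance
    for _ in range(n_updates):
        with lock:                 # the coordinating software
            current = balance      # read
            balance = current + 1  # write back

threads = [threading.Thread(target=transaction, args=(10_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(balance)  # 20000 -- consistent, because the lock coordinated every access
```

Remove the lock and the read-then-write pairs can interleave, silently dropping updates – which is exactly why parallel transaction processing needs coordination, and why that coordination limits how far pure parallelism can go.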

At a global level (like that of the Internet) the choice now translates into a choice between “distributed” and “scale-up” single-system processing. As it happens, back in graduate school I did a calculation of the relative performance merits of tree networks of microcomputers versus machines with a fixed number of parallel processors, which provides some general rules. There are two key factors that are relevant here: “data locality” and “number of connections used” – which means that you can get away with parallelism if, say, you can operate on a small chunk of the overall data store on each node, and if you don’t have to coordinate too many nodes at one time.

Enter the problems of cost and scalability. The server farms that grew like Topsy during Web 1.0 had hundreds or thousands of PC-like servers that were set up to handle transactions in parallel. This had obvious cost advantages, since PCs were far cheaper; but data locality was a problem in trying to scale, since even when data was partitioned correctly in the beginning between clusters of PCs, over time data copies and data links proliferated, requiring more and more coordination. Meanwhile, in the High Performance Computing (HPC) area, grids of PC-type small machines operating in parallel found that scaling required all sorts of caching and coordination “tricks”, even when, by choosing the transaction type carefully, the user could minimize the need for coordination.

For certain problems, however, relational databases designed for “scale-up” systems and structured data did even less well. For indexing and serving massive amounts of “rich-text” (text plus graphics, audio, and video) data like Facebook pages, for streaming media, and of course for HPC, a relational database would insist on careful consistency between data copies in a distributed configuration, and so could not squeeze the last ounce of parallelism out of these transaction streams. And so, to squeeze costs to a minimum, and to maximize the parallelism of these types of transactions, Google, the open source movement, and various others turned to MapReduce, Hadoop, and various other non-relational approaches.

These efforts combined open-source software, typically related to Apache; large numbers of small, PC-type servers; and a loosening of consistency constraints on distributed transactions – an approach called eventual consistency. The basic idea was to minimize coordination by identifying types of transactions where it didn’t matter if some users got “old” rather than the latest data, or if some users got an answer while others didn’t. As a communication from Pervasive Software about an upcoming conference notes, a study of one implementation found 60 unexpected-unavailability “interruptions” in 500 days – certainly not up to the standards of the typical business-critical operational database, but also not an overriding concern to users.
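The eventual-consistency trade-off is easy to see in a toy model (a sketch only; real systems like Dynamo-style stores use vector clocks, quorums, and anti-entropy protocols). A write is acknowledged after reaching one replica and propagates to the others later, so a read in that window can return stale data:

```python
import collections

# Toy model of eventual consistency: a write is acknowledged after reaching
# one replica and propagates to the others later. A read in the window
# between write and propagation returns stale data -- acceptable for many
# Web workloads, unacceptable for a business-critical operational database.

class EventuallyConsistentStore:
    def __init__(self, n_replicas=3):
        self.replicas = [dict() for _ in range(n_replicas)]
        self.pending = collections.deque()  # replication log

    def write(self, key, value):
        self.replicas[0][key] = value       # acknowledge after one replica
        self.pending.append((key, value))

    def read(self, key, replica=0):
        return self.replicas[replica].get(key)

    def replicate(self):
        # background "anti-entropy" pass: push pending writes everywhere
        while self.pending:
            key, value = self.pending.popleft()
            for r in self.replicas:
                r[key] = value

store = EventuallyConsistentStore()
store.write("profile:42", "new photo")
print(store.read("profile:42", replica=2))  # None -- stale read, old data
store.replicate()
print(store.read("profile:42", replica=2))  # new photo -- replicas converged
```

The payoff is that writes never wait on cross-replica coordination – which is precisely the parallelism these systems are squeezing out.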

The eventual consistency part of this overall effort has sometimes been called NoSQL. However, Wikipedia notes that in fact it might correctly be called NoREL, meaning “for situations where relational is not appropriate.” In other words, Hadoop and the like by no means exclude all relational technology, and many of them concede that relational “scale-up” databases are more appropriate in some cases even within the broad overall category of Big Data (i.e., rich-text Web data and HPC data). And, indeed, some implementations provide extended-SQL or SQL-like interfaces to these non-relational databases.

Where Are the Boundaries?
The most popular “spearhead” of Big Data, right now, appears to be Hadoop. As noted, it couples a MapReduce engine for data-intensive applications – which divides nodes into a master coordinator and slave task executors – with the Hadoop Distributed File System (HDFS), which clusters multiple machines into a common store, and therefore allows parallel scaling of transactions against rich-text data such as some social-media data. It operates by dividing a “task” into “sub-tasks” that it hands out redundantly to back-end servers, which all operate in parallel (conceptually, at least) on a common data store.
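The divide-and-recombine pattern can be sketched in-process in a few lines of Python – a conceptual illustration of the MapReduce style Hadoop implements, not the Hadoop API itself, using the classic word-count example:

```python
import collections

# Minimal in-process sketch of the MapReduce pattern: the "task" (count
# words) is split into map sub-tasks over chunks of the data, and a reduce
# step recombines the per-chunk partial results. In Hadoop, each map_phase
# call would run on a separate slave node against HDFS blocks.

def map_phase(chunk):
    counts = collections.Counter()
    for line in chunk:
        counts.update(line.split())
    return counts  # partial result from one sub-task

def reduce_phase(partials):
    total = collections.Counter()
    for p in partials:
        total.update(p)  # recombine the sub-task results
    return total

documents = ["big data big", "data hadoop", "big hadoop hadoop"]
chunks = [documents[0:2], documents[2:3]]   # hand sub-tasks to "workers"
partials = [map_phase(c) for c in chunks]   # these could run in parallel
result = reduce_phase(partials)
print(result["big"], result["hadoop"])  # 3 3
```

Note that the map calls need no coordination with one another at all; only the reduce step recombines results – which is why the metadata supporting that recombination becomes the scaling bottleneck discussed below.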

As it turns out, there are also limits even on Hadoop’s eventual-consistency type of parallelism. In particular, it now appears that the metadata that supports recombination of the results of “sub-tasks” must itself be “federated” across multiple nodes, for both availability and scalability purposes. And Pervasive Software notes that its own investigations show that using multiple-core “scale-up” nodes for the sub-tasks improves performance compared to proliferating yet more distributed single-processor PC servers. In other words, the most scalable system, even in Big Data territory, is one that combines strict and eventual consistency, parallelism and concurrency, distributed and scale-up single-system architectures, and NoSQL and relational technology.

Solutions like Hadoop are effectively out there “in the cloud” and therefore outside the enterprise’s data centers. Thus, there are fixed and probably permanent physical and organizational boundaries between IT’s data stores and those serviced by Hadoop. Moreover, it should be apparent from the above that existing BI and analytics systems will not suddenly convert to Hadoop files and access mechanisms, nor will “mini-Hadoops” suddenly spring up inside the corporate firewall and create havoc with enterprise data governance. The use cases are too different.

The remaining boundaries – the ones that should matter to IT buyers – are those between existing relational BI and analytics databases and data stores and Hadoop’s file system and files. And here is where “eventual consistency” really matters. The enterprise cannot treat this data as just another BI data source. It differs fundamentally in that the enterprise can be far less sure that the data is up to date – or even available at all times. So scheduled reporting or business-critical computing based on this data is much more difficult to pull off.

On the other hand, this is data that would otherwise be unavailable – and because of the low-cost approach to building the solution, should be exceptionally low-cost to access. However, pointing the raw data at existing BI tools is like pointing a fire hose at your mouth. The savvy IT organization needs to have plans in place to filter the data before it begins to access it.
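What might such a filtering plan look like? Here is one hedged sketch – the field names and thresholds are hypothetical, but the shape is the point: apply cheap streaming filters for completeness and trust before any record reaches a BI tool.

```python
# Sketch of "filter before you drink": rather than pointing raw Big Data
# output at existing BI tools, screen records for completeness and trust
# first. Field names and the confidence threshold are hypothetical.

def filter_stream(records, required_fields, min_confidence=0.8):
    """Yield only records complete and trusted enough for BI consumption."""
    for rec in records:
        if not all(f in rec for f in required_fields):
            continue                        # drop incomplete records
        if rec.get("confidence", 0.0) < min_confidence:
            continue                        # drop possibly-stale records
        yield rec

raw = [
    {"user": "a", "clicks": 10, "confidence": 0.95},
    {"user": "b", "confidence": 0.99},              # missing "clicks"
    {"user": "c", "clicks": 7, "confidence": 0.50}, # low trust -- stale?
]
clean = list(filter_stream(raw, required_fields=("user", "clicks")))
print(len(clean))  # 1 -- only the complete, trusted record survives
```

The eventual-consistency caveat is what makes a trust score (however it is derived) worth carrying alongside the data: the enterprise cannot assume every record is current.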

The Long-Run Bottom Line
The impression given by marketers is that Hadoop and its ilk are required for Big Data, where Big Data is more broadly defined as most Web-based semi-structured and unstructured data. If that is your impression, I believe it to be untrue. Instead, handling Big Data is likely to require a careful mix of relational and non-relational, data-center and extra-enterprise BI, with relational in-enterprise BI taking the lead role. And as the limits to parallel scalability of Hadoop and the like become more evident, the use of SQL-like interfaces and relational databases within Big Data use cases will become more frequent, not less.

Therefore, I believe that Hadoop and its brand of Big Data will always remain a useful but not business-critical adjunct to an overall BI and information management strategy. Instead, users should anticipate that it will take its place alongside relational access to other types of Big Data, and that the key to IT success in Big Data BI will be in intermixing the two in the proper proportions, and with the proper security mechanisms. Hadoop, MapReduce, NoSQL, and Big Data, they’re all useful – but only if you pay attention to the relational technology behind the curtain.

Pentaho and Open Source BI: The New SMB

On Monday, Pentaho, an open source BI vendor, announced Pentaho BI 4.0, the new release of its “agile BI” tool. To understand the power and usefulness of Pentaho, you should understand the fundamental ways in which the markets that we loosely call SMB have changed over the last 10 years.

First, a review. Until the early 1990s, it was a truism that computer companies in the long run would need to sell to central IT at large enterprises – else the urge of CIOs to standardize on one software and hardware vendor would favor larger players with existing toeholds in central IT. This was particularly true in databases, where Oracle sought to recreate the “nobody ever got fired for buying IBM” hardware mentality of the 1970s in software stacks. It was not until the mid-1990s that companies such as Progress Software and Sybase (with its iAnywhere line) showed that databases delivering near-lights-out administration could survive the Oracle onslaught. Moreover, companies like Microsoft showed that software aimed at the SMB could over time accumulate and force its way into central IT – not only Windows, Word, and Excel, but also SQL Server.

As companies such as IBM discovered with the bursting of the Internet bubble, this “SMB” market was surprisingly large. Even better, it was counter-cyclical: when large enterprises whose IT was a major part of corporate spend cut IT budgets dramatically, SMBs kept right on paying the yearly license fees for the apps on which they ran, which in turn hid the brand on the database or app server. Above all, it was not driven by brand or standards-based spending, nor even solely by economies of scale in cost.

In fact, the SMB buyer was and is distinctly and permanently different from the large-enterprise IT buyer. Concern for costs may be heightened, yes; but also the need for simplified user interfaces and administration that a non-techie can handle. A database like Pervasive could be run by the executive at a car dealership, who would simply press a button to run backup on his or her way out on the weekend, or not even that. The ability to fine-tune for maximum performance is far less important than the avoidance of constant parameter tuning. The ability to cut hardware costs by placing apps in a central location matters much less than having desktop storage to work on when the server goes down.

But in the early 2000s, just as larger vendors were beginning to wake up to the potential of this SMB market, a new breed of SMB emerged. This Web-focused SMB was and is tech-savvy, because using the Web more effectively is how it makes its money. Therefore, the old approach of Microsoft and Sybase when they were wannabes – provide crude APIs and let the customer do the rest – was exactly what this SMB wanted. And, again, this SMB was not just the smaller-sized firm, but also the skunk works and innovation center of the larger enterprise.

It is this new type of SMB that is the sweet spot of open source software in general, and open source BI in particular. Open source has created a massive “movement” of external programmers who have moved steadily up the software stack from Linux to BI, and in the process created new kludges that turn out to be surprisingly scalable: MapReduce, Hadoop, NoSQL, and Pentaho being only the latest examples. The new SMB is a heavy user of open source software in general, because the new open source software costs nothing, fits the skills and Web needs of the SMB, and allows immediate implementation of crude solutions plus scalability supplied by the evolution of the software itself. Within a very few years, many users, rightly or wrongly, were swearing that MySQL was outscaling Oracle.

Translating Pentaho BI 4.0
The new features in Pentaho BI can be simply put, because the details simply show that they deliver what they promise:

· Simple, powerful interactive reporting – which apparently tends to be used more for ad-hoc reporting than traditional enterprise reporting, but can do either;
· A more “usable” and customizable user interface with the usual Web “sizzle”;
· Data discovery “exploration” enhancements such as new charts for better data visualization.

These sit atop a BI tool that distinguishes itself by “data integration” that handles an exceptional number of input data warehouses and data stores, inhaling them into a temporary “data mart” for each use case.

With these features, Pentaho BI, I believe, is valuable especially to the new type of SMB. For the content-free buzzword “agile BI”, read “it lets your techies attach quickly to your existing databases as well as Big Data out there on the Web, and then makes it easy for you to figure out how to dig deeper as a technically minded user who is not a data-mining expert.” Above all, Pentaho has the usual open source model, so it’s making its money by services and support – allowing the new SMB to decide exactly how much to spend. Note also Pentaho’s alliance not merely with the usual cloud open source suspects like Red Hat but also with database vendors with strong BI-performance technology such as Vertica.

The BI Bottom Line
No BI vendor is guaranteed a leadership position in cloud BI these days – the field is moving that fast. However, Pentaho is clearly well suited to the new SMB, and also understands the importance of user interfaces, simplicity for the administrator, ad hoc querying and reporting, and rapid implementation to both new and old SMBs.

Pentaho therefore deserves a closer look by new-SMB IT buyers, either as a cloud supplement to existing BI or as the core of low-cost, fast-growing Web-focused BI. And, remember, these have their counterparts in large enterprises – so those should take a look as well. Sooner than I expected, open source BI is proving its worth.

Wednesday, June 22, 2011

This Made Me Sick To My Stomach

I just finished reading the report summary of the International Earth system expert workshop on ocean stresses and impacts, released on Monday. Here is what I think the headline of that report should have been:

The Oceans Are Mostly DEAD Unless We Reduce Carbon Emissions Drastically AND Set Up a Global Police Over ALL Ocean Uses NOW

I won’t bother with the evidence of cascading destruction, with more to come, and the explanation of how much of what we do now that affects the oceans reinforces that cascade; that has been covered somewhat in Joe Romm’s blog. I will simply note that the end point will be a mass ocean species destruction comparable to any in the past, plus massive ocean acid “dead zones” where nothing can live and a time to recover in the thousands of years. One species that might survive is jellyfish – and it has “low nutritional value,” i.e., you can’t live on jellyfish.

The one thing that no one seems to be covering is what they say we should do to avoid this. They say that everyone on Earth must stop all ocean misuses now, and to do this the UN should set up a global enforcement body. The burden of proof will be on all ocean users – yes, that includes ocean liners and drillers for oil, gas, and minerals – to show that their next use will not be harmful, else they can’t do it. Contributions to the body would be mandatory, and it would have jurisdiction over the “High Seas” that aren’t the property of particular nations, but obviously it will affect waters that are now said to be the property of particular nations, as well as fisheries.

If you want to go fish, get permission from the global enforcement body. If you want to ship components from abroad, get permission from the global enforcement body. If you want to drill in the Arctic now that it’s getting warmer and less icy, get permission. If you dump fertilizer and waste into rivers and it’s washed out to sea, the commission will be after you. And the commission’s key criteria will be: Does this add to the carbon footprint? Does this make a dead ocean more likely? Is this a sustainable use?

As far as I can see, the only reason the workshop would recommend such a thing is that the situation is that serious. And it’s serious not only because we lose seafood, but because the ocean will reach its capacity for absorbing the excess carbon we’re dumping in the atmosphere, and then global warming on land will get worse, faster than we expect even now. That would lead to faster sea rise; more massive storms that “salt” estuaries producing a significant proportion of the world’s food; more droughts over much of the world that desertify another major source of the world’s food; and possibly to “toxic blooms” in the waters next to the land that periodically release toxic gases, killing those living on the shore.

Just thinking about the ocean, near which I have lived for most of my life, being dead makes me sick to my stomach.

Sunday, June 12, 2011

In the End, Gödel Has Won

This post was originally written last fall, and set aside as being too speculative. I felt that there was too little evidence to back up my idea that “accepting limits” would pay off in business.

Since then, however, the Spring 2011 edition of MIT Sloan Management Review has landed on my desk. In it, a new “sustainability” study shows that “embracers” are delivering exceptional comparative advantage, and that a key characteristic of “embracers” is that they view “sustainability” as a culture to be “wired into the business” – “it’s the mindset”, says Bowman of Duke Energy. According to Wikipedia, the term “sustainability” itself is fundamentally about accepting limits, including environmental “carrying capacity” limits, energy limits, and limits in which use rates don’t exceed regeneration rates.

This attitude is in stark contrast to the attitude pervading much of human history. I myself have grown up in a world in which one of the fundamental assumptions, one of the fundamental guides to behavior, is that it is possible to do anything. The motto of the Seabees in World War II, I believe, was “The difficult we do immediately; the impossible takes a little longer.” Over and over, we have believed adjustments in the market, inventions and advances, daring to try something else, an all-out effort, something, anything, can fix any problem.

In mathematics, too, it was believed at the turn of the twentieth century that any problem was solvable: that any truth of any consistent, infinite mathematical system could be proved. And then Kurt Gödel came along and showed that in every such system, either you could not prove all truths or you could also prove false things – one or the other. And over the next thirty years, mathematics applied to computing showed that some problems were unsolvable, and that others had a fundamental lower bound on solution time that meant they could not be solved before the universe ended. By accepting these limits, mathematics and programming have flourished.

This mindset is fundamentally different from the “anything is possible” mindset. It says to work smarter, not harder, by not wasting your time on the unachievable. It says to identify the highly improbable up front and spend most of your time on solutions that don’t involve that improbability. It says, as agile programming does, that we should focus on changing our solutions as we find out these improbabilities and impossibilities, rather than piling on patch after patch. It also says, as agile programming does, that while by any short-run calculation the results of this mindset might seem worse than the results of the “anything is possible” mindset, over the long run – and frequently over the medium term – it will produce better results.

It seems more and more apparent to me that we have finally reached the point where the “anything is possible” approach is costing us dearly. I am speaking specifically about climate change – one key driver for the sustainability movement. The more I become familiar with the overwhelming scientific evidence for massive human-caused climate change and the increasing inevitability of at least some major costs of that change in every locality and country of the globe, the more I realize that an “anything is possible” mentality is a fundamental cause of most people’s failure to respond adequately so far, and a clear predictor of future failure.

Let me be more specific: as noted in the UN scientific conferences and recent additional data, “business as usual” is leading us to a carbon dioxide concentration of 1000 ppm in the atmosphere, of which about 450 ppm – 150-200 ppm over the natural amount – is already “baked in”. This will result, at minimum, in global increases in temperature of 5-10 degrees Fahrenheit. That, in turn, will mean order-of-magnitude increases in the damage caused by extreme weather events; the extinction of many ecosystems supporting existing urban and rural populations (because many of these ecosystems are blocked from moving north or south by paved human habitations), so that food and shelter production must both change location and find new ways to deliver to new locations; movement of all populations from seacoast locations up to 20 feet above existing sea level; and adjustment of a large proportion of heating and cooling systems to a new mix of the two – not to mention drought, famine, and economic stress. And these are just the effects over the next 60 or so years.

Adjusting to this will place additional costs on everyone, very possibly similar to a 10% tax yearly on every individual and business in every country for the next 50 years, no matter how wealthy or adept. Continuing “business as usual” for another 30 years would result in a similar, almost equally costly additional adjustment.

Our response to this so far has been in the finest tradition of “anything is possible”. We search for technological fixes under the belief that they will solve the problem, since they appear to have done so before. Most of us – except the embracers – assume that existing business incentives, focused on cutting costs, will somehow respond years before the impact begins to be felt, even though those costs have not yet occurred. (Embracers, by the way, actively seek out new metrics to capture things like the negative effects of carbon emissions.) We are skeptical and suspicious, since those who have predicted doom before, for whatever reason, have generally seemed to have turned out to be wrong. We hide our heads in the sand, because we have too much else to do and concerns that seem more immediate. We are distracted by possible fixes, and by their flaws.

The “embrace limits” mindset for climate changes makes one simple change: accept steady absolute reductions in carbon emissions as a limit. For example, every business, every country, every region, every county accepts that every year, its emissions are to be reduced by 1% in that year. If a business, that business also accepts that its products’ emissions are to be reduced by 1% in that year, no matter how successful the year has been. If a locality does better one year, it still is expected not to increase emissions the next year. If a country rejects this idea, investments from conforming countries are reduced by 1% each year, and products accepted from that country are expected to comply.
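The arithmetic behind that limit is worth making explicit, since a steady 1% cut compounds. A two-line sketch (the baseline of 100 is arbitrary, for illustration):

```python
# A steady 1% annual cut compounds: after n years, emissions stand at
# (0.99 ** n) of the baseline, regardless of how each year went.

def emissions_after(years, baseline=100.0, annual_cut=0.01):
    return baseline * (1 - annual_cut) ** years

print(round(emissions_after(30), 1))  # 74.0 -- about a 26% total cut in 30 years
```

In other words, even this modest-sounding limit, held to without exception, delivers a substantial absolute reduction over decades – the point of accepting it as a limit rather than an aspiration.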

But this is a crude, blunt-force suggested application of “embrace limits”. There are all sorts of other applications. Investors will no longer invest in equities that seem to promise 2% long-term returns above historical norms, and will limit the amount of their capital invested in “bets,” because those investments are overwhelmingly likely to be con jobs. Project managers will no longer use metrics like time to deployment, but rather “time to value” and “agility”, because there is a strong possibility that during the project, the team will discover a limit and need to change its objective.

Because, fundamentally, climate change is a final, clear signal that Gödel has won. Whether we accept limits or not, they are there; and the less we accept them and use them to work smarter, the more it costs us.