Wednesday, July 29, 2015

Climate Change: The Poison of the Pseudo-Reasonable Economist

In the dog days of summer, as the average global land temperature for the first six months of the year runs a whopping two-thirds of a degree Fahrenheit above any recorded temperature before it, and is almost certainly as hot as or hotter than any time in the last million years, I find myself musing on one pernicious form of climate change obstructionism.  Not climate change deniers -- their lies have been endlessly documented, and the contrary evidence from climate scientists' accumulating data continues to mount.  No, I am speaking rather of fundamentally flawed but seemingly rigorous arguments, especially from economists, that serve in the real world only to detract from the urgent message of climate change, and the will to face that message, by understating its likely impacts.
I gathered these two examples of the genre from the blog of Brad deLong, economics professor at Berkeley.  I hope Prof. deLong will not be offended if I describe him as the packrat of economic theory (net-net, it's a compliment); he seems to publish both the realistic alarms of Joe Romm and the examples I am about to cite with equal gusto, and to do the same with some of the more dubious efforts of Milton Friedman and Martin Feldstein in other areas.  In this case, I am going to use the screeds of Robert Pindyck and Martin Weitzman, cited over the last month, as examples of this kind of pseudo-reasonable economic analysis.

Weitzman:  The Black Swan Is a Red Herring

Weitzman's recurring argument, which he has been making for a long time now, is that a "serious" effect of climate change is unlikely -- he calls it a "black swan" event -- but that because it could have unspecified catastrophic consequences, we should try to plan for it, just as a business plans for unlikely concerns like Hurricane Sandy in its "risk management" policies.  Sounds reasonable, doesn't it?  Except that, by any reasonable analysis of climate change's fundamental model and how it has played out over at least the last five years, merely catastrophic consequences are far more likely than non-catastrophic ones, and catastrophic consequences beyond what Prof. Weitzman seems to be contemplating are the likeliest of all.
When I first began reading up on the field six years ago, it was still possible to argue that the very conservative IPCC 2007 model (whose most likely scenario, assuming everyone started doing something about climate change, projected a little less than 2 degrees Celsius of global temperature increase) was at least plausible.  After all, back then, the data on Arctic sea ice, Greenland ice melt, and Antarctic ice melt were still not clearly and permanently above the IPCC track -- not to mention the fact that permafrost had not clearly started melting.
However, even at that time it was clear to me from my reading that, in all likelihood, the IPCC and similar models were understating the case.  They were not considering feedback effects from Arctic sea ice melt that was well in advance of "around 2100, if that" predictions, nor the effects of permafrost melt that was very likely to come.  I would also admit that I thought the effects of climate change on weather in the US and Europe would not be visible and obvious enough for political action until around 2020.   Meanwhile, Weitzman (2008) was publishing a paper that argued that climate science simply couldn't provide enough exact predictions about temperature increase and the like to make catastrophic climate change anything but a highly unlikely event in economic modeling.
Well, here we are 7 years later.  Prof. Hansen has crystallized the most likely scenario by analyzing data on the last such event, 55 million years ago, and showing that a doubling of atmospheric carbon translates to a 4 degrees Celsius increase in global temperature -- two-thirds from the carbon itself, and one-third from related greenhouse-gas emissions and feedback effects.  Moreover, a great deal has been done to elaborate on the more immediate weather and "catastrophic" effects of this increase, from Dust-Bowl-like drought in most of the US and much of Europe by the end of this century to sea level rises of at least 10 feet worldwide -- plus the extension of salt-water poisoning of agriculture and water supplies by an additional 10 feet due to more violent storms.
I cannot say that I am surprised by any of this.  I can also say that I see no sign in Prof. Weitzman's comments that he has even noticed it -- despite the fact that, according to Hansen's analysis, we have already blown past 2 degrees Celsius in long-run temperature increases and are beginning to talk about halting emissions growth at 700 ppm or about 5.5 degrees Celsius.   No, according to Prof. Weitzman, catastrophic climate change continues to be a "black swan" event.
So the fundamental assumption of Weitzman's statistical analysis is completely wrong -- but why should we care, if it gets people to pay attention?  Except that, as our entire history has confirmed and the last six years have reconfirmed, when people are told that something is pretty unlikely, they typically take their time doing something about it.  As temperature increases mount, the amount of catastrophe to be coped with and the amount of effort needed to avoid further increases mount exponentially.  Just as with comets or asteroids striking the Earth -- but with much less justification -- appeals to "risk management" and "black swans" give us license to keep taking our time.  No, when you deny for six years the ever-clearer message of climate science that the climate-change forces causing catastrophic effects are likely, quantifiable on average, and large, you might as well be a climate change denier.

Pindyck:  The Discount Rate of Death

I must admit, when I saw the name Pindyck but not the conclusions of his paper, I was prepared to be fascinated.  I have always regarded his book on econometric modeling, which I first read in the late 1970s, as an excellent summary of the field, still useful after all these years.  You can imagine my surprise when I found him echoing Weitzman about how "climate science simply isn't sure about the extent and impacts of climate change, and therefore we should treat those impacts as unlikely".  But my jaw almost became unhinged when I read that "we really have no idea what the discount rate [for a given climate-change-inspired policy action in a cost-benefit analysis] should be", and so we should not even attempt to model the costs and benefits of climate change action except in wide-range probabilistic terms.
IIRC, the discount rate in a business-investment analysis is the rate used to translate future costs and benefits into present-day dollars -- in effect, the rate of return that will justify investing in a project.  Now, there are workarounds to estimate probabilities, and therefore at least an approximate return on investment, for a particular investment -- but that isn't the source of my bemusement.  Rather, it's the notion that one cannot come up with a discount rate for a climate-change-mitigation investment compared with an alternative, and that therefore one cannot do model-based cost-benefit analysis.
Here's my counter-example.  Suppose a company must choose between two investments.  One returns 5% per year over the next 5 years.  The second contains exposed asbestos; it returns 10% per year over the same 5 years, but 20 years from now everyone in the company during that period will die and the company will fold.  What is the discount rate under which the company should choose investment 2?  It's a trick question, obviously:  the discount rate for investment 2 would have to be infinite to wash out its effectively infinite costs, and therefore there is no such discount rate.
But that's my point.  The costs of climate change are likely and catastrophic, and so you need a really high discount rate to justify the alternative of "business as usual".  The only way you can get a low enough discount rate to justify "business as usual" is to assume that climate change catastrophe is very unlikely. And so, as far as I'm concerned, the "we don't know the discount rate" argument takes us right back to Weitzman's and Pindyck’s "climate change catastrophe is unlikely."  
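To make the arithmetic concrete, here is a minimal sketch in Python, with entirely made-up numbers rather than anyone's actual climate-cost model, that searches for the discount rate at which the asbestos-laden investment 2 starts to beat investment 1.

```python
# Illustrative only: toy numbers, not a real cost-benefit model.
def npv(cash_flows, rate):
    """Net present value of (year, amount) cash flows at a given discount rate."""
    return sum(amount / (1.0 + rate) ** year for year, amount in cash_flows)

principal = 1_000_000.0
# Investment 1: 5% of principal per year for 5 years.
inv1 = [(year, 0.05 * principal) for year in range(1, 6)]

def inv2(catastrophe_cost):
    # Investment 2: 10% per year for 5 years, then a catastrophic cost in year 20.
    return [(year, 0.10 * principal) for year in range(1, 6)] + [(20, -catastrophe_cost)]

for cost in (1e8, 1e10, 1e12):  # ever-larger stand-ins for an "infinite" cost
    # Brute-force search for the smallest rate at which investment 2 wins.
    rate = next(r / 100.0 for r in range(1, 100_000)
                if npv(inv2(cost), r / 100.0) > npv(inv1, r / 100.0))
    print(f"catastrophe cost {cost:.0e}: investment 2 wins only above a {rate:.0%} discount rate")
```

With these toy numbers, the breakeven rate climbs from roughly 40% to well over 100% per year as the assumed catastrophe grows, and it grows without bound as the cost does; "we don't know the discount rate" quietly assumes the catastrophe away.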
Thus, Pindyck's discount-rate argument is also a red herring, and a particularly dangerous one:  it shifts the playing field from climate science, where climate scientists can easily refute him, to the arcana of econometrics, where they cannot.  Not only does Pindyck fail on the climate science; he uses that failure to cloak inaction in pseudo-economic jargon.  And so, when Pindyck's "analysis" winds up making it even harder than Weitzman's to argue for climate-change action, I regard it as particularly poisonous in effect.

The So-Called deLong Smackdown

Prof. deLong occasionally publishes in his blog an article he calls a "smackdown" -- a piece that corrects him on something he views himself as having gotten wrong.  Frankly, I don't view the above as a smackdown, although I do wish he and Prof. Krugman would admit that they underestimated gold's disadvantages by comparing it to the S&P 500 index rather than the S&P 500 total return index.  Rather, I view this as a wake-up call to both of them, if they truly want economics to deal with the real world.  As Joe Romm points out, underestimation of effects by the IPCC for the sake of absolute sureness about a minimum effect is not new, nor is the extensive body of literature painting a picture that is both far more somber and far better reflected in current real-world weather and climate.
But what are we to make of Weitzman and Pindyck, who apparently have been denying that literature, and then using that denial to peddle a far weaker reason for action, for the last six years or so?  Three years, maybe, since the Arctic sea ice shrank to a dramatic new low only in 2012; but six?  No; unless we succumb entirely to the old NPR comedy routine "It's Dr. Science! He's smarter than you are!", this behavior is disingenuous and has poisonous effects.  And, because any sort of modeling of the medium-term future should take account of economic effects, it hinders real-world planning just as much as real-world action.  Heckuva job, economists – not.

Monday, July 27, 2015

In-Memory Computing Summit 2: Database Innovation

In the late 1990s, my boss at the time at Aberdeen Group asked me a thought-provoking question:  Why was I continuing to cover databases?  After all, he pointed out, it seemed at first glance like a mature market – the pace of technology innovation had slowed, and nothing important seemed to be on the horizon.  Moreover, consolidation meant that there were fewer and fewer suppliers to cover.  My answer at the time was that users in "mature" markets like the mainframe still needed advice on key technologies – advice they would not otherwise get, because analysts following my boss's logic would flee the field. 
However, as it turned out, this was not the right reason at all – shortly thereafter, the database field saw a new round of innovations centering on columnar databases, data virtualization, and low-end Linux efforts that tinkered with the ACID properties of relational databases.  These, in turn, led to the master data management, data governance, global repository, and columnar/analytic database technologies of the late 2000s.  In the early 2010s, we saw the Hadoop-driven “NoSQL” loosening of commit constraints to enable rapid analytics on Big Data by sacrificing some data quality.
As a result, the late-1990s “no one ever got fired for buying Oracle” mature market wisdom is now almost gone from memory – and databases delivering analytics are close to the heart of many firms’ strategies.  And so, it seems that the real reason to cover databases today is that their markets, far from being mature, are rapidly spawning new entrants and offering many technology-driven strategic reasons to upgrade and expand.
The recent 2015 In-Memory Computing Summit suggests that a new round of database innovation, driven by the needs listed in my first post, is bringing changes to user environments especially in three key areas:
1.       Redefinition of data-store storage tiering, leading to redesign of data-access software;
2.       Redefinition of “write to storage” update completion, allowing flexible loosening of availability constraints in order to achieve real-time or near-real-time processing of operational and sensor-driven Big Data; and
3.       Balkanization of database architectures, as achieving Fast Data turns out to mean embedding a wide range of databases and database suppliers in both open source and proprietary forms.

Real-Time Flat-Memory Storage Tiering:  Intel Is a Database Company

All right, now that I’ve gotten your attention, I must admit that casting Intel as primarily a database company is a bit of an exaggeration.  But there’s no doubt in my mind that Intel is now thinking about its efforts in data-processing software as strategic to its future.  Here’s the scenario that Intel laid out at the summit:
At present, only 20% of a typical Intel CPU is being used, and the primary reason is that it sits around waiting for I/O to occur (i.e., for needed data to be loaded into main memory) – or, to use a long-disused phrase, applications running on Intel-based systems are I/O-bound.  To fix this problem, Intel aims to deliver faster I/O – or, equivalently, the ability to service the I/O requests of multiple concurrently running applications more quickly.  Since disk does not offer much prospect for I/O-speed improvement, Intel has proposed a software protocol standard, NVRAM(e), for flash memory.  However, to ensure that this protocol does indeed speed things up adequately, Intel must write the necessary data-loading and data-processing software itself.
So will this be enough for Intel, so that it can go back to optimizing chip sets? 
Well, I predict that Intel will find that speeding up I/O from flash storage, which treats flash purely as storage, will not be enough to fully optimize I/O.  Rather, I think that the company will also need to treat flash as an extension of main memory:  Intel will need to virtualize (in the old sense of virtual memory) flash memory and treat main memory and flash as if they were on the same top storage tier, with the balancing act between faster-response main memory and slower-response flash taking place "beneath the software covers."  Or, to coin another phrase, Intel will need to provide virtual processors handling I/O as part of their DNA.  And from there, it is only a short step to handling the basics of data processing in the CPU -- as IBM is already doing via IBM DB2 BLU Acceleration.
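As a crude illustration of what "treating flash as an extension of main memory" might look like from the software side, here is a toy Python sketch that memory-maps a file assumed to live on a flash-backed filesystem and then reads and writes it as one flat byte array, letting the operating system keep the hot pages in DRAM.  This is my own strawman of the general idea, not Intel's design; the mount point and sizes are invented.

```python
# Toy sketch: treat a flash-backed file as a flat extension of main memory by
# memory-mapping it.  The path and sizes are invented; this is not Intel's design.
import mmap
import os

FLASH_BACKED_FILE = "/mnt/flash/flat_memory.bin"   # hypothetical flash mount point
SIZE = 1 << 30                                      # 1 GiB of "extended memory"

# Create the backing file once, sized to the tier we want to treat as memory.
with open(FLASH_BACKED_FILE, "wb") as f:
    f.truncate(SIZE)

fd = os.open(FLASH_BACKED_FILE, os.O_RDWR)
flat = mmap.mmap(fd, SIZE)    # one flat address space spanning DRAM cache + flash

# The application simply reads and writes offsets; whether a given page is
# currently in DRAM or only on flash is the "balancing act beneath the
# software covers" that the OS and hardware handle.
flat[0:8] = (42).to_bytes(8, "little")
print(int.from_bytes(flat[0:8], "little"))   # 42

flat.flush()                  # push dirty pages down to the flash tier
flat.close()
os.close(fd)
```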

Real-Time Flat-Memory Storage Tiering:  Redis Labs Discovers Variable Tiering

"Real time" is another of those computer science phrases that I hate to see debased by marketers and techie newbies.  In the not-so-good old days, it meant processing and responding to every single (sensor) input in a timely fashion (usually less than a second) no matter what.  That, in turn, meant always-up and performance-optimized software aimed specifically at delivering on that "no matter what" guarantee.  Today, the term seems to have morphed into one in which basic old-style real-time stream data processing (e.g., keep the car engine running) sits cheek-by-jowl with do-the-best-you-can stream processing involving more complex processing of huge multi-source sensor-data streams (e.g., check if there's a stuck truck around the corner you might possibly bash into).  The challenge in the second case (complex processing of huge data streams) is to optimize performance and then prioritize speed-to-accept-data based on the data's "importance".
I must admit to having favorites among vendors based on their technological approach and the likelihood that it will deliver new major benefits to customers, in the long run as well as the short. At this conference, Redis Labs was my clear favorite.  Here's my understanding of their approach:
The Redis architecture begins with a database cluster that comes in several variants, allowing users to trade off failover/high availability against performance and to make maximum use of main memory and processors in a scale-out environment.  Then, however, the Redis Labs solution focuses on "operational" processing (online, real-time, update/modify-heavy, sensor-driven transaction processing).  To do this, of course, the Redis Labs database puts data in main memory where possible.  Where that is not possible (according to the presentation), it treats flash as if it were main memory, mimicking flat-memory access on flash interfaces.  As the presenter put it, at times of high numbers of updates, flash is main memory; at other times, it's storage.
Redis Labs cited numerous benchmarks to show that the resulting database is the fastest kid on the block for "operational" data streams.  To me, that is a side effect of a very smart approach to performance optimization, one that crucially includes the ideas of using flash as if it were main memory and of varying the use of flash as storage – meaning that sometimes all of flash is traditional tier-2 storage and sometimes all of flash is tier-1 processing territory.  And, of course, in situations where main memory and flash are all that is typically needed for processing and storage, we might as well junk the tiering idea altogether:  it's all flat-memory data processing.
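Here is a deliberately simplified sketch of the "flash is main memory at busy times, storage otherwise" idea as I understand it from the presentation.  It is a toy placement policy of my own, not Redis Labs' code; the thresholds, rate window, and tier objects are all invented for illustration.

```python
# Toy sketch of variable tiering (invented thresholds; not Redis Labs' code).
import time

ram = {}                # fastest tier
flash_as_memory = {}    # flash accessed through a flat-memory interface
flash_as_storage = {}   # flash in its traditional write-to-storage role

RAM_BUDGET_BYTES = 64 * 2**20     # made-up per-node RAM budget
HIGH_UPDATE_RATE = 50_000         # made-up updates/second threshold

ram_bytes = 0
updates_this_second = 0
window_start = time.time()

def put(key, value):
    """Place each update according to RAM headroom and current update load."""
    global ram_bytes, updates_this_second, window_start
    now = time.time()
    if now - window_start >= 1.0:                 # crude one-second rate window
        updates_this_second, window_start = 0, now
    updates_this_second += 1

    if ram_bytes + len(value) <= RAM_BUDGET_BYTES:
        ram[key] = value                          # fits: keep it in main memory
        ram_bytes += len(value)
    elif updates_this_second > HIGH_UPDATE_RATE:
        flash_as_memory[key] = value              # busy: flash acts as main memory
    else:
        flash_as_storage[key] = value             # quiet: flash stays storage

def get(key):
    for tier in (ram, flash_as_memory, flash_as_storage):
        if key in tier:
            return tier[key]
    return None

put("sensor:42", b"temperature=71F")
print(get("sensor:42"))
```

The real engineering, of course, is in making the flat-memory flash interface fast and in picking the thresholds; the point here is only the shape of the policy.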

memSQL and Update Completion

Redis Labs' approach may or may not complete the optimizations needed for flash-memory operational data processing.  At the Summit, memSQL laid out the most compelling approach to the so-called "write-to-storage" issue that I heard. 
I first ran into write-to-storage when I was a developer attending meetings of the committee overseeing the next generation of Prime Computer's operating system.  As they described it, in those days there was main-memory storage, which vanished whenever you turned off your machine or the system crashed, and there was everything else (disk and tape, mostly), which kept information for the long term whether the system crashed or not.  So database data, files, or anything else new or changed didn't really "exist" until it was written to disk or tape.  And that meant that further access to that data (modifications or reads) had to wait until the write to disk had finished.  Not a major problem in the typical case; but where performance needs for operational (or, in those days, online transaction processing/OLTP) data processing required it, write-to-storage was and is a major bottleneck.
memSQL, sensibly enough, takes the "master-slave" approach to addressing this shortcoming.  While the master software continues on its merry way with the updated data in main memory, the slave busies itself with writing the data to longer-term storage tiers (including flash used as storage).  Problem solved?  Not quite.
If the system crashes after an update has arrived but before the slave has finished writing it (a less likely occurrence, but still possible), then that update – and any further change made on top of it in the meantime – is lost.  However, in keeping with the have-it-your-way approach of Hadoop, memSQL allows the user to choose the tradeoff between raw performance and what it calls a "high availability" configuration.  And so, flash plus master-slave processing plus a choice of "availability" means that performance is increased in both operational and analytical processing of sensor-type data, the incidence of the write-to-storage problem is decreased, and the user can flexibly choose to accept some data loss to achieve the highest performance, or vice versa.
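As a rough illustration of the tradeoff described above, here is a Python sketch of a store that acknowledges updates as soon as they are in main memory and lets a background "slave" thread persist them, with a flag that trades durability for speed.  The class, flag, and log format are my inventions, not memSQL's implementation or configuration syntax.

```python
# Toy sketch of the master/slave write-behind tradeoff (names and the
# durability flag are invented; this is not memSQL's code or configuration).
import queue
import threading

class WriteBehindStore:
    def __init__(self, log_path, synchronous=False):
        self.data = {}                      # "master": latest values in main memory
        self.synchronous = synchronous      # True = wait for persistence ("high availability")
        self.log = open(log_path, "a")
        self.pending = queue.Queue()
        threading.Thread(target=self._slave, daemon=True).start()

    def put(self, key, value):
        self.data[key] = value              # master continues on its merry way
        if self.synchronous:
            self._persist(key, value)       # slower, but survives a crash
        else:
            self.pending.put((key, value))  # faster; a crash right here loses the update

    def _slave(self):
        while True:                         # "slave" drains the backlog to storage
            key, value = self.pending.get()
            self._persist(key, value)

    def _persist(self, key, value):
        self.log.write(f"{key}\t{value}\n")
        self.log.flush()

fast = WriteBehindStore("/tmp/updates.log", synchronous=False)   # speed over safety
fast.put("sensor:7", "42")
```

Setting synchronous=True recovers the stricter guarantee at the cost of waiting on the storage tier for every update, which is the kind of choice memSQL, per its presentation, exposes to the user.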

The Balkanization of Databases:  The Users Begin To Speak

Balkanization may be an odd phrase for some readers, so let me add a little pseudo-history here.  Starting a bit before 100 BC, the Roman Empire had the entire region from what is now Hungary to northern Greece (the "Balkans") under its thumb.  In the late 400s, a series of in-migrations led to a series of new occupiers of the region, and some splintering, but around 1450 the Ottoman Empire again conquered the Balkans.  Then, in the 1800s, nationalism arrived, Romantics dug up or created rationales for much smaller nations, and the whole Balkan area irretrievably broke up into small states.  Ever since, such a fragmentation has been referred to as the "Balkanization" of a region.
In the case of the new database scenarios, users appear to be carving out similar smaller territories via a multitude of databases.  One presenter's real-world use case involved HP Vertica, among others, on the analytics side, and several NoSQL databases, including MongoDB, on the operational side.  I conclude that, within enterprises and in public clouds, there is a strong trend towards Balkanization of databases.
That is new.  Before, the practice was always for organizations to at least try to minimize the number of databases, and for major vendors to try to beat out or acquire each other.  Now, I see more of the opposite, because (as memSQL's speaker noted) it makes more sense if one is trying to handle sensor-driven data to go to the Hadoop-based folks for the operational side of these tasks, and to the traditional database vendors for the analytical side.  Given that the Hadoop-based side is rapidly evolving technologically and spawning new open-source vendors as a result, it is reasonable to expect users to add more database vendors than they consolidate, at least in the near term.  And in the database field, it's often very hard to get rid of a database once installed. 

SAP and Oracle Say Columnar Analytics Is "De Rigueur"

Both SAP and Oracle presented at the Summit.  Both had a remarkably similar vision of "in-memory computing" that involved primarily columnar relational databases and in-main-memory analytics.
In the case of SAP, that perhaps is not so surprising.  SAP HANA marketing has featured its in-main-memory and columnar-relational technologies for some time.  Oracle’s positioning is a bit more startling:  in the past, its acquisition of TimesTen and its development of columnar technologies had been treated in its marketing as a bit more of a check-list item -- yeah, we have them too, now about our incredibly feature-full, ultra-scalable traditional database ...
Perhaps the likeliest answer to why both Oracle and SAP were there talking about columnar is that, for flat-memory analytics, columnar's ability to compress data and hence fit it into the main-memory and/or flash tier more frequently trumps traditional row-oriented relational strengths where joins involving fewer than three compressible rows are concerned.  Certainly, the use case cited above, in which HP Vertica's columnar technology was called into service, makes the same point.
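To make the compression point concrete, here is a small Python sketch with toy data (not any vendor's engine) showing why a column of repetitive values run-length-encodes down to a fraction of its row-store footprint, which is exactly what lets an analytic working set fit in the main-memory/flash tier.

```python
# Toy illustration of why columnar storage compresses well (not any vendor's engine).
import itertools
import json

# 100,000 "rows" of sensor readings; region and status repeat heavily.
rows = [{"region": "west", "status": "ok", "reading": i % 100} for i in range(100_000)]

# Row store: every row carries every field.
row_bytes = sum(len(json.dumps(r)) for r in rows)

# Column store with run-length encoding: each column becomes (value, run_length) pairs.
def rle(column):
    return [(value, sum(1 for _ in run)) for value, run in itertools.groupby(column)]

columns = {name: rle([r[name] for r in rows]) for name in ("region", "status", "reading")}
col_bytes = sum(len(json.dumps(c)) for c in columns.values())

print(f"row store:    ~{row_bytes / 1024:.0f} KB")
print(f"column store: ~{col_bytes / 1024:.0f} KB after RLE")
```

With this toy data, the columnar layout comes out several times smaller even though one of the three columns barely compresses; on heavily repetitive columns the ratio becomes far more dramatic.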
And yet the rise in columnar's importance in the new flat-memory systems also reinforces the Balkanization of databases, if subtly.  In Oracle's case, it changes the analytical product mix.  In SAP's case, it reinforces the value of a relatively new entrant into the database field.  In HP's case, it brings in a relatively new database from a vendor relatively new to the database field, one likely to be new to, or little used by, the user before.  Even within the traditionally non-Balkanized turf of analytical-database vendors, some effective Balkanization is beginning to happen, and one of its key driving forces is the usefulness of columnar databases in sensor-driven-data analytics.

A Final Note:  IBM and Cisco Are Missing At the Feast But Still Important

Both IBM's DB2 BLU Acceleration and Cisco Data Virtualization, imho, are important technologies in this Brave New World of flat-memory database innovation; but neither was presented at the Summit.  That may be because the Summit was a bit Silicon-Valley-heavy, but I don't know for sure.  I hope at some point to give a full discussion of the assets these products bring to the new database architectures, but not today.  Hopefully, the following brief set of thoughts will give an idea of why I think them important.
In the case of IBM, what is now DB2 BLU Acceleration anticipated and leapfrogged the Summit in several ways, I think.  Not only did BLU Acceleration optimize main-memory and to some extent flash memory analytics using the columnar approach; it also optimized the CPU itself.  Among several other valuable BLU Acceleration technologies is one that promises to further speed update processing and, hence, operational-plus-analytic columnar processing.  The only barrier -- and so far, it has proved surprisingly high -- is to get other vendors interested in these technologies, so that database "frameworks" which offer one set of databases for operational and another for analytical processing can incorporate "intermediate" choices between operational and analytic, or optimize operational processing yet further.
In the case of Cisco, its data virtualization capabilities offer a powerful approach to creating a framework for the new database architecture along the lines of a TP monitor -- and so much more.  The Cisco Data Virtualization product is pre-built to optimize analytics and update transactions across scale-out clusters, so is well acquainted with all but the very latest Hadoop-based databases, and has excellent user interfaces.  It can also serve as a front end to databases within a slot/"framework", or as a gateway to the entire database architecture.  As I once wrote, this is amazing "Swiss army knife" technology -- there's a tool for everything.  And for those in Europe, Denodo’s solutions are effective in this case as well.
I am sure that I am leaving out important innovations and potential technologies here.  That's how rich the Summit was to a database analyst, and how exciting it should be to users.
So why am I a database analyst, again?  I guess I would say, for moments like these.

Thursday, July 9, 2015

In-Memory Computing Summit 1: The Database Game, The Computing Game, Is Changing Significantly


So why was I, a database guy, attending last Monday what turned out to be the first ever In-Memory Computing Summit – a subject that, if anything, would seem to relate more to storage tiering?  And why were heavy-hitter companies like Intel, SAP, and Oracle, not to mention TIBCO and Hitachi Data Systems, front and center at this conference?  Answer:  the surge in flash-memory capacity and price/performance compared to disk, plus the advent of the Internet of Things and the sensor-driven web, is driving a major change in the software we need, both in the area of analytics and operational processing.  As one presentation put it, software needs to move from supporting Big Data to enabling Fast (but still Big) Data.
In this series of blog posts, I aim to examine these major changes as laid out in the summit, as well as their implications for databases and computing IT.  In the first post, I’d like to sketch out an overall “vision”, and then in later posts explore the details of how the software to support this is beginning to arrive.

The Unit:  The “Flat-Memory” System

In an architecture that can record and process massive streams of "sensor" data (including data from mobile phones and from hardware generating information for the Internet of Things), there is a premium on "stream" processing of incoming data in real time, and on transactional writes in addition to the reads and queries of analytical processing.  The norm for systems handling this load, in the new architecture, is two main tiers:  main-memory RAM and "flash" or non-volatile memory/NVRAM, with roughly three orders of magnitude more flash than RAM.  This may seem like hyperbole when we are talking about Big Data, but in point of fact one summit participant cited a real-world system using 1 TB of main memory and 1 PB of flash.
Fundamentally, flash memory is like main-memory RAM:  more or less all addresses in the same tier take an equal amount of time to read or change.  In that sense, both tiers of our unitary system are "flat memory", unlike disk, which has spent many years fine-tuning performance that can vary widely depending on the data's position on the disk.  To ease its initial introduction, flash was given interfaces to CPUs that mimic disk access, and that therefore make flash's data access both variable and slow (compared to a flat-memory interface).  For the most part, then, NVRAM in our unitary system will shed this performance-clogging software layer and be accessed in much the same way that main-memory RAM is accessed today.  In fact, as Intel testified at the summit, this process is already underway at the protocol level.
The one remaining variable in performance is the slower speed of flash memory.  Therefore, existing in-memory databases and the like will not optimize the new flat-memory systems out of the box.  The real challenge will be to identify the amount of flash that needs to be used by the CPU to maximize performance for any given task, and then use the rest for longer-term storage, in much the same way that disk is used now.  For the very largest databases, of course, disk will be a second storage tier.
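A trivial sketch of the kind of calculation involved, with invented capacities and a made-up notion of a per-task "working set," might look like the following; the real optimization problem is of course far subtler.

```python
# Toy sketch of splitting a flash budget between "extended memory" and storage
# for a given task; capacities and the working-set notion are invented.
RAM_BYTES = 1 * 2**40        # 1 TB of main memory
FLASH_BYTES = 1 * 2**50      # 1 PB of flash

def plan_tiers(working_set_bytes):
    """Give the task's hot working set flat-memory treatment; the rest of flash stays storage."""
    spillover = max(0, working_set_bytes - RAM_BYTES)
    flash_as_memory = min(FLASH_BYTES, spillover)
    flash_as_storage = FLASH_BYTES - flash_as_memory
    return flash_as_memory, flash_as_storage

mem, store = plan_tiers(working_set_bytes=5 * 2**40)    # a 5 TB working set
print(f"flash as memory: {mem / 2**40:.0f} TB; flash as storage: {store / 2**40:.0f} TB")
```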

The Architecture:  Bridging Operational Fast and Analytical Big

Perhaps memSQL was the presenter at the conference who put the problem most pithily:  in their experience, users have been moving from SQL to NoSQL, and are now moving from NoSQL back towards SQL.  The reason is that for deeper analytical processing of data whose value lies primarily in its being Big (e.g., much social-media data), SQL and relational/columnar databases work better, while for Big Data whose value lies primarily in its being fresh (and which therefore needs to be processed Fast), SQL software imposes unacceptable performance overhead.  Users will need both, and therefore will need an architecture that includes both Hadoop and SQL/analytical data processing. 
One approach would treat each database on either side as a "framework", which would be applied to transactional, analytical, or in-between tasks depending on its fitness for these tasks.  That, to me, is a "bridge too far", introducing additional performance overhead, especially at the task assignment stage.  Rather, I envision something more akin to a TP monitor, streaming sensor data to a choice among transactional databases (at present, mostly associated with Hadoop), and analytical data to a choice among other analytical databases.  I view the focus of presenters such as Redis Labs on the transactional side and SAP and Oracle on the analytical side as an indication that my type of architecture is at least a strong possibility.
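A minimal sketch of the TP-monitor-like front end I have in mind, entirely my own strawman with invented client classes standing in for whatever operational and analytical engines an enterprise actually runs, might look like this:

```python
# Strawman of a TP-monitor-like router: fresh sensor events go straight to an
# "operational" (NoSQL/Hadoop-side) store and are also batched for a
# SQL/columnar analytical store.  Both client classes are invented stand-ins.
import time

class OperationalStore:                 # stand-in for a Hadoop-side/NoSQL engine
    def write(self, event):
        print("operational write:", event)

class AnalyticalStore:                  # stand-in for a SQL/columnar engine
    def bulk_load(self, batch):
        print(f"analytical bulk load of {len(batch)} events")

class Router:
    def __init__(self, operational, analytical, batch_size=1000):
        self.operational = operational
        self.analytical = analytical
        self.batch, self.batch_size = [], batch_size

    def ingest(self, event):
        event["ingested_at"] = time.time()
        self.operational.write(event)            # Fast path: real-time operational work
        self.batch.append(event)                 # Big path: defer to bulk analytics
        if len(self.batch) >= self.batch_size:
            self.analytical.bulk_load(self.batch)
            self.batch = []

router = Router(OperationalStore(), AnalyticalStore(), batch_size=2)
router.ingest({"sensor": "car-17", "speed_mph": 63})
router.ingest({"sensor": "car-17", "speed_mph": 0})   # triggers a bulk load
```

The point of the sketch is that routing happens once, up front, at streaming speed, rather than through a per-task framework-selection layer.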

The Infrastructure:  If This Goes On …

One science fiction author once defined most science fiction as discussing one of three “questions”:  What if? If only …, and If this goes on …  The infrastructure today for the new units and architecture is clearly “the cloud” – public clouds, private clouds, hybrid clouds.  With the steady penetration of Hadoop into enterprises, all of these are now reasonably experienced in supporting both Hadoop and SQL data processing.  And yet, if this goes on …
The Internet of Things is not limited to stationary “things”.  On the contrary, many of the initial applications involve mobile smartphones and mobile cars and trucks.  A recent NPR examination of car technology noted that cars are beginning to communicate not only with the dealer/manufacturer and the driver but also with each other, so that, for example, they can warn of a fender-bender around the next curve.  These applications require Fast Data, real-time responses that use the cars’ own databases and flat memory for real-time sensor processing and analytics.  As time goes on, these applications should become more and more frequent, and more and more disconnected from today’s clouds.  If so, that would mean the advent of the mobile cloud as an alternative and perhaps dominant infrastructure for the new systems and architecture.
Perhaps this will never happen.  Perhaps someone has already thought of this.  If not, folks:  You heard it here first.