So why was I, a database guy, attending last Monday what turned out to be the first ever In-Memory Computing Summit – a subject that, if anything, would seem to relate more to storage tiering? And why were heavy-hitter companies like Intel, SAP, and Oracle, not to mention TIBCO and Hitachi Data Systems, front and center at this conference? Answer: the surge in flash-memory capacity and price/performance compared to disk, plus the advent of the Internet of Things and the sensor-driven web, is driving a major change in the software we need, both in the area of analytics and operational processing. As one presentation put it, software needs to move from supporting Big Data to enabling Fast (but still Big) Data.
In this series of blog posts, I aim to examine these major changes as laid out in the summit, as well as their implications for databases and computing IT. In the first post, I’d like to sketch out an overall “vision”, and then in later posts explore the details of how the software to support this is beginning to arrive.
The Unit: The “Flat-Memory” System

In an architecture that can record and process massive streams of "sensor" data (including data from mobile phones and from hardware generating information for the Internet of Things), there is a premium on "stream" processing of incoming data in real time, and on transactional writes in addition to the reads and queries of analytical processing. The norm for systems handling this load, in the new architecture, is two main tiers: main-memory RAM and "flash" or non-volatile memory/NVRAM, with the flash tier roughly three orders of magnitude larger. This may seem like hyperbole when we are talking about Big Data, but in point of fact one summit participant noted a real-world system using 1 TB of main memory and 1 PB of flash.
Fundamentally, flash memory is like main-memory RAM: more or less all addresses in the same tier take an equal amount of time to read or change. In that sense, both tiers of our unitary system are "flat memory", unlike disk, whose performance has been fine-tuned for many years but still varies widely depending on the data's position on the platter. To ease flash's first introduction, vendors gave it interfaces that mimic disk accesses, which makes flash's data access both variable and slow compared to a flat-memory interface. For the most part, then, NVRAM in our unitary system will shed this performance-clogging software and access flash in much the same way that main-memory RAM is accessed today. In fact, as Intel testified at the summit, this process is already underway at the protocol level.
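A rough way to see the difference is with Python's `mmap` on an ordinary file, standing in for a byte-addressable NVRAM region; the file, offsets, and sizes here are purely illustrative, not anything described at the summit:

```python
import mmap
import os
import tempfile

# A scratch file standing in for a flash-backed region.
path = os.path.join(tempfile.mkdtemp(), "nvram.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)  # one 4 KB "page"

with open(path, "r+b") as f:
    # Block-style access: seek to an offset, then transfer a chunk --
    # the way a disk-mimicking flash interface is driven.
    f.seek(1024)
    f.write(b"block")
    f.flush()

    # Flat-memory access: map the region and address individual bytes
    # directly, the way RAM (and memory-mapped NVRAM) is addressed.
    mem = mmap.mmap(f.fileno(), 0)
    mem[2048:2053] = b"flatX"          # byte-addressable store
    assert mem[1024:1029] == b"block"  # same data, no seek/transfer step
    mem.close()
```

Real persistent-memory interfaces skip the file-system and block layers entirely; the sketch only illustrates that byte addressing removes the seek-and-transfer model that made early flash "variable and slow".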
The one remaining variable in performance is the slower speed of flash memory. As a result, existing in-memory databases and the like will not be optimal on the new flat-memory systems out of the box. The real challenge will be to identify how much flash should serve the CPU as working memory to maximize performance for any given task, and then use the rest for longer-term storage, in much the same way that disk is used now. For the very largest databases, of course, disk will be a second storage tier.
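One way to picture that placement problem: given per-block access counts, keep the hottest working set in RAM, the next-hottest in CPU-addressable flash, and spill the rest to flash-as-storage. The tier capacities and the greedy-by-heat policy below are illustrative assumptions, not a policy anyone presented:

```python
# Toy data placement for a two-tier flat-memory system.
# Capacities (in blocks) are made-up numbers; a real system would
# size the "fast flash" tier per workload.
RAM_BLOCKS = 2
FAST_FLASH_BLOCKS = 4

def place(access_counts):
    """Assign each block to 'ram', 'flash', or 'storage' by access heat."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    placement = {}
    for i, block in enumerate(ranked):
        if i < RAM_BLOCKS:
            placement[block] = "ram"
        elif i < RAM_BLOCKS + FAST_FLASH_BLOCKS:
            placement[block] = "flash"
        else:
            placement[block] = "storage"
    return placement

counts = {"a": 90, "b": 70, "c": 50, "d": 40, "e": 30, "f": 20, "g": 5}
plan = place(counts)
# The two hottest blocks land in RAM, the next four in fast flash.
assert plan["a"] == "ram" and plan["b"] == "ram"
assert plan["g"] == "storage"
```

The interesting engineering is in what replaces the static access counts: a workload-aware optimizer that re-draws the RAM/flash boundary per task is exactly what today's in-memory databases lack.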
The Architecture: Bridging Operational Fast and Analytical Big

Perhaps memSQL was the presenter at the conference who put the problem most pithily: in their experience, users have been moving from SQL to NoSQL, and are now moving from NoSQL back toward SQL. The reason is that for deeper analytical processing of data whose value is primarily that it's Big (e.g., much social-media data), SQL and relational/columnar databases work better, while for Big Data whose value is primarily that it's fresh (and therefore needs to be processed Fast), SQL software imposes unacceptable performance overhead. Users will need both, and therefore will need an architecture that includes both Hadoop and SQL/analytical data processing.
One approach would treat each database on either side as a "framework", which would be applied to transactional, analytical, or in-between tasks depending on its fitness for these tasks. That, to me, is a "bridge too far", introducing additional performance overhead, especially at the task assignment stage. Rather, I envision something more akin to a TP monitor, streaming sensor data to a choice among transactional databases (at present, mostly associated with Hadoop), and analytical data to a choice among other analytical databases. I view the focus of presenters such as Redis Labs on the transactional side and SAP and Oracle on the analytical side as an indication that my type of architecture is at least a strong possibility.
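A minimal sketch of that TP-monitor-style front end: each incoming record is routed either to a fast transactional store or to an analytical store. The routing rule (a per-record `kind` field) and both store classes are hypothetical placeholders, not any vendor's API:

```python
from collections import defaultdict

class TransactionalStore:
    """Stand-in for a fast key-value engine on the transactional side."""
    def __init__(self):
        self.data = {}
    def put(self, key, record):
        self.data[key] = record

class AnalyticalStore:
    """Stand-in for a columnar/SQL engine on the analytical side."""
    def __init__(self):
        self.columns = defaultdict(list)
    def append(self, record):
        for field, value in record.items():
            self.columns[field].append(value)

class Dispatcher:
    """TP-monitor-style router: Fast data to the transactional store,
    everything else to the analytical store."""
    def __init__(self, txn, olap):
        self.txn, self.olap = txn, olap
    def handle(self, record):
        if record.get("kind") == "sensor":   # fresh, Fast data
            self.txn.put(record["id"], record)
        else:                                # Big, analytical data
            self.olap.append(record)

txn, olap = TransactionalStore(), AnalyticalStore()
router = Dispatcher(txn, olap)
router.handle({"kind": "sensor", "id": "car-17", "speed": 62})
router.handle({"kind": "social", "user": "u1", "text": "hello"})
assert "car-17" in txn.data
assert olap.columns["user"] == ["u1"]
```

The key design choice, as against the "framework" approach, is that routing happens once per record on a trivial predicate, rather than through a general task-assignment layer with its own overhead.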
The Infrastructure: If This Goes On …

One science fiction author once defined most science fiction as discussing one of three “questions”: What if? If only … and If this goes on … The infrastructure today for the new units and architecture is clearly “the cloud” – public clouds, private clouds, hybrid clouds. With the steady penetration of Hadoop into enterprises, all of these are now reasonably experienced in supporting both Hadoop and SQL data processing. And yet, if this goes on …
The Internet of Things is not limited to stationary “things”. On the contrary, many of the initial applications involve mobile smartphones and moving cars and trucks. A recent NPR examination of car technology noted that cars are beginning to communicate not only with the dealer/manufacturer and the driver but also with each other, so that, for example, they can warn of a fender-bender around the next curve. These applications require Fast Data: real-time responses that use the cars’ own databases and flat memory for on-board sensor processing and analytics. As time goes on, these applications should become more and more frequent, and more and more disconnected from today’s clouds. If so, that would mean the advent of the mobile cloud as an alternative and perhaps dominant infrastructure for the new systems and architecture.
Perhaps this will never happen. Perhaps someone has already thought of this. If not, folks: You heard it here first.