So why was I, a database guy, attending what turned out last
Monday to be the first-ever In-Memory Computing Summit – a subject that, if
anything, would seem to relate more to storage tiering? And why were heavy-hitter companies like
Intel, SAP, and Oracle, not to mention TIBCO and Hitachi Data Systems, front
and center at this conference?
Answer: the surge in flash-memory
capacity and price/performance compared to disk, plus the advent of the
Internet of Things and the sensor-driven web, is driving a major change in the
software we need, in both analytics and operational
processing. As one presentation put it,
software needs to move from supporting Big Data to enabling Fast (but still
Big) Data.
In this series of blog posts, I aim to examine these major
changes as laid out at the summit, as well as their implications for databases
and for IT computing in general. In this first post, I'd
like to sketch out an overall "vision", and then in later posts explore the
details of how the software to support it is beginning to arrive.
The Unit: The “Flat-Memory” System
In an architecture that can record and process massive
streams of "sensor" data (including data from mobile phones and from
hardware generating information for the Internet of Things) there is a premium
on "stream" processing of incoming data in real time, as well as on
transactional writes in addition to the reads and queries of analytical
processing. The norm for systems
handling this load, in the new architecture, is two main tiers: main-memory RAM and "flash" or
non-volatile memory/NVRAM, with the flash tier approximately three orders of
magnitude larger than main memory. This may seem like hyperbole when we are
talking about Big Data, but in point of fact one summit participant noted a
real-world system using 1 TB of main memory and 1 PB of flash.
Fundamentally, flash memory is like main-memory RAM: more or less all addresses in the same tier
take an equal amount of time to read or change.
In that sense, both tiers of our unitary system are "flat
memory", unlike disk, where performance can vary widely depending on the
data's position on the platter, and where many years have been spent
fine-tuning around that variability. To ease flash's first introduction,
vendors provided interfaces to CPUs that mimic disk accesses and therefore make flash's
data access both variable and slow (compared to a flat-memory interface). Therefore, for the most part, NVRAM in our
unitary system will remove this performance-clogging software and access flash
in much the same way that main-memory RAM is accessed today. In fact, as Intel testified at the summit,
this process is already underway at the protocol level.
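To make the contrast concrete, here is a minimal sketch of the difference
between disk-style block access and flat-memory access, assuming
(hypothetically) a flash device exposed to the operating system as an ordinary
file at a made-up path; the mmap path treats the same bytes as directly
addressable memory, rather than funneling every lookup through seek-and-read
calls the way a disk interface does.

```python
import mmap

PATH = "/mnt/flash/data.bin"  # hypothetical flash-backed file

# Disk-style access: every lookup is an explicit block-style I/O call.
def read_record_blockwise(offset: int, size: int) -> bytes:
    with open(PATH, "rb") as f:
        f.seek(offset)        # position, as if moving a disk head
        return f.read(size)   # one syscall per access

# Flat-memory access: map the file once, then address bytes directly,
# the way main-memory RAM is addressed.
def read_record_flat(mm: mmap.mmap, offset: int, size: int) -> bytes:
    return mm[offset:offset + size]   # plain memory indexing, no seek

if __name__ == "__main__":
    print(read_record_blockwise(4096, 64))
    with open(PATH, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print(read_record_flat(mm, 4096, 64))
    mm.close()
```

The point of the sketch is only the shape of the interface: once the
disk-mimicking layer is gone, flash reads look like memory indexing.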
The one remaining variable in performance is the slower
speed of flash memory. Therefore,
existing in-memory databases and the like will not make optimal use of the new
flat-memory systems out of the box. The real
challenge will be to identify the amount of flash that needs to be used by the
CPU to maximize performance for any given task, and then use the rest for
longer-term storage, in much the same way that disk is used now. For the very largest databases, of course,
disk will be a second storage tier.
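As a rough illustration of that sizing challenge, here is a hedged sketch of
one possible placement heuristic: given access-frequency estimates, it packs
the hottest data into RAM, the next-hottest into a flash "performance" slice,
and everything else into the flash (or disk) capacity slice. The capacities
echo the 1 TB / 1 PB example above, but the split and the frequency-ranking
approach are my own assumptions for illustration, not anything a summit
presenter prescribed.

```python
# Hypothetical capacities, echoing the 1 TB RAM / 1 PB flash example.
RAM_BYTES  = 1 * 2**40      # 1 TB of main memory
PERF_FLASH = 100 * 2**40    # slice of flash reserved for hot working data
# Whatever doesn't fit above lands in the flash capacity tier.

def place(segments):
    """segments: list of (name, size_bytes, accesses_per_sec).
    Returns {name: tier}, packing the hottest segments into the fastest tier."""
    plan, ram_left, perf_left = {}, RAM_BYTES, PERF_FLASH
    # Rank by access density (accesses per byte) so small, hot data wins RAM.
    for name, size, freq in sorted(
            segments, key=lambda s: s[2] / s[1], reverse=True):
        if size <= ram_left:
            plan[name], ram_left = "ram", ram_left - size
        elif size <= perf_left:
            plan[name], perf_left = "perf-flash", perf_left - size
        else:
            plan[name] = "capacity-flash"
    return plan

print(place([("orders", 2**39, 50_000),          # hot, fits in RAM
             ("sensor-log", 50 * 2**40, 5_000),  # warm, performance flash
             ("archive", 500 * 2**40, 10)]))     # cold, capacity tier
```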
The Architecture: Bridging Operational Fast and Analytical Big
Perhaps memSQL put the problem most pithily at the conference: in their
experience, users have been moving from SQL to NoSQL, and now are moving from
NoSQL back towards SQL. The reason is that
for deeper analytical processing of data whose value is
primarily that it's Big (e.g., much social-media data), SQL and
relational/columnar databases are the better fit, while for Big Data whose value is
primarily that it's fresh (and therefore needs to be processed Fast), SQL
software causes unacceptable performance overhead. Users will need both, and therefore will need
an architecture that includes both Hadoop and SQL/analytical data processing.
One approach would treat each database on either side as a
"framework", to be applied to transactional, analytical, or
in-between tasks depending on its fitness for each. That, to me, is a "bridge too far",
introducing additional performance overhead, especially at the task-assignment
stage. Rather, I envision something more
akin to a TP monitor, streaming sensor data to a choice among transactional
databases (at present, mostly associated with Hadoop), and analytical data to a
choice among other analytical databases.
I view the focus of presenters such as Redis Labs on the transactional
side and SAP and Oracle on the analytical side as an indication that my type of
architecture is at least a strong possibility.
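To show the shape of the idea, here is a minimal sketch of how such a
TP-monitor-style front end might look, under my own assumptions: the
write_transactional and write_analytical functions are hypothetical stand-ins
for clients of whatever transactional and analytical databases are chosen, and
freshness is used as the routing criterion. The router classifies each
incoming record once and streams it to the right tier, rather than negotiating
"framework" fitness per task.

```python
import queue
import threading

# Hypothetical sinks standing in for the chosen transactional and
# analytical databases.
def write_transactional(record):
    print("fast path ->", record)

def write_analytical(record):
    print("big path  ->", record)

def is_fresh(record, max_age_secs=60):
    """Route by freshness: recent sensor readings need Fast handling."""
    return record["age_secs"] <= max_age_secs

def router(inbox: queue.Queue):
    """TP-monitor-style dispatcher: classify once, stream to the right tier."""
    while True:
        record = inbox.get()
        if record is None:                # shutdown sentinel
            break
        if is_fresh(record):
            write_transactional(record)   # operational, Fast Data tier
        else:
            write_analytical(record)      # deeper, analytical Big Data tier

inbox = queue.Queue()
t = threading.Thread(target=router, args=(inbox,))
t.start()
inbox.put({"sensor": "car-17", "age_secs": 2, "speed": 88})
inbox.put({"sensor": "car-17", "age_secs": 86_400, "speed": 61})
inbox.put(None)
t.join()
```

The design choice the sketch emphasizes is that routing happens once, at
ingest, which is exactly where the TP-monitor analogy avoids the per-task
assignment overhead of the "framework" approach.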
The Infrastructure: If This Goes On …
One science fiction author once defined most science fiction
as discussing one of three "questions":
"What if …?", "If only …", and "If this goes on …". The infrastructure today for the new units
and architecture is clearly “the cloud” – public clouds, private clouds, hybrid
clouds. With the steady penetration of
Hadoop into enterprises, all of these are now reasonably experienced in
supporting both Hadoop and SQL data processing.
And yet, if this goes on … The Internet of Things is not limited to stationary "things". On the contrary, many of the initial applications involve mobile smartphones and cars and trucks on the move. A recent NPR examination of car technology noted that cars are beginning to communicate not only with the dealer/manufacturer and the driver but also with each other, so that, for example, they can warn of a fender-bender around the next curve. These applications require Fast Data: real-time responses that use the cars' own databases and flat memory for real-time sensor processing and analytics. As time goes on, these applications should become more and more frequent, and more and more disconnected from today's clouds. If so, that would mean the advent of the mobile cloud as an alternative and perhaps dominant infrastructure for the new systems and architecture.
Perhaps this will never happen. Perhaps someone has already thought of
this. If not, folks: You heard it here first.