Yesterday, I had a very interesting conversation with Mike Hoskins of Pervasive about his company’s innovative DataRush product. But this blog post isn’t about DataRush; it’s about the trends in the computer industry that I think DataRush helps reveal. Specifically, it’s about why, even though disks remain much slower than main memory, most processes – even those involving terabytes of data – are CPU-bound, not I/O-bound.
Mike suggested, if I recall correctly, that around 2006 Moore’s Law – under which the bit capacity of a chip doubled roughly every two years, and processor speed rose correspondingly – began to break down. As a result, software written on the assumption that ever-faster processors would cover all programming sins against performance (think of the way security programs lock up your data when you start up your PC) is now starting to falter, because the inevitable scaling of demands on a program is no longer matched by scaling of that program’s performance.
However, thinking about the way DataRush or Vertica achieves higher performance – DataRush by increasing parallelism within a process, Vertica by slicing relational data into columns of same-type values rather than rows of mixed-type values – suggests to me that more is going on than just “software doesn’t scale any more.” At the very high end of the database market, which I follow, the software munching on massive amounts of data has been unable to keep up with disk I/O for at least the last 15 years.
Thinking about CPU processing versus I/O, in turn, reminded me of Andrew Tanenbaum, the author of great textbooks on Structured Computer Organization and Computer Networks in the late 1970s and 1980s. Specifically, in one of his later works, he asserted that the speed of networks was growing faster than the speed of processors. Let me restate that as a Law: the speed of data in motion grows faster than the speed of computing on data at rest.
The implication of Tanenbaum’s Law plus the death of Moore’s Law, I believe, is that most computing will be CPU-bound for the foreseeable future. Think of it in terms of huge queries that review multiple terabytes of data. Stored data grows by about 60% a year, so if networking and I/O speed grew only as fast as processor speed – that is, more slowly than the data itself – we would expect the time to pull a given fraction of that data off disk into main memory to increase every year. Instead, even today’s basic SATA interfaces signal at multiple gigabits per second – faster than the clock rates of today’s microprocessors. To me, this says that disks are shoving data at processors faster than the processors can process it. And the death of Moore’s Law just makes things worse.
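To make that arithmetic concrete, here is a tiny sketch; the growth rates in it are assumptions chosen purely for illustration, not measured figures.

```java
// Illustrative arithmetic only: the growth rates below are assumptions, not
// measurements. If stored data grows ~60% a year while the bandwidth available
// to scan it grows more slowly, the time needed to scan a fixed fraction of
// the data compounds upward year over year.
class ScanTimeGrowth {
    public static void main(String[] args) {
        double dataGrowthPerYear = 1.60;       // assumed growth of stored data
        double bandwidthGrowthPerYear = 1.30;  // assumed growth of scan bandwidth
        double relativeScanTime = 1.0;         // year-0 baseline
        for (int year = 1; year <= 5; year++) {
            relativeScanTime *= dataGrowthPerYear / bandwidthGrowthPerYear;
            System.out.printf("Year %d: scan time x%.2f of baseline%n", year, relativeScanTime);
        }
    }
}
```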
The implication is that the fundamental barrier to scaling computing is not processor geometry, but the ability to parallelize the two key “at rest” tasks of the processor: storing the data in main memory, and operating on it. To catch up with the growth of storage and network speed, we have to throw as many processors as we can at a task in parallel. And that, in turn, suggests that the data-flow architecture needs to be looked at again.
The concept behind today’s architecture is multiple processors running multiple processes in parallel, each process operating on a mass of (sometimes shared) data. The idea of the data-flow architecture is to split processes into unitary tasks, and then flow parallel streams of data past processors that each carry out one of those tasks. The distinction is that in one approach the focus is on parallelizing the multi-task processes that the computer carries out on a chunk of data at rest; in the other, the focus is on parallelizing the same task carried out on a stream of data.
Imagine, for instance, that we are trying to find the best salesperson in the company over the last month, with a huge sales database not already prepared for the query. In today’s approach, one process would load the sales records into main memory in chunks and, for each chunk, maintain a running count of sales for every salesperson in the company. Yes, the running count is to some extent parallelized. But the record processing often is not.
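For comparison, here is a minimal sketch (in Java) of that conventional approach; the SalesRecord type and the chunk reader are hypothetical stand-ins for the real sales database, not anyone’s actual API.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Sketch of the conventional "data at rest" approach: one thread pulls chunks
// of records off disk and maintains every salesperson's running total itself.
// SalesRecord and the chunk iterator are hypothetical stand-ins.
class ChunkedAggregation {
    record SalesRecord(String salesperson, double amount) {}

    static Map<String, Double> totals(Iterator<List<SalesRecord>> chunksFromDisk) {
        Map<String, Double> totals = new HashMap<>();
        while (chunksFromDisk.hasNext()) {
            for (SalesRecord r : chunksFromDisk.next()) {
                totals.merge(r.salesperson(), r.amount(), Double::sum);
            }
        }
        return totals;  // the "best" salesperson is the entry with the largest total
    }
}
```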
Now imagine that multiple processors are assigned the task of looking at each record as it arrives, with each processor keeping a running count for one salesperson. Not only are we speeding up access to the data coming off disk by parallelizing it; we are also speeding up the computation of the running counts beyond today’s architecture, by having multiple processors perform the count on multiple records at the same time. So the two key bottlenecks involving data at rest – accessing the data, and performing operations on it – are both lessened.
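Here is a sketch of that streaming idea. Rather than literally dedicating one processor to each salesperson, it uses a small pool of workers, each owning the running totals for a disjoint slice of salespeople chosen by hashing the name; the SalesRecord type and the incoming record stream are the same hypothetical stand-ins as before.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the data-flow idea: records stream past a pool of workers, and each
// worker owns the running totals for a disjoint slice of salespeople (chosen by
// hashing the name), so no worker ever touches another worker's counts.
// SalesRecord and the incoming record stream are hypothetical stand-ins.
class StreamingAggregation {
    record SalesRecord(String salesperson, double amount) {}

    static final SalesRecord END_OF_STREAM = new SalesRecord("", 0);

    static Map<String, Double> totals(Iterable<SalesRecord> incoming, int workers)
            throws InterruptedException, ExecutionException {
        List<BlockingQueue<SalesRecord>> queues = new ArrayList<>();
        List<Future<Map<String, Double>>> partials = new ArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            BlockingQueue<SalesRecord> q = new LinkedBlockingQueue<>();
            queues.add(q);
            partials.add(pool.submit(() -> {        // each worker counts only its own slice
                Map<String, Double> local = new HashMap<>();
                for (SalesRecord r = q.take(); r != END_OF_STREAM; r = q.take())
                    local.merge(r.salesperson(), r.amount(), Double::sum);
                return local;
            }));
        }

        for (SalesRecord r : incoming)               // data in motion: route each record
            queues.get(Math.floorMod(r.salesperson().hashCode(), workers)).put(r);
        for (BlockingQueue<SalesRecord> q : queues)
            q.put(END_OF_STREAM);

        Map<String, Double> totals = new HashMap<>();
        for (Future<Map<String, Double>> f : partials)
            f.get().forEach((name, sum) -> totals.merge(name, sum, Double::sum));
        pool.shutdown();
        return totals;
    }
}
```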
Note also that the immediate response to the death of Moore’s Law is the proliferation of multi-core chips – effectively, 4-8 processors on a chip. So a simple way of imposing a data-flow architecture over today’s approach is to have the job scheduler in a symmetric multiprocessing architecture break down processes into unitary tasks, then fire up multiple cores for each task, operating on shared memory. If I understand Mike Hoskins, this is the gist of DataRush’s approach.
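If I have the idea right, a rough analogue in plain Java is to let the runtime split one unitary task across the available cores, over data already sitting in shared memory. To be clear, this is my own illustration, not DataRush’s actual API.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the multi-core, shared-memory variant: one unitary task (summing
// sales per salesperson) is split across the available cores by a parallel
// stream over records already in memory. My own illustration, not DataRush's
// API; SalesRecord is the same hypothetical type as in the earlier sketches.
class MultiCoreAggregation {
    record SalesRecord(String salesperson, double amount) {}

    static Map<String, Double> totals(List<SalesRecord> recordsInMemory) {
        return recordsInMemory.parallelStream()
                .collect(Collectors.groupingByConcurrent(
                        SalesRecord::salesperson,
                        Collectors.summingDouble(SalesRecord::amount)));
    }
}
```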
But I would argue that, if I am correct, programmers also need to begin thinking of their programs as optimizing the processing of data flows. One could say that event-driven programming does something similar; but so far that is typically a special case, not an all-purpose methodology or tool.
Recently, to my frustration, a careless comment got me embroiled again in the question of whether Java or Ruby or whatever is a high-level language – when I strongly feel that these do poorly (if examples on Wikipedia are representative) at abstracting data-management operations and therefore are far from ideal. Not one of today’s popular dynamic, functional, or object-oriented programming languages, as far as I can tell, thinks about optimizing data flow. Is it time to merge them with LabVIEW or VEE?