In a somewhat odd dream the other night, I imagined myself
giving potloads of targeted money to some major university a la Bill Gates, and
I was choosing to spend it on basic research in computing. What, I asked myself, was worthy of basic
research that had not been probed to death already?
Obvious candidates relevant to the frontiers of the
computing market these days include artificial intelligence and speech
recognition. These, however, are imho
well-plowed fields already, having received major basic-research attention at
least since Minsky's work in the 1970s (AI) and the government's Cold War need for
translation of foreign documents in the 1960s (natural-language parsing). The result of 40-plus years of basic research has
been an achingly slow translation of these into something useful (Watson, Siri,
and their ilk), so that most of the action now is in applied rather than basic
research. So, I said in my dream, I'm
not going down that rathole.
Now, I am really not up to date on what universities are
doing in basic computer-science research, but I do get some gleanings from the
prizes awarded to Comp Sci professors at a couple of universities. So I would like to suggest two new areas
where I wonder whether basic research could really make a difference in the medium
term and allow computing products to do much better. Here they are:
1. Recursive computing; and
2. Causality in analytics.
Recursive Computing
Back in the early 1970s, “theory of algorithms and
computing” provided some very valuable insights into the computation times of
many key computer tasks, such as sorting and solving systems of linear equations. One hot topic of ongoing research was
figuring out whether tasks that can be done non-deterministically (loosely speaking,
by exploring all the possibilities in parallel) in polynomial time, that is, n raised
to some fixed power times a constant (where n is the number of data points used
by the task), can also be done deterministically, one step at a time, in polynomial
time. In math jargon, this was known as
the P(olynomial) = N(on-deterministic) P(olynomial) Problem. Note that at the time, a task that must take
exponential time was effectively undoable for all but the smallest cases.
It turned out that several useful tasks fit in the category
of those that might possibly be solvable quickly.
For example, the traveling salesman problem seeks the route through n points
that minimizes total travel time.
If and only if P=NP, the traveling salesman problem could be solved exactly
for any case in polynomial time. The last time I checked, in the 1980s, the
P=NP problem had not been solved (it still hasn't been), but “good enough” approximations
had been identified that got close enough to the right answer to be somewhat
satisfactory.
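To make concrete why those approximations matter, here is a small sketch of my own (not from any particular research program, and with made-up points): an exact traveling-salesman solver that tries every possible route, next to the simple nearest-neighbor heuristic, which is one flavor of “good enough” approximation.

```python
# A toy illustration (mine, not from the original research): brute force tries
# all (n-1)! routes and blows up quickly, while the nearest-neighbor heuristic
# runs in roughly O(n**2) time and usually lands close to the best answer.
from itertools import permutations
import math
import random

def route_length(points, order):
    """Total length of a closed tour visiting points in the given order."""
    return sum(math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
               for i in range(len(order)))

def tsp_brute_force(points):
    """Exact answer: check every possible route starting from point 0."""
    n = len(points)
    best = min(permutations(range(1, n)),
               key=lambda rest: route_length(points, (0,) + rest))
    return route_length(points, (0,) + best)

def tsp_nearest_neighbor(points):
    """'Good enough' heuristic: always go to the closest unvisited point."""
    unvisited = set(range(1, len(points)))
    tour, current = [0], 0
    while unvisited:
        current = min(unvisited, key=lambda j: math.dist(points[current], points[j]))
        unvisited.remove(current)
        tour.append(current)
    return route_length(points, tuple(tour))

if __name__ == "__main__":
    random.seed(1)
    pts = [(random.random(), random.random()) for _ in range(9)]
    print("exact:    ", round(tsp_brute_force(pts), 3))
    print("heuristic:", round(tsp_nearest_neighbor(pts), 3))
```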
Recursion is, briefly, the solution of a problem of size n
by combining the solutions of the same problem at smaller sizes, say, n-1 or
n/2. For example, one can solve a
sorting problem of size n by sorting two lists of size n/2 and then merging the
two sorted lists, comparing them piece by piece. Each of those lists can, in turn, be sorted by
splitting it into two lists of size n/4, and so on down to lists of size 1 or 2. If all of this is done sequentially, the
time is O(n log n). If it is done in
parallel, however, with n processors, the time can be brought down to O(log n). That's a big speedup through parallelism,
but it's not the kind of parallelism the P=NP question is about.
In practical terms, you simply can't pack processors arranged in a tree
structure next to each other without the time to talk from one
processor to another becoming longer and longer. I estimated the crossover point, beyond which
adding processors is no longer of use, at somewhere around 10**6 (a million) to 10**9 (a
billion) processors.
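Here is a minimal sequential sketch of that divide-and-conquer sort (my own illustration, not anyone's production code): split the list in half, sort each half recursively, and merge the two sorted halves piece by piece, which gives the O(n log n) behavior described above.

```python
# A minimal recursive merge sort: solve the problem at size n by solving it
# twice at size n/2 and then combining (merging) the results.
def merge_sort(items):
    if len(items) <= 1:                 # lists of size 0 or 1 are already sorted
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])      # same problem at size n/2 ...
    right = merge_sort(items[mid:])     # ... solved twice, then combined
    return merge(left, right)

def merge(left, right):
    """Combine two sorted lists by comparing them piece by piece."""
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([5, 3, 8, 1, 9, 2]))   # [1, 2, 3, 5, 8, 9]
```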
In the years since the 1980s, this kind of analysis seemed
irrelevant to speeding up computing.
Other techniques, such as pipelining and massively parallel (but not
recursive) arrays of PCs, seemed to offer better ways to gain performance. But two recent developments suggest to me
that it may be time to dust off and modernize recursion research:
1. Fast Data depends on Apache Spark, and the model of Apache Spark is roughly one processor per partition of the data stream, applied to a humongous flat-memory data storage architecture (a cluster of PCs). In other words, we can achieve real-time transaction processing and initial analytics by completely parallel application of thousands to millions of local PCs, followed by recombination of the results (see the sketch after this list). There seems a good case to be made that “divide and conquer” here will yield higher performance than mindless pipelining.
2. Quantum computing has apparently proved its ability to handle basic computer operations and solve certain problems. As I understand it, quantum computing data storage (via qubits) is completely parallel, and is not bounded by distance (what Einstein apparently referred to as “spooky [simultaneous] action [by two entangled quantum objects] at a distance [that apparently could be quite large]”). Or, to put it another way, in quantum computing it is, loosely speaking, as if P=NP.
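As a toy illustration of the partition-then-combine pattern I have in mind for item 1 (my own sketch, not actual Spark code; the chunk size and the "analytics" are made up), here is a version using Python's multiprocessing pool: each worker handles one partition of the data, and the partial results are recombined at the end.

```python
# A hypothetical partition-then-combine sketch: split the incoming data into
# chunks, let a pool of workers process each chunk in parallel, then recombine
# the partial results. This is the divide-and-conquer shape of the Spark model.
from multiprocessing import Pool

def local_analytics(chunk):
    """Stand-in for the per-partition work one node would do."""
    return {"count": len(chunk), "total": sum(chunk)}

def combine(partials):
    """Recombine the partial results from every partition."""
    return {"count": sum(p["count"] for p in partials),
            "total": sum(p["total"] for p in partials)}

if __name__ == "__main__":
    stream = list(range(1_000_000))
    chunks = [stream[i:i + 100_000] for i in range(0, len(stream), 100_000)]
    with Pool() as pool:
        partials = pool.map(local_analytics, chunks)   # one worker per partition
    print(combine(partials))   # {'count': 1000000, 'total': 499999500000}
```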
Whether this means recursion will be useful again, I don’t
know. But it seems to me worth the
effort.
Causality in Analytics
One of the more embarrassing failures of statistics in
recent years was in handling the tobacco controversy. It seemed plain from the data that tobacco
smoke was causing cancer, both first-hand and second-hand, but the best statistics
could apparently do was to establish a correlation, which could mean that
tobacco smoke caused cancer, or that a genetic tendency to cancer caused one to
smoke, or that lung cancer and tobacco use increased steadily because of other
factors entirely. It was only when the
biological effects of tobacco smoke on the lungs were traced that a clear causal path
could be projected. In effect,
statistics could say nothing useful about causality without a separate
scientific explanation.
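To make the correlation-versus-causation point concrete, here is a toy simulation of my own (the variables are invented, and it is not tied to any real tobacco data): a hidden common cause drives both an "exposure" series and a "disease" series, which then correlate strongly even though neither one causes the other.

```python
# A toy confounder simulation (made-up variables): a hidden factor drives both
# observed series, so they correlate strongly with zero direct causal link.
# Requires Python 3.10+ for statistics.correlation.
import random
import statistics

random.seed(42)
n = 10_000
hidden = [random.gauss(0, 1) for _ in range(n)]           # unobserved common cause
exposure = [h + random.gauss(0, 0.5) for h in hidden]     # driven by the hidden factor
disease  = [h + random.gauss(0, 0.5) for h in hidden]     # also driven by the hidden factor

r = statistics.correlation(exposure, disease)
print(f"correlation = {r:.2f}")   # roughly 0.8, despite no causal arrow between them
```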
A recent jaunt through Wikipedia in search of “causality”
confirmed my concerns about the present state of statistics’ ability to
identify causality. There were plenty of
philosophical attempts to say what causality was, but there was no clear
statistical method mentioned that allowed early identification of causality. Moreover, there seemed to be no clear way of
establishing anything between correlation/linear regression and pure
causality.
If any modern area would seem to offer a promise of
something better, it would be business analytics. After all, the name of the game today in most
cases is understanding the customer, in aggregate and individually. That understanding also seeks to foster a
long-term relationship with key customers.
And therefore, distinguishing between the customer who spends a lot
but leaves once dissatisfied and the customer who spends less but is more
likely to stay for the long term (as a recent Sloan Management Review article
pointed out, iirc) can be mission-critical.
The reason basic research would seem likely to yield new
insights into causality is that one key component of doing better is “domain
knowledge”. Thus, Amazon recently noted
that I was interested in climate change, and then proceeded to recommend not
one but two books by climate change deniers.
Had the analytics been able to use something like IBM’s Watson, they
might have deduced that I was interested in climate change because I was highly
concerned about it, not because I was a paranoid conspiracy theorist who
thought climate change was a plot to rob me of my hard-earned money. And basic research that could establish
better causal models should also be able to enhance domain knowledge, so that
analytics can establish the degree of confidence in causality that
is appropriate in a particular case and avoid data-mining bias (the fact that trying
out too many models increases the chances of choosing an untrue one).
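And as a small illustration of the data-mining bias I mean (again my own toy example, not drawn from any real analytics product): test enough candidate predictors that are pure noise against a pure-noise outcome, and a predictable fraction of them will look "significant" anyway.

```python
# A toy multiple-comparisons demo: 200 candidate predictors that are pure noise,
# tested against a pure-noise outcome. At the usual 5% threshold, roughly 10 of
# them will look "significant" purely by chance.
# Requires Python 3.10+ for statistics.correlation.
import random
import statistics

random.seed(0)
n_rows, n_predictors = 100, 200
outcome = [random.gauss(0, 1) for _ in range(n_rows)]

false_hits = 0
for _ in range(n_predictors):
    predictor = [random.gauss(0, 1) for _ in range(n_rows)]
    r = statistics.correlation(predictor, outcome)
    # crude significance check: |r| above ~2/sqrt(n) is roughly the 5% cutoff
    if abs(r) > 2 / n_rows ** 0.5:
        false_hits += 1

print(f"{false_hits} of {n_predictors} useless predictors look 'significant'")
```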
Envoi
I expect that neither of these basic research topics will
actually ever be plumbed. Still, that’s my
holiday wish list. And now I can stop
worrying about that dream.