Tuesday, December 15, 2015

Two Humble Suggestions For Basic Research in Computer Science

In a somewhat odd dream the other night, I imagined myself giving potloads of targeted money to some major university a la Bill Gates, and I was choosing to spend it on basic research in computing.  What, I asked myself, was worthy of basic research that had not been probed to death already?

Obvious candidates relevant to the frontiers of the computing market these days include artificial intelligence and speech recognition.  These, however, imho are well-plowed fields already, having received major basic-research attention at least since Minsky and his colleagues in the 1960s and 1970s (AI) and since the government’s Cold War need for machine translation of foreign documents (natural-language processing) in the 1960s.  The result of 40-plus years of basic research has been an achingly slow translation of these into something useful (Watson, Siri, and their ilk), so that most of the action now is in applied rather than basic research.  So, I said in my dream, I’m not going down that rathole.

Now, I am really not up to date on what universities are doing in basic computer research, but I do get some gleanings from the prizes awarded to Comp Sci professors at a couple of universities.  So I would like to suggest two new areas where I wonder whether basic research could really make a difference in the medium term, and allow computing products to do much better.  Here they are:
1. Recursive computing; and
2. Causality in analytics.

Recursive Computing

Back in the early 1970s, “theory of algorithms and computing” provided some very valuable insights into the computation times of many key computer tasks, such as sorting a list and solving a matrix.  One hot topic of ongoing research was figuring out whether tasks that can be done non-deterministically (loosely speaking, by exploring all possible guesses in parallel) in polynomial time, meaning time bounded by some constant times a fixed power of n such as n**2 or n**3 (where n is the number of data points used by the task), can also be done deterministically, one step at a time, in polynomial time.  In math jargon, this is known as the P(olynomial) = N(on-deterministic) P(olynomial) problem.  Note that at the time, as now, a task that must take exponential time was effectively undoable for all but the smallest cases.

It turned out that many useful tasks fall into this category of problems that are easy to solve non-deterministically but, so far as anyone knows, hard to solve deterministically.  For example, the traveling salesman problem seeks the route through n points that minimizes total travel time.  If and only if P=NP could the traveling salesman problem be solved exactly, for any case, in polynomial time.  The last time I checked, in the 1980s, the P=NP problem had not been solved (it still hasn’t been), but “good enough” approximation algorithms had been identified that come close enough to the optimal answer to be useful in practice.
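
To make the contrast concrete, here is a small Python sketch of my own (purely illustrative, with made-up random city coordinates): an exact brute-force traveling-salesman search, which blows up factorially and becomes hopeless beyond a dozen or so points, next to a simple nearest-neighbor heuristic of the “good enough” variety.

```python
# Toy illustration (mine, not from any particular paper): exact traveling-salesman
# search blows up factorially, while a simple "good enough" heuristic stays fast.
import itertools
import math
import random

def route_length(route, coords):
    """Total length of a closed tour visiting the points in 'route' order."""
    return sum(math.dist(coords[route[i]], coords[route[(i + 1) % len(route)]])
               for i in range(len(route)))

def exact_tsp(coords):
    """Brute force: try every permutation -- O(n!) work, hopeless beyond ~10 points."""
    cities = list(range(len(coords)))
    best = min(itertools.permutations(cities[1:]),  # fix city 0 to skip duplicate rotations
               key=lambda perm: route_length([0, *perm], coords))
    return [0, *best]

def nearest_neighbor_tsp(coords):
    """Greedy approximation: always hop to the closest unvisited city -- fast, not optimal."""
    unvisited = set(range(1, len(coords)))
    route = [0]
    while unvisited:
        nxt = min(unvisited, key=lambda c: math.dist(coords[route[-1]], coords[c]))
        route.append(nxt)
        unvisited.remove(nxt)
    return route

if __name__ == "__main__":
    random.seed(0)
    coords = [(random.random(), random.random()) for _ in range(9)]
    print("exact tour length :", round(route_length(exact_tsp(coords), coords), 3))
    print("greedy tour length:", round(route_length(nearest_neighbor_tsp(coords), coords), 3))
```

The greedy tour is not guaranteed to be optimal, but it finishes almost instantly even for thousands of points, which is exactly the trade-off the approximation work was about.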

Recursion is, briefly, the solution of a problem of size n by combining solutions of the same problem at smaller sizes, say, n-1 or n/2.  For example, one can sort a list of size n by sorting two lists of size n/2 and then merging the two sorted halves, piece by piece.  Each half can, in turn, be sorted by splitting it into two lists of size n/4, and so on down to lists of size 1 or 2.  If all of this is done sequentially, the time is O(n log n).  If it is done in parallel, with enough processors working on the sub-lists at once, the time drops to roughly O(log n) (or O((log n)**2) with a simpler merging scheme).  That’s a big speedup through parallelism, but it’s not the kind of parallelism the P=NP question is about.  In practical terms, you simply can’t pack the processors in a tree structure next to each other without the time to talk from one processor to another growing longer and longer.  I estimated the crossover point, where more parallelism is no longer of use, at somewhere around 10**6 (a million) to 10**9 (a billion) processors.
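
Here is a minimal sketch, in Python, of the recursion I have in mind (a plain merge sort, written by me for illustration rather than taken from anywhere in particular):

```python
# Minimal sketch of the recursion described above: sort two half-size lists,
# then merge them piece by piece.  Done sequentially this is the familiar
# O(n log n) merge sort.
def merge_sort(items):
    if len(items) <= 1:                 # lists of size 0 or 1 are already sorted
        return items
    mid = len(items) // 2
    left = merge_sort(items[:mid])      # solve the same problem at size n/2 ...
    right = merge_sort(items[mid:])     # ... twice (these two calls could run in parallel)
    return merge(left, right)           # then combine the halves in O(n)

def merge(left, right):
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):   # compare list 1 and list 2 piece by piece
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([5, 3, 8, 1, 9, 2]))   # [1, 2, 3, 5, 8, 9]
```

The two half-size calls are the pieces that could, in principle, be handed to separate processors, which is where the roughly O(log n) parallel figure comes from.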

In the years since the 1980s, this kind of analysis has seemed irrelevant to speeding up computing.  Other techniques, such as pipelining and massively parallel (but not recursive) arrays of PCs, seemed to offer better ways to gain performance.  But two recent developments suggest to me that it may be time to dust off and modernize recursion research:
1. Fast Data increasingly depends on Apache Spark, and the Spark model is essentially one processor per slice of the data stream, applied to a humongous in-memory data store spread across a cluster of PCs.  In other words, we can achieve real-time transaction processing and initial analytics by completely parallel application of thousands to millions of local processors followed by recombination of the results.  There seems a good case to be made that “divide and conquer” here will yield higher performance than mindless pipelining (see the sketch after this list).
2. Quantum computing has apparently begun to prove its ability to carry out real computer operations and solve real problems.  As I understand it, quantum data storage (via qubits) is inherently parallel, and entanglement is not bounded by distance (what Einstein referred to as “spooky action at a distance” between two entangled quantum objects, at distances that apparently can be quite large).  Or, to put it loosely, quantum computing may give us something much closer to the free parallelism that P=NP would imply, although whether quantum computers can efficiently solve the hardest NP problems is, as far as I know, still an open question.
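
To illustrate point 1, here is a rough sketch of the partition-then-recombine pattern using nothing but the Python standard library.  It is emphatically not Spark itself, just a toy model of the same “divide and conquer” idea: each worker handles its own slice of the data independently, and a cheap sequential step combines the results.

```python
# Rough sketch (standard library only, not Spark) of the "divide and conquer"
# pattern: partition the data, let each worker sort its slice in parallel,
# then recombine the partial results.
from heapq import merge             # k-way merge of already-sorted sequences
from multiprocessing import Pool
import random

def sort_partition(chunk):
    return sorted(chunk)            # local work, done independently on each slice

def parallel_sort(data, workers=4):
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        sorted_chunks = pool.map(sort_partition, chunks)   # parallel "map" step
    return list(merge(*sorted_chunks))                     # sequential "combine" step

if __name__ == "__main__":          # guard required by multiprocessing on some platforms
    data = [random.randint(0, 10**6) for _ in range(100_000)]
    assert parallel_sort(data) == sorted(data)
    print("parallel sort matches the sequential result")
```

The parallel “map” step does the bulk of the work; the final merge is the recursive “combine” step, and it is cheap relative to the sorting.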

Whether this means recursion will be useful again, I don’t know.  But it seems to me worth the effort.

Causality In Analytics

One of the more embarrassing failures of statistics in recent years was in handling the tobacco controversy.  It seemed plain from the data that tobacco smoke was causing lung cancer, both first-hand and second-hand, but the best statistics could apparently do was to establish a correlation, which could mean that tobacco smoke caused cancer, or that a genetic tendency toward cancer also caused one to smoke, or that lung cancer and tobacco use both increased steadily because of other factors entirely.  It was only when the biological effects of tobacco smoke on the lungs were traced that a clear causal path could be established.  In effect, statistics could say nothing useful about causality without a separate scientific explanation.

A recent jaunt through Wikipedia in search of “causality” confirmed my concerns about the present state of statistics’ ability to identify causality.  There were plenty of philosophical attempts to say what causality was, but no clear statistical method was mentioned that allows early identification of causality.  Moreover, there seemed to be no clear way of establishing anything between correlation/linear regression and pure causality.
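
To show what I mean about correlation falling short, here is a toy simulation of my own (the numbers are made up and have nothing to do with the real epidemiology) in which two very different causal stories produce the same kind of statistical signal:

```python
# Toy simulation (made-up numbers, purely illustrative): two different causal
# stories produce a similar positive correlation, which is why correlation
# alone could not settle the tobacco question.
import random

def corr(xs, ys):
    """Plain Pearson correlation, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(1)
N = 50_000

# World A: smoking directly raises cancer risk.
smoke_a = [random.random() < 0.3 for _ in range(N)]
cancer_a = [random.random() < (0.15 if s else 0.05) for s in smoke_a]

# World B: a hidden factor raises both the tendency to smoke and cancer risk;
# smoking itself does nothing.
hidden = [random.random() < 0.3 for _ in range(N)]
smoke_b = [random.random() < (0.6 if h else 0.15) for h in hidden]
cancer_b = [random.random() < (0.15 if h else 0.05) for h in hidden]

print("World A (causal)     corr:", round(corr(smoke_a, cancer_a), 3))
print("World B (confounded) corr:", round(corr(smoke_b, cancer_b), 3))
# Both worlds show a clear positive correlation between smoking and cancer;
# the correlation by itself cannot tell the two stories apart.
```

Only outside knowledge, the equivalent of tracing the biology, or a controlled experiment can distinguish World A from World B.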

If any modern area would seem to offer the promise of something better, it would be business analytics.  After all, the name of the game today in most cases is understanding the customer, in aggregate and individually.  That understanding also seeks to foster a long-term relationship with key customers.  And therefore, distinguishing between the customer who spends a lot but leaves the moment he or she is dissatisfied and the customer who spends less but is more likely to stick around as a long-term source of sales (as a recent Sloan Management Review article pointed out, iirc) can be mission-critical.

The reason basic research would seem likely to yield new insights into causality is that one key component of doing better is “domain knowledge”.  Thus, Amazon recently noted that I was interested in climate change, and then proceeded to recommend not one but two books by climate change deniers.  Had the analytics been able to use something like IBM’s Watson, it might have deduced that I was interested in climate change because I was highly concerned about it, not because I was a paranoid conspiracy theorist who thought climate change was a plot to rob me of my hard-earned money.  Basic research that could establish better causal models should also be able to enhance domain knowledge, so that analytics could attach an appropriate degree of confidence to a causal claim in a particular case and avoid data-mining bias (the fact that trying out too many models increases the chances of choosing an untrue one).
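
And here is a toy demonstration, again of my own construction, of the data-mining bias I mean: test enough random, meaningless “factors” against an outcome and the best of them will look convincingly correlated even though none of them means anything.

```python
# Toy demonstration of data-mining bias: with enough candidate factors, the
# best-looking one appears "significant" purely by chance.
import random

def corr(xs, ys):
    """Plain Pearson correlation, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(42)
n_customers, n_factors = 100, 500
outcome = [random.gauss(0, 1) for _ in range(n_customers)]       # pure noise
factors = [[random.gauss(0, 1) for _ in range(n_customers)]      # also pure noise
           for _ in range(n_factors)]

best = max(abs(corr(f, outcome)) for f in factors)
print(f"Best of {n_factors} meaningless factors: |corr| = {best:.2f}")
# The winning factor typically shows a correlation around 0.3 or more, which
# looks impressive by the usual rules of thumb even though nothing here is real.
```

A causal model with some domain knowledge behind it would, one hopes, refuse to be impressed by the winning factor without a plausible story for why it should matter.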

Envoi


I expect that neither of these basic research topics will actually ever be plumbed.  Still, that’s my holiday wish list.  And now I can stop worrying about that dream.