Tuesday, August 31, 2010

1010data: Operating Well in a Parallel Universe?

Recently, I received a blurb from a company named 1010data, claiming that its personnel had been doing columnar databases for more than 30 years. As someone who was integrally involved at a technical level in the big era of database theory development (1975-1988), when everything from relational to versioning to distributed to inverted-list technology (the precursor to much of today’s columnar technology) first saw the light, I was initially somewhat skeptical. This wariness was heightened by marketing claims that 1010data’s data-warehousing performance was better not only than that of relational databases but also than that of competitors’ columnar databases, even though 1010data does little in the way of indexing; and that this performance advantage applied not only to ad-hoc queries with little discernible pattern, but also to many repetitive queries for which index-style optimization would seem to be the logical approach.

1010data’s marketing is not clear as to why this should be so; but after talking to them and reading their technical white paper, I have come up with a theory as to why it might be. The theory goes like this: 1010data is not living in the same universe.

That sounds drastic. What I mean is that, back in the 1980s, while the great mass of database theory and practice went one way, 1010data went another, and by now, in many cases, they really aren’t talking the same language. So what follows is an attempt to recast 1010data’s technology in terms familiar to me. Here’s the way I see it:

Since the 1980s, people have been wrestling with the problem of read and write locks on data. The problem is that if you update a datum while another person is attempting to read it, the two of you may see different values, or the reader cannot predict which value he or she will see. To avoid this, the updater can block all other access via a write lock, which in turn slows the reader down drastically; or the “query from hell” can block updaters via a read lock on all the data it touches. In a data warehouse, updates are typically held and then rushed through at certain times (end of day or week) in order to avoid locking problems. Columnar databases also sometimes provide what is called “versioning”, in which previous values of a datum are kept around, so that the updater can operate on one value while the reader operates on another.
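To make the contrast concrete, here is a minimal sketch of the versioning idea in Python. It is entirely my own illustration, not 1010data’s code or any vendor’s API: the updater appends a new version of a datum instead of overwriting it, so a reader who started earlier keeps seeing the value that was current when its query began, and nobody waits on a lock.

```python
import itertools

_clock = itertools.count(1)   # monotonically increasing "commit" counter

class VersionedCell:
    """One datum kept as a history of (commit_id, value) pairs."""
    def __init__(self, value):
        self.versions = [(0, value)]   # oldest first

    def write(self, value):
        # Updater: append a new version; readers are never blocked.
        self.versions.append((next(_clock), value))

    def read(self, as_of):
        # Reader: latest value committed at or before the reader's snapshot.
        return max((v for v in self.versions if v[0] <= as_of),
                   key=lambda v: v[0])[1]

balance = VersionedCell(100)
snapshot = 0                           # a long-running query pins this snapshot
balance.write(250)                     # an update arrives mid-query
print(balance.read(as_of=snapshot))    # the old query still sees 100
print(balance.read(as_of=balance.versions[-1][0]))   # a new query sees 250
```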

1010data provides a data warehouse/business intelligence solution as a remote service – the “database as a service” variant of SaaS/public cloud. However, 1010data’s solution does not start by worrying about locking. Instead, it worries about how to provide each end user with a consistent “slice of time” database of his/her own. It appears to do this as follows: all data is divided up into what they call “master” tables (as in “master data management” of customer and supplier records), which are smaller, and time-associated/time-series “transactional” tables, which are the really large tables.
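Here is a rough sketch of my reading of that “slice of time” idea, again in plain Python with names of my own invention rather than 1010data’s: each user session simply pins one immutable version of every table, so updates that arrive later cannot disturb what the session sees.

```python
class Warehouse:
    def __init__(self):
        # table name -> list of immutable table versions (newest last)
        self.table_versions = {}

    def publish(self, table, version):
        # A burst of updates produces a whole new version of the table.
        self.table_versions.setdefault(table, []).append(version)

    def open_session(self):
        # Pin the latest version of every table for one end user.
        return {name: versions[-1]
                for name, versions in self.table_versions.items()}

wh = Warehouse()
wh.publish("customers", ["alice", "bob"])
wh.publish("transactions", [("alice", "2010-08-30", 42.00)])
session = wh.open_session()                          # this user's consistent slice
wh.publish("customers", ["alice", "bob", "carol"])   # a later burst of updates
print(session["customers"])                          # still ['alice', 'bob']
```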

Master tables change relatively rarely, so a full copy of the table after each update (really, a “burst” of updates) can be stored on disk and loaded into main memory if an end user needs it, with little storage or processing overhead. That isn’t feasible for the transactional tables; but 1010data sees old versions of these as integral parts of the time series, not as superseded data, so the amount of “excess” data appended to a table during a session (if the maximum session length for an end user is a day) is small in all realistic circumstances. As a result, two versions of a transactional table amount to a pointer to a common ancestor plus a small “append”. That is, the storage overhead of the additional versioning data is small compared to some other columnar technologies, and not that much more than that of row-oriented relational databases.
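As I picture it (a back-of-envelope model of my own, not a description of 1010data’s internals), a new version of a large transactional table is little more than a pointer to its predecessor plus the freshly appended rows:

```python
class TransactionalTableVersion:
    """A table version = shared ancestor + a small appended segment."""
    def __init__(self, ancestor=None, appended_rows=()):
        self.ancestor = ancestor          # shared with older versions, never copied
        self.appended_rows = list(appended_rows)

    def rows(self):
        # Materialize this version: the ancestor's rows plus the new append.
        base = self.ancestor.rows() if self.ancestor else []
        return base + self.appended_rows

v1 = TransactionalTableVersion(appended_rows=[("sale", "2010-08-29", 10)])
v2 = TransactionalTableVersion(ancestor=v1,
                               appended_rows=[("sale", "2010-08-30", 12)])
print(len(v1.rows()), len(v2.rows()))   # 1 2 -- v2 stores only its own delta
```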

Now the other shoe drops: in my very rough approximation, versioning entire tables instead of individual bits of data allows you to keep those bits of data pretty much sequential on disk, hence the lack of need for indexing. It is as if each burst of updates comes with an online reorganization that restores the sequentiality of the resulting table version, so that reads during queries can almost eliminate seek time. The storage overhead means that more data must be loaded from disk; but that is more than compensated for by eliminating the need to jerk from one end of the disk to the other in order to inhale all the needed data.
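Some hypothetical arithmetic makes the trade-off vivid. The disk figures below are typical 2010-era numbers of my own choosing, not anything from 1010data, but they show why paying a little extra storage for sequential reads can win:

```python
# Back-of-envelope comparison: random index-driven seeks vs. one sequential scan.
SEEK_TIME_S = 0.008          # ~8 ms average seek on a typical spinning disk
SEQ_THROUGHPUT_BPS = 100e6   # ~100 MB/s sustained sequential read

rows = 10_000_000
bytes_per_row = 100

# Index-driven plan: assume (pessimistically) one seek per qualifying row,
# with 1% of the table qualifying.
indexed_time = rows * 0.01 * SEEK_TIME_S

# Sequential plan: scan the whole column-ordered table, even carrying,
# say, 20% versioning/storage overhead.
sequential_time = rows * bytes_per_row * 1.2 / SEQ_THROUGHPUT_BPS

print(f"indexed plan (random seeks): ~{indexed_time:.0f} s")    # ~800 s
print(f"full sequential scan:       ~{sequential_time:.0f} s")  # ~12 s
```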

So here’s my take: 1010data’s claim to better performance, as well as to competitive scalability, is credible. We live in a universe in which indexing (to minimize disk seek time) plus minimizing added storage (to minimize disk accesses in the first place) lets us push against the limits of locking constraints, so we properly appreciate columnar technology’s ability to provide additional storage savings and bit-mapped indexing that keep more data in memory. 1010data lives in a universe in which locking never happens and data is stored pretty much sequentially, so it can happily forget indexes, squander a little disk storage, and still perform better.

1010data Loves Sushi
At this point, I could say that I have summarized 1010data’s technical value-add, and move on to considering best uses. However, to do that would be to ignore another way that 1010data does not operate in the same universe: it loves raw data. It would prefer to operate on data before any detection of errors and inconsistencies, as it views these problems as important data in their own right.

As a strong proponent of improving the quality of data provided to the end user, I might be expected to disagree strongly. However, as a proponent of “data usefulness”, I feel that the potential drawbacks of 1010data’s approach are counterbalanced by some significant advantages in the real world.

In the first place, 1010data is not doctrinaire about ETL (Extract, Transform, Load) technology. Rather, 1010data allows you to apply ETL at system implementation time, to simply start with an existing “sanitized” data warehouse (although it is philosophically opposed to both of these approaches), or to apply transforms online, at the time of a query. It’s nice that skipping the transform step when you start up the data warehouse speeds implementation. It’s also nice that you have the choice of going raw or staying baked.
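A small illustration of that choice, in ordinary Python of my own rather than 1010data’s query language: the same cleanup transform can be applied once at load time, classic ETL style, or lazily inside the query, leaving the raw values, errors and all, available for inspection.

```python
raw_rows = [
    {"customer": " Alice ", "amount": "42.00"},
    {"customer": "BOB",     "amount": "N/A"},    # a data-quality problem, kept as-is
]

def transform(row):
    """Normalize a raw row; bad amounts become None rather than disappearing."""
    amount = row["amount"]
    return {
        "customer": row["customer"].strip().title(),
        "amount": float(amount) if amount.replace(".", "", 1).isdigit() else None,
    }

# Option 1: transform at load time (classic ETL) -- the warehouse holds clean rows.
loaded = [transform(r) for r in raw_rows]

# Option 2: keep raw_rows exactly as received and transform inside the query.
query_result = [t["customer"] for t in map(transform, raw_rows)
                if t["amount"] is not None]

print(loaded)
print(query_result)   # ['Alice']
```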

In the second place, data quality is not the only factor that can decrease the usefulness of data. Another key consideration is the ability of a wide array of end users to employ the warehoused data for more in-depth analysis. 1010data offers a user interface built on the Excel spreadsheet metaphor, supporting column- and time-oriented analysis (as well as an Excel add-in), and thus provides better rolling and ad-hoc time-series analysis to the wide class of business users familiar with Excel. Of course, someone else may come along and develop an equally flexible interface, although 1010data would seem to have a lead as of now; in the meantime, 1010data’s wider scope and additional analytic capabilities appear to compensate for any problems with operating on incorrect data, especially when 1010data provides features to ensure that analyses take possible incorrectness into account.
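For a flavor of the rolling time-series questions such an interface makes easy, here is a sketch using pandas, chosen purely for illustration; 1010data’s spreadsheet-style interface and Excel add-in are, of course, nothing like Python code, and the sales figures below are invented.

```python
import pandas as pd

# Hypothetical daily unit sales for one store.
sales = pd.DataFrame({
    "date": pd.date_range("2010-08-01", periods=14, freq="D"),
    "units": [12, 15, 9, 22, 18, 17, 25, 30, 11, 14, 19, 21, 16, 27],
})

# A 7-day rolling average: the sort of ad-hoc historical question an
# Excel-literate analyst might pose against a large transactional table.
sales["rolling_7d_avg"] = sales["units"].rolling(window=7).mean()
print(sales.tail())
```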

Caveat
To me, some of the continuing advantages of 1010data’s approach depend fundamentally on the idea that users of large transactional tables require ad-hoc historical analysis. To put it another way, if users really don’t need to keep historical data around for more than an hour in their databases, and require frequent updates and additions for “real-time analysis” (or online transaction processing), then tables will require frequent reorganizing and will include a lot of storage-wasting historical data, and 1010data’s performance advantages will decrease or vanish.

However, there will always be ad-hoc, in-depth queriers, and these are pretty likely to be interested in historical analysis. So while 1010data may or may not be the be-all, end-all data-warehousing database for all verticals forever, it is very likely to offer distinct advantages for particular end users, and therefore should always be a valuable complement to a data warehouse that handles vanilla querying on a “no such thing as yesterday” basis.

Conclusion
Not being in the mainstream of database technology does not mean irrelevance; not being in the same database universe can mean that you solve the same problems better. It appears that taking the road less travelled has allowed 1010data to come up with a new and very possibly improved solution to data warehousing, just as inverted-list technology resurfaced in the last few years to provide new and better columnar databases. And it is not improbable that 1010data can maintain whatever performance and ad-hoc analysis advantages it has over the next few years.

Of course, proof of these assertions in the real world is an ongoing process. I would recommend that BI/data warehousing users in large enterprises in all verticals kick the tires of 1010data – as noted, testbed implementation is pretty swift – and then performance test it and take a crack at the really tough analyst wish lists. To misquote Santayana, those who do not analyze history are condemned to repeat it – and that’s not good for the bottom line.

Monday, August 23, 2010

IBM Acquired SPSS, Intel Acquires McAfee: More Problems Than Meet the Eye?

As Han Solo noted repeatedly in Star Wars – often mistakenly – I’ve got a bad feeling about this.

Last year, IBM acquired SPSS. Since then, IBM has touted the excellence of SPSS’ statistical capabilities and their fit with the Cognos BI software. Intel has just announced that it will acquire McAfee, and touts the strength of McAfee’s security offerings and the fit with Intel’s software strategy. I don’t quarrel with the fit, nor with the strengths that are cited. But it seems to me that both IBM and Intel may (repeat, may) be overlooking problems with their acquisitions that will limit the value-add to the customer.

Let’s start with IBM and SPSS. Back in the 1970s, when I was a graduate student, SPSS was the answer for university-based statistical research. Never mind the punched cards; SPSS provided the de-facto standard software for the statistical analysis typical in those days, such as regression and t tests. Since then, it has beefed up its “what if” predictive analysis, among other things, and provided extensive support for business-type statistical research, such as surveys. So what’s not to like?

Well, I was surprised to learn, by hearsay, that among psychology grad students, SPSS was viewed as not supporting (or not providing an easy way to perform) some of the advanced statistical functions that researchers wanted, such as scatter plots, compared to SAS or Stata. This piqued my curiosity, so I tried to get onto SPSS’ web site (www.spss.com) on a Sunday afternoon to do some research on the matter. After several waits of five minutes or so for a web page to display, I gave up.

Now, this may not seem like a big deal. However, selling is about consumer psychology, and so good psychology research tools really do matter to a savvy large-scale enterprise. If SPSS really does have some deficits in advanced psychology statistical tools, then it ought to at least support the consumer by providing rapid web-site access, and it or IBM ought to at least show some signs of upgrading the in-depth psychology research capabilities that were, at least for a long time, SPSS’ “brand.” But if there were any signs of “new statistical capabilities brought to SPSS by IBM” or “upgrades to SPSS’ non-parametric statistics in version 19”, they were not obvious to me from IBM’s web site.

And, following that line of conjecture, I would be quite unconcerned, if I were SAS or Stata, that IBM had chosen to acquire SPSS. On the contrary, I might be pleased that IBM had given them lead time to strengthen and update their own statistical capabilities, so that whatever happened to SPSS sales, researchers would continue to require SAS as well as SPSS. It is not even beyond the bounds of possibility that SPSS will make IBM less of a one-stop BI shop than before, because the acquisition may open the door to further non-SPSS sales if SPSS falls further behind in advanced psych-stat tools, or continues to annoy the inquisitive customer with five-minute web-site wait times.

Interestingly, my concern about McAfee also falls under the heading of “annoying the customer.” Most of those who use PCs are familiar with the rivalry between Symantec’s Norton and McAfee in combating PC viruses and the like. For my part, my experience (and that of many PC World tests) was that, despite significant differences, both did their job relatively well, and that one could not lose by staying with either or by switching from one to the other.

That changed about two to three years ago. Like many others, I chose not to move to Vista and stayed with XP. At about that time, I began to take a major hit in performance and startup time. Even after I ruthlessly eliminated all startup entries except McAfee (which refused to stay eliminated), startup took three to five minutes, performance in the first few minutes after the desktop displayed was practically nil, and performance after that (especially over the Web) was about half what it should have been. Meanwhile, when I switched to the free Comcast version of McAfee, stopping the automatic raiding of my credit card for annual renewals was like playing Whack-a-Mole, and newer versions increasingly interrupted processing at all hours, either to request confirmations of operations or to carry out unsolicited scans that slowed performance to a crawl in the middle of work hours.

Well, you say, business as usual. Except that Comcast switched to Norton last year, and as I have downloaded the new security software to each of five new and old XP/Win7 PCs and laptops, the difference has been dramatic in each case. No more prompts demanding a response; no more major overhead from scans; startup clearly faster, and faster still once I removed stray startup entries via Norton; performance on and off the Web close to performance without security software. And PC World assures me that there is still no major difference in security between Norton and McAfee.

Perhaps I am particularly unlucky. Perhaps Intel, as it attempts to incorporate McAfee’s security into firmware and hardware, will fix the performance problems and eliminate the constant nudge-nudge wink-wink of McAfee’s response-demanding reminders. It’s just that, as far as I can see from the press release, Intel does not even mention “We will use Intel technology to speed up your security software.” Is this a good sign? Not by me, it isn’t.

So I conjecture, again, that Intel’s acquisition of McAfee may be great news for Symantec. What happens in the consumer market tends to bleed over into business, so problems with consumer performance may very well affect business users’ experience of their firewalls as well; in which case, this would give Symantec the chance to make hay while Intel shines a light on McAfee’s performance, and to cement its market position to the point where Intel will find it difficult to push a one-stop security-plus-hardware shop on its customers.

Of course, none of my evidence is definitive, and many other things may affect the outcome. However, if I were a business customer, I would be quite concerned that, along with the value-add of the acquisitions of SPSS by IBM and McAfee by Intel, may come a significant value-subtract.