Monday, January 2, 2012

The Other BI: Pervasive DataRush and Parallel Data Streaming

This blog post highlights a software company and technology that I view as potentially useful to organizations investing in business intelligence (BI) and analytics in the next few years. Note that, in my opinion, this company and solution are not yet typically “top of the mind” when we talk about BI today.

The Importance of the DataRush Software Technology to BI
The basic idea of DataRush, as I understand it, is to superimpose a “parallel dataflow” model on top of typical data management code, in order to improve the performance (and therefore scalability) of the data-processing operations used by typical large-scale applications. Right now, your processing in general and your BI querying in particular are typically done either by “query optimization” within a “database engine” that takes one stream of “basic” instructions and parallelizes it by figuring out (more or less) how to run each step in parallel on separate chunks of data, or by programmer code that attempts a wide array of strategies for speeding things up further, ranging from “delayed consistency” (in cases where lots of updates are also happening) to optimization for the special case of unstructured data (e.g., files consisting of videos or pictures). “Parallel dataflow” instead requires that particular types of querying/updates be separated into multiple streams depending on the type of operation. This is done up front, as a specification by a programmer of a dataflow “model” that applies across all applications with the same types of operation.

There is good reason to believe, as I do, that this approach can yield major, ongoing performance improvements in a wide variety of BI areas. In the first place, the approach should deliver performance improvements over and beyond existing engines and special-case solutions, and not force you into supporting yet another alternate technology path. The idea of dataflow is not new, but for various historical reasons this variant has not been the primary focus of today’s database engines, and so the job of retrofitting to support “parallel dataflow” is nowhere near completion in most database engines. That means that, potentially, using “parallel dataflow” on top of these engines can squeeze out additional parallelism, due to the increased number and sophistication of the streams, especially on massively parallel architectures such as today’s multicore-chip server farms.

At the same time, the increasing importance of unstructured and semi-structured data has created something of a “green field” in processing this data, especially in areas such as health care’s handling of CAT scans, vendors streaming video over the Web, and everyone querying social-media Big Data. Where existing data-processing techniques are not set in concrete, “parallel dataflow” is very likely to yield outsized performance gains when applied, because it operates at a greater level of abstraction than most database engines and special-case file handlers like Hadoop/MapReduce, and so can be customized more effectively to new data transaction mixes and data types.

There is always a caveat in dealing with “new” software technologies that are really an evolution of techniques whose time has come. In this case, the caveat concerns the fact that, as noted, programmers or system designers need to specify the dataflows, rather than the database engine, and this dataflow “model” is not a general case for all data processing. That, in turn, means that at least some programmers need to understand dataflows on an ongoing basis.

It is my guess that this is a task that users of “parallel dataflow” and DataRush should embrace. There is a direct analogy here between agile development and DataRush-based development. The usefulness of agile development lies not only in the immediate speedup of application development, but also in the way that agile development methodologies embed end-user knowledge in the development organization, with all sorts of positive follow-on effects on the organization as a whole. In the same way, setting up dataflows for a particular application leads typically to a new way of thinking about applications as dataflows, and that improves the quality and often the performance of every application that the organization handles, whether it is optimizable by “parallel dataflow” or not.

In other words, in my opinion, developers’ knowledge of data-driven programming is increasingly inadequate in many cases. Automating this programming in the database engine and user interface can only do so much to make up for the lack. It is more than worth the pain of additional ongoing dataflow programming to reintroduce the skill of programming based on a data “model” to today’s generation of developers.

The Relevance of Pervasive Software to BI
Let me state my conclusion up front: I view investment in Pervasive Software’s DataRush technology as every bit as safe as investment in an IBM or Oracle product. Why do I say this?

Let’s start with Pervasive Software’s “DNA.” Originally, more than 15 years ago, I ran across Pervasive Software as a spin-off of Novell’s Windows database of the 1980s. Over time, as databases almost always do, the solution that has become Pervasive PSQL has provided a stable source of ongoing revenue. More importantly, it has centered Pervasive Software from the very start in Windows, PC-server, and distributed database technologies servicing the SMB/large-enterprise-department market. In other words, Pervasive has demonstrated over 15 years of ups and downs that it is nowhere near failure, and that it knows the world even of the Windows/PC-server side of the Global 10,000 quite well.

At the same time, having followed the SMB/departmental market (and especially the database side) for more than 15 years, I am struck by the degree to which, now, software technologies move “bottom-up” from that market to the large enterprise market. Software as a Service, the cloud, and now some of the latest capabilities in self-service and agile BI are all taking their cue from SMB-style operations and technologies. Thus, in the Big Data market in particular and in data management in general, Pervasive is one leading-edge vendor well in tune with an overall movement of SMB-style open-source and other solutions centered around the cloud and Web data.

I therefore see the risks of Pervasive Software DataRush vendor lock-in and technology irrelevance over the next few years as minimal. And, of course, participation in the cloud open-source “movement” means crowd-sourced support as effective as IT’s existing open-source software product support.

Aren’t there any risks? Well, yes, in my opinion, there are the product risks of any technology, i.e., that technology will evolve to the point where “parallel dataflow” or its equivalent is better integrated into another company’s product. However, if that happens, dollars to doughnuts there will be a straightforward path from a DataRush dataflow model to that product’s data-processing engine – because the open-source market, at the very least, will provide it.

Potential Uses of DataRush for IT

The obvious immediate uses of DataRush in IT are, as Pervasive Software has pointed out, in Big Data querying and pharmaceutical-company grid searches. In the case of Big Data, DataRush front-ending Hadoop for both public and hybrid clouds is an interesting way to both reduce the number of instances of “eventual consistency” turning into “never consistent” and to increase the depth of analytics by allowing a greater amount of Big Data to be processed in a given length of time, either on-site at the social-media sites or in-house as part of handling the “fire hose” of just-arrived Big Data from the public cloud.

However, I don’t view these as the most important use cases for IT to keep an eye on. Ideally, IT could infuse the entire Windows/PC-server part of its enterprise architecture with “parallel dataflow” smarts, for a semi-automatic ongoing data-processing performance boost. Failing that, IT should target the Windows/small-server information handling in which increased depth of analytics of near-real-time data is of most importance – e.g., agile BI in general.

These suggestions come with the usual caveats. This technology is more likely than most to require initial experimentation by internal R&D types, and some programmer training, as well. Finding the initial project with the best immediate value-add is probably not going to be as straightforward as in some other cases, as the exact performance benefit of this technology for any kind of database architecture is apparently not yet fully predictable. Effectively, these caveats say: if you don’t have the IT depth or spare cash to experiment, just point the technology at a nagging BI problem and odds are very good that it’ll pay off – but it may not be a home run the first time out.

The Bottom Line for IT Buyers
Really, Pervasive DataRush is one among several performance-enhancing approaches that offer potential additional analytical power in the next few years, and so if IT passes this one up and opts for another, they may well keep pace with the majority of their peers. However, in an environment that most CEOs seem to agree is unusually uncertain, out-performing the majority, and extreme IT smarts in order to do so, are more frequently becoming necessary. At the least, therefore, IT buyers in medium-sized and large organizations should keep Pervasive DataRush ready to insert in appropriate short lists over the next two years. Preferably, they should also start the due diligence now.

The key to getting the maximum out of DataRush, I think, will be to do some hard thinking about how one’s BI and data-processing applications “group” into dataflow types. Pervasive Software, I am sure, can help, but you also need to customize for the particular characteristics of your industry and business. Doing that near the beginning will make extension of DataRush’s performance benefits to all kinds of existing applications far quicker, and thus will deliver far wider-spread analytical depth to your BI.

How will a solution like DataRush impact the organization’s bottom line? The same as any increase in the depth of real-time analysis – and right now that means that, over time, it will improve the bottom line substantially. For that reason, at the very least, Pervasive Software’s DataRush is an Other BI solution that is worth the IT buyer’s attention.

1 comment:

Brooke Kay said...

Good work,I m glad to find such nice article. It is really a nice sharing .i usually visit this nice site and get many information. Keep it up thanks.