I have a feeling that a fair number of readers – especially
vendors and IT BI types – are going to be upset by what I have to say in this
post. However, viewing some of the
material that has passed across my desk recently, I really think it’s time to
raise the question of whether too much organizational power given to data
warehouse folks is beginning to cause some significant under-performance in
meeting today’s key organizational information management needs.
The immediate occasion for these reflections is that I am
partway through a book on a related subject that goes into some detail on data
warehousing’s view of the world: how BI
should be handled, what the organizational information architecture should be,
and how we got this way. This book will
remain nameless, because in many ways it’s an excellent primer. However, over the last 22-31 years (depending
on whether you count my software development days), I have had a
cross-organization, cross-vendor view of the same area, and I have to say that
the book redefines history and the purposes of various things in the ideal
information architecture in major ways.
Usually, I find that going over history just wastes time in
a blog post – but here, it helps to see how the data warehousing understanding of
common information management terms leads its adherents to reinterpret the purposes of the
underlying products, making the
information architecture – and the whole information handling process –
potentially (and, probably, actually) less effective in the medium and long
term. So let’s combine history with an exposition of my assertion.
A Data Warehousing View of the World
In brief, the book’s view of the information architecture
seems to be as follows: Data of all types comes in to production systems, which
immediately pass it on to the data warehouse for cleansing and aggregation.
Behind the data warehouse is an optional operational data store for key data,
and things like master data management operate in parallel with the data
warehouse to provide a global view of multiple local ways to store customer
data. On top of the data warehouse are key Business Intelligence applications,
which include both repetitive, scheduled reporting and analytics.
Now, this view of the world seems reasonable if you were
born yesterday, or if you’ve spent the last fifteen years entirely in data
warehousing. However, there are, in my
view, some major problems with it.
In the first place, as far as I know, only in data warehousing are the
databases at the initial entry point referred to as “production systems”. For twenty years, I have been calling them
“operational databases”. In fact, they were business-critical before data
warehousing existed, and so were the apps on top of them – like ERP.
Why does this matter? Because it allows data warehouse folks
to shift the “operational data store” behind the data warehouse. The operational data store is a later
concept, and one that I (among others, I assume) wrote papers proposing around
2004 and 2005. The idea is that the data warehouse is simply too slow to react
immediately to key operational data – but that operational data is scattered
across multiple operational databases, and so an “operational data store”
makes sure that a subset of operational data for quick decision-making is
either put in a central point for quick analysis in parallel with its arrival,
or monitored by a central “virtual database.” Putting the operational data
store behind the data warehouse defeats its entire purpose.
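To make the placement argument concrete, here is a minimal Python sketch, under stated assumptions rather than a description of any product. The names `OperationalDatabase`, `OperationalDataStore`, `is_key_event`, and `ingest`, along with the sample events and the "orders of 1000 or more" rule, are all hypothetical. The sketch shows key data being copied to a central store as it arrives, in parallel with the write to the operational database, instead of waiting for a warehouse load.

```python
from dataclasses import dataclass, field


@dataclass
class OperationalDatabase:
    """Stand-in for one of many operational databases (hypothetical)."""
    name: str
    rows: list = field(default_factory=list)

    def write(self, event: dict) -> None:
        self.rows.append(event)


@dataclass
class OperationalDataStore:
    """Central point holding only the subset needed for quick decisions."""
    rows: list = field(default_factory=list)

    def capture(self, source: str, event: dict) -> None:
        # Record where the event came from alongside the event itself.
        self.rows.append({"source": source, **event})


def is_key_event(event: dict) -> bool:
    # Assumption: "key" operational data is picked out by a simple rule.
    return event.get("type") == "order" and event.get("amount", 0) >= 1000


def ingest(db: OperationalDatabase, ods: OperationalDataStore, event: dict) -> None:
    # The operational write always happens first ...
    db.write(event)
    # ... and key data reaches the central store in the same step,
    # not after a delayed cleansing and aggregation pass.
    if is_key_event(event):
        ods.capture(db.name, event)


erp = OperationalDatabase("erp_emea")
crm = OperationalDatabase("crm_us")
ods = OperationalDataStore()

ingest(erp, ods, {"type": "order", "amount": 2500})
ingest(crm, ods, {"type": "contact", "amount": 0})
ingest(crm, ods, {"type": "order", "amount": 50})
```

In this toy run, all three events land in their operational databases, but only the large order is also visible in the central store, and it is visible there immediately.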
Likewise, the master data management system. I wrote papers
on this in assessing IBM’s version of the concept in 2006 and 2007. Again,
the notion was of combining operational data coming in to operational databases
– in this case, by enforcing a common format that allowed cross-organization
and cross-country leveraging of operational data by ERP and customer
intelligence apps. By redefining master data management as existing within
the data warehouse or at the same remove from operational databases, data
warehouse folks ensure that master data management moves no faster than the
data warehouse.
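As a rough illustration of what "enforcing a common format" can mean at the operational level, here is a small Python sketch. The source names (`crm_us`, `erp_de`), the field names, and the `to_master_record` helper are assumptions made up for this example; real master data management products do far more (matching, survivorship, governance).

```python
# Hypothetical mapping from each local schema to one common customer format.
FIELD_MAPS = {
    "crm_us": {"cust_name": "name", "zip": "postal_code", "ctry": "country"},
    "erp_de": {"kunde": "name", "plz": "postal_code", "land": "country"},
}


def to_master_record(source: str, record: dict) -> dict:
    """Translate one local customer record into the common format."""
    mapping = FIELD_MAPS[source]
    master = {}
    for local_field, value in record.items():
        if local_field in mapping:
            # Rename the field and trim stray whitespace.
            master[mapping[local_field]] = str(value).strip()
    # Normalize the country code so records compare across countries.
    master["country"] = master.get("country", "").upper()
    master["source"] = source
    return master


us = to_master_record(
    "crm_us", {"cust_name": " Acme Corp ", "zip": "02139", "ctry": "us"}
)
de = to_master_record(
    "erp_de", {"kunde": "Acme GmbH", "plz": "80331", "land": "de"}
)
```

Both records come out with the same field names, so an ERP or customer intelligence app can use them side by side as soon as they arrive, without waiting on a warehouse load.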
And finally, there is the idea that (implicitly) analytics
is entirely contained in BI, and hence is entirely dependent on the data
warehouse. On the contrary, an increasing amount of analytics goes on outside
of BI. For example, analytics is part of
products that analyze computer infrastructure semi-automatically to optimize
performance or detect upcoming problems. Or, it is used to analyze key
computer-supported business processes.
This is “intelligence” in the sense of “military intelligence” –
proactively going out and finding out what’s going on – but it is not “business
intelligence” in the sense of using data that is handed to you to find out
what is going on inside and outside the business that your
reporting tools are too slow or shallow to tell you. In other words, these
applications of analytics are entirely outside of a reactive data warehouse.
Why It Matters
There are two places where over-emphasis on data warehousing
can impede organizational BI and other information management
effectiveness: the information
architecture, and the organization’s “agility” in responding to new kinds of
information from outside. As I’ve suggested in the previous section, a data
warehousing view of the information architecture shifts operations that involve
lots of “updates” and data just arrived from outside to the data warehouse or
behind it. That means going through the
data-warehouse cleansing and aggregation process and arriving in a centralized
location that is handling queries from all over the organization and is
optimized for adding new data not “on the fly” but in delayed bursts. There is
simply no way that is going to be as timely as performing tasks on the data as
it arrives in the operational systems.
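A back-of-the-envelope calculation shows the timeliness gap. The figures below (hourly loads, a 15-minute cleansing and load step, one minute of on-arrival processing) are illustrative assumptions, not measurements, and the function names are hypothetical.

```python
def batch_delay_minutes(arrival: int, batch_interval: int, load_duration: int) -> int:
    """Minutes until a record is queryable when loads start every
    batch_interval minutes and each load takes load_duration minutes."""
    # First load that starts at or after the arrival time
    # (integer ceiling division).
    next_load_start = -(-arrival // batch_interval) * batch_interval
    return next_load_start + load_duration - arrival


def on_arrival_delay_minutes(processing: int) -> int:
    """Minutes until a record is usable when handled where it arrives."""
    return processing


# A record arriving one minute after an hourly load has started
# waits for the next load, then for that load to finish.
just_missed = batch_delay_minutes(arrival=61, batch_interval=60, load_duration=15)

# A record arriving exactly as a load starts only waits for the load.
best_case = batch_delay_minutes(arrival=60, batch_interval=60, load_duration=15)

handled_on_arrival = on_arrival_delay_minutes(processing=1)
```

Under these assumptions the delay through the warehouse ranges from 15 to 74 minutes, against one minute when the data is handled on arrival.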
Just as troubling, the entire emphasis of the organization
is now more reactive and focused farther away from the organization’s
“antennae” to the outside environment. The IT organization appears to be
focused on responding to new demands from business for timelier data, not
actively seeking the latest new information and merging it back into existing
systems. The IT organization appears to emphasize cleaning up the data and
merging it and only then analyzing it at an internal “choke point”, rather than
handling the information faster where it arrives.
If you think these concerns are theoretical, think about the
case of social-media Big Data. Yes, Oracle as a major vendor is emphasizing
inhaling huge amounts of this data from multiple clouds into the data warehouse
and then analyzing it – when the whole purpose of the NoSQL movement is to
allow rapid in-cloud analysis of inconsistent, uncleansed data – but it would
not do so unless there were some organizational push to avoid analytics outside
the data warehouse. I conclude that
there is some strong evidence that a data warehousing focus is impeding
organizational ability to process and feed to business decision makers key
information in as timely a fashion as possible.
Moreover, there is some sense that this is not an
organizational quirk but a tendency so embedded in the IT organization that
this impediment is a symptom not of a temporary problem that is easy to fix,
but rather of an organizational “disease.” In other words, simply directing the
organization to pay more attention to doing social-media processing in the
cloud will probably not work.
Action Strategies and Conclusion
First (although I think there is little danger of this) I
must caution against throwing the baby out with the bathwater. There are very good reasons to have a data
warehouse performing the core functions of querying for BI. I have, in the
past, conjectured that if I were to design a new information architecture
today, I might not create a data warehouse or data mart at all – instead, I
might impose “data virtualization” and master data management tools over
existing operational databases. However, practically speaking, in most if not
all cases, the sheer experience behind today’s data warehousing products makes
them far preferable for core functions.
Rather, I would suggest that data warehousing be placed
under, and be responsive to rather than dominant over, an information
architecture and information strategy function aimed more at the edge of the
organization than its central data center. This is not a matter of making the
organization more responsive to the business; it is a matter of making the IT
organization more agile (by my definition, which stresses the utility of
proactive and outside-the-organization-directed agility).
Until I saw this book, which suggested that data warehouse
folks had gone too far in asserting “IT information handling is all about the
data warehouse”, I was not too concerned about data warehousing folks; I would get
into annoying arguments with folks who thought I just didn’t “get” data
warehousing, but it seemed to me that the benefits of a powerful
database-related IT function outweighed the negatives of data warehouse folks’
“not invented here” blind spots. Now, I
am rethinking my position. If the result
of this type of rewriting of history is an increasingly sub-optimal information
architecture, then such a “disease” is not so harmless after all.
Does your organization suffer from data warehouse
disease? If so, what do you think should
be done about it?