There is a sea change coming from the plummeting cost of microsensor technology. This change will cause everything of material significance on the planet to be sensor-tagged to report its state or location in real time, which will create a whole new collection of applications to process and monitor streaming data. You have heard a lot about the WalMart project to sensor tag every pallet of goods that it purchases using RFID technology, but that is only the tip of the iceberg.

Right now, many Wall Street firms are using computers to do electronic trading. Such programs process real-time market feeds of bid-ask quotations, as well as executed trades to identify patterns of interest that can be exploited through trades. The volume of such feeds is going through the roof as electronic trading heats up. In addition, 100 milliseconds is ”forever” in electronic trading, so the latency requirements in feed processing are very stringent.

You have also heard about the possibility of a tsunami early-warning system, but in the U.S., it might be more important to have an alarm system for severe weather, especially wind gusts and tornados. Such a system is readily buildable by adding real-time processing to the current collection of weather sensors on the ground throughout the nation.

Lastly, there is considerable interest in performing real-time monitoring of computer networks. Although such monitoring can watch for denial-of-service attacks and like network-level collections of events, it is also possible to watch for application-level collections of events. For example, a fraud detection program in a financial services network might watch for the same user logging in from two different IP addresses that appear to be more than 1 mile apart.

Such applications have two common characteristics: large volumes of real-time messages (the fire hose); and the need to react instantly (where ”now” means ”right now”).

Traditional information processing architectures composed of database management systems (DBMSs), application servers, and messaging systems cannot meet these requirements and are ”non-starters” in this new world of data streaming applications. So let's discuss the reasons why traditional products fail and then turn to the origins of a new kind of system software, so-called stream processing engines (SPEs), that will take over this class of applications. Next, we'll continue with the characteristics of SPEs and, finally, present a vision of a sensor-tagged world, which is less than a decade away.

The Failure of Traditional Software

Consider an electronic trading program. In the simplest case, it is watching a feed of trades of the form:

(stock-id, price, shares-traded)

It computes some sort of ”secret sauce,” which traders will not divulge. One possible very simple secret sauce would be to compute the ”volume-weighted momentum” of each stock over the last four trades, defined for simplicity as:

(Last-price) - Sum (price * shares-traded) / Sum (shares-traded )

The idea is to compute this metric for every stock for every ”rolling window” of four trades. If the metric exceeds a positive or negative threshold, you then execute a momentum-oriented trading strategy. Of course, a real ”secret sauce” would be much more complex than our simple example, which serves only to illustrate the problems with traditional architectures.

The conventional approach to this problem would be to bring in the feed using some sort of messaging infrastructure. In low-speed worlds, this might be XML, but at high speed it is more likely to be a binary message transport system such as MQSeries or Tibco Rendezvous. Then, an application is run to load the data into a DBMS. A second application is programmed to compute the secret sauce and take appropriate action. Such a DBMS-centric architecture suffers from three notable problems.

First, there are three major subsystems involved in the application (messaging system, application server, and DBMS). There will be a large number of context switches between these subsystems in executing this application. Such context switches are very expensive and limit the volume of messages that can be processed and put a lower bound on achievable latency.

The second problem is the DBMS does not have the correct primitives to perform the desired computation. SQL DBMSs perform aggregates on stored data, for example:

Select avg (salary)

From Employee

Group by department

However, there is no notion of performing such aggregates on moving windows of tick data. Hence, the required computation must be laboriously computed by periodically polling the DBMS for changes or by using the trigger system for alerting purposes. In either case, SQL does not have time windows , which makes the computation difficult to program and expensive to run.

The third problem is even more fundamental. The traditional approach requires all the data to be loaded into a DBMS. This system will reliably commit the data to a disk repository and guarantee that it will never get lost. However, the application does not require data persistency at all; it merely needs real-time aggregates to be computed. As such, the latency and overhead of involving a traditional DBMS is not required, since its services are not needed.

For these three reasons, the specialized engines we will discuss in the following sections typically outperform the traditional architecture on these sorts of problems by two orders of magnitude. If you care about a factor of 100, then read on.