Data Torrents and Rivers

As the data streaming in from computers, sensors, and other devices swells, new infrastructure will be needed to manage the flood

There is a sea change coming from the plummeting cost of microsensor technology. This change will cause everything of material significance on the planet to be sensor-tagged to report its state or location in real time, which will create a whole new collection of applications to process and monitor streaming data. You have heard a lot about the WalMart project to sensor tag every pallet of goods that it purchases using RFID technology, but that is only the tip of the iceberg.

Right now, many Wall Street firms are using computers to do electronic trading. Such programs process real-time market feeds of bid-ask quotations, as well as executed trades, to identify patterns of interest that can be exploited through trades. The volume of such feeds is going through the roof as electronic trading heats up. In addition, 100 milliseconds is “forever” in electronic trading, so the latency requirements in feed processing are very stringent.

You have also heard about the possibility of a tsunami early-warning system, but in the U.S., it might be more important to have an alarm system for severe weather, especially wind gusts and tornados. Such a system is readily buildable by adding real-time processing to the current collection of weather sensors on the ground throughout the nation.

Lastly, there is considerable interest in performing real-time monitoring of computer networks. Although such monitoring can watch for denial-of-service attacks and similar network-level collections of events, it is also possible to watch for application-level collections of events. For example, a fraud detection program in a financial services network might watch for the same user logging in from two different IP addresses that appear to be more than 1 mile apart.
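The fraud-detection rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each login event is assumed to arrive already tagged with an approximate (latitude, longitude) for its IP address, and the 10-minute window, the function names, and the event layout are all hypothetical, since the text specifies only the 1-mile distance.

```python
import math

EARTH_RADIUS_MILES = 3958.8

def miles_between(a, b):
    """Great-circle (haversine) distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))

def suspicious_logins(events, max_miles=1.0, window_seconds=600):
    """Yield (user, earlier_ip, later_ip) when one user logs in from two IP
    addresses located more than max_miles apart within window_seconds.

    events: iterable of (timestamp, user, ip, (lat, lon)) in timestamp order.
    The window length is an assumption; the text gives only the distance.
    """
    recent = {}  # user -> list of (timestamp, ip, location) still inside the window
    for ts, user, ip, location in events:
        kept = [e for e in recent.get(user, []) if ts - e[0] <= window_seconds]
        for _, old_ip, old_loc in kept:
            if old_ip != ip and miles_between(old_loc, location) > max_miles:
                yield (user, old_ip, ip)
        kept.append((ts, ip, location))
        recent[user] = kept
```

Note that only the logins still inside the window are kept in memory; nothing is written to a persistent store.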

Such applications have two common characteristics: large volumes of real-time messages (the fire hose) and the need to react instantly (where “now” means “right now”).

Traditional information processing architectures composed of database management systems (DBMSs), application servers, and messaging systems cannot meet these requirements and are “non-starters” in this new world of data streaming applications. So let's discuss the reasons why traditional products fail and then turn to the origins of a new kind of system software, so-called stream processing engines (SPEs), that will take over this class of applications. Next, we'll continue with the characteristics of SPEs and, finally, present a vision of a sensor-tagged world, which is less than a decade away.

The Failure of Traditional Software

Consider an electronic trading program. In the simplest case, it is watching a feed of trades of the form:

(stock-id, price, shares-traded)

It computes some sort of “secret sauce,” which traders will not divulge. One possible very simple secret sauce would be to compute the “volume-weighted momentum” of each stock over the last four trades, defined for simplicity as:

(Last-price) - Sum (price * shares-traded) / Sum (shares-traded)

The idea is to compute this metric for every stock for every “rolling window” of four trades. If the metric exceeds a positive or negative threshold, you then execute a momentum-oriented trading strategy. Of course, a real “secret sauce” would be much more complex than our simple example, which serves only to illustrate the problems with traditional architectures.
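To make the metric concrete, here is a minimal Python sketch of the rolling-window calculation. The function name, the threshold parameter, and the tuple layout are illustrative assumptions, and the sketch assumes every trade has a positive share count; it is not how any particular trading system or engine implements the computation.

```python
from collections import defaultdict, deque

WINDOW = 4  # trades per rolling window, as in the text

def momentum_signals(trades, threshold):
    """Yield (stock_id, momentum) whenever |momentum| exceeds threshold.

    trades: iterable of (stock_id, price, shares_traded) tuples.
    momentum = last price - sum(price * shares) / sum(shares),
    taken over the last WINDOW trades of that stock.
    """
    # One bounded buffer per stock; the oldest trade drops out automatically.
    windows = defaultdict(lambda: deque(maxlen=WINDOW))
    for stock_id, price, shares in trades:
        window = windows[stock_id]
        window.append((price, shares))
        if len(window) < WINDOW:
            continue  # not enough trades yet for a full window
        total_value = sum(p * s for p, s in window)
        total_shares = sum(s for _, s in window)
        momentum = price - total_value / total_shares
        if abs(momentum) > threshold:
            yield stock_id, momentum
```

For example, four trades of 100 shares each at prices 10, 10, 10, and 14 give a volume-weighted average of 11 and a momentum of 3.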

The conventional approach to this problem would be to bring in the feed using some sort of messaging infrastructure. In low-speed worlds, this might be XML, but at high speed it is more likely to be a binary message transport system such as MQSeries or Tibco Rendezvous. Then, an application is run to load the data into a DBMS. A second application is programmed to compute the secret sauce and take appropriate action. Such a DBMS-centric architecture suffers from three notable problems.

First, there are three major subsystems involved in the application (messaging system, application server, and DBMS). Executing the application requires a large number of context switches between these subsystems. Such context switches are very expensive: they limit the volume of messages that can be processed and put a lower bound on achievable latency.

The second problem is that the DBMS does not have the correct primitives to perform the desired computation. SQL DBMSs perform aggregates on stored data, for example:

Select avg (salary)
From Employee
Group by department

However, there is no notion of performing such aggregates on moving windows of tick data. Hence, the required computation must be laboriously programmed, either by periodically polling the DBMS for changes or by using the trigger system for alerting purposes. In either case, SQL does not have time windows, which makes the computation difficult to program and expensive to run.
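The polling approach might look roughly like the following sketch, which uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for a full client-server DBMS. The table layout and function names are hypothetical. One routine plays the loading application, and a second must re-query the stored trades and recompute the metric in application code on every poll.

```python
import sqlite3

def open_store():
    """Create the trade table (an in-memory stand-in for a real DBMS)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trades (seq INTEGER PRIMARY KEY AUTOINCREMENT, "
        "stock_id TEXT, price REAL, shares INTEGER)"
    )
    return conn

def load_trade(conn, stock_id, price, shares):
    """First application: store and commit each incoming trade."""
    conn.execute(
        "INSERT INTO trades (stock_id, price, shares) VALUES (?, ?, ?)",
        (stock_id, price, shares),
    )
    conn.commit()

def poll_momentum(conn, window=4):
    """Second application: re-read the last `window` trades of every stock
    and recompute the metric outside the database on each poll."""
    results = {}
    stocks = [row[0] for row in conn.execute("SELECT DISTINCT stock_id FROM trades")]
    for stock_id in stocks:
        rows = conn.execute(
            "SELECT price, shares FROM trades WHERE stock_id = ? "
            "ORDER BY seq DESC LIMIT ?",
            (stock_id, window),
        ).fetchall()
        if len(rows) < window:
            continue  # fewer than `window` trades stored so far
        last_price = rows[0][0]  # newest trade comes first (ORDER BY seq DESC)
        vwap = sum(p * s for p, s in rows) / sum(s for _, s in rows)
        results[stock_id] = last_price - vwap
    return results
```

Every poll issues one query per stock whether or not anything changed, and every trade is committed before it can be seen, which is the overhead the text describes.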

The third problem is even more fundamental. The traditional approach requires all the data to be loaded into a DBMS. This system will reliably commit the data to a disk repository and guarantee that it will never get lost. However, the application does not require data persistence at all; it merely needs real-time aggregates to be computed. The latency and overhead of a traditional DBMS are therefore incurred for services the application never uses.

For these three reasons, the specialized engines we will discuss in the following sections typically outperform the traditional architecture on these sorts of problems by two orders of magnitude. If you care about a factor of 100, then read on.

Stream Processing Engines

About five years ago, the research community became aware of the problems with traditional solutions, as well as the imminent arrival of a lot more streaming applications resulting from the sensor network sea change described above. There has been considerable research activity since that time in many aspects of sensor networks. This includes:

tiny operating systems to run on sensors

tiny database systems to run on sensors

lightweight networking fabrics, often with variable routing and intermittent (wireless) connectivity

single-pass data mining algorithms for streaming data

stream processing engines

We will focus on the last research area, that of constructing software engines capable of performing tasks like our momentum calculation with very high performance. Three research efforts stand out. The first is the Telegraph project at the University of California at Berkeley, which constructed an engine that is very good at reacting to hugely variable input conditions. The Berkeley team focused on dynamically changing the stream processing algorithms based on changing input conditions. A second effort was the Streams project at Stanford University. The focus of this effort was on precisely defining an extension to SQL for stream processing (so-called StreamSQL) and then implementing this language. The third effort was our work on Aurora and Borealis at Brown University, Brandeis University, and the Massachusetts Institute of Technology. Our focus was on finding a user-friendly graphical notation for StreamSQL and then implementing a very-high-performance prototype that would be able to make quality-of-service guarantees. The StreamBase engine is a commercialization and subsequent extension of this prototype. In the next section, we show how this SPE overcomes the three problems of a traditional architecture.

Stream Processing Engine Architecture

The StreamBase engine implements StreamSQL that has been entered through a graphical user interface. The major addition to SQL is the notion of time windows for joins and aggregates. Traditional SQL aggregates are defined over stored data, and a SQL engine knows when it is done with an aggregate because it gets to the end of a table. In contrast, a stream never ends, and a different notion is required. In StreamBase, aggregates and joins are defined over time windows that are supported directly by the engine. An aggregate can have a group_by clause just like regular SQL, but it is defined over a time window, not a table of data. This time window can be specified by:

A certain number of ticks, say 4. The next window can slide a certain number of ticks, say 1.

Wall-clock time: say, 9:00 to 9:01 is a window. The next window slides by (say) 15 seconds.

A break point in the value of some other attribute in the incoming message.
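The three kinds of window listed above can be sketched as Python generators. This is an illustration of the windowing idea only, not StreamBase's implementation or StreamSQL syntax; the function names are hypothetical, and treating a "break point" as a change in the value of a chosen attribute is our assumption.

```python
from collections import deque
from itertools import groupby

def tick_windows(stream, size=4, slide=1):
    """Windows of `size` consecutive items, advancing `slide` items each time."""
    buf = deque(maxlen=size)
    for count, item in enumerate(stream, start=1):
        buf.append(item)
        if count >= size and (count - size) % slide == 0:
            yield list(buf)

def time_windows(stream, length=60.0, slide=15.0):
    """Windows [start, start + length) over (timestamp, value) pairs that
    arrive in timestamp order. A window is emitted once a later timestamp
    closes it; windows still open when the input ends are not flushed, and
    a gap in the input can produce empty windows."""
    buf = deque()
    start = None
    for ts, value in stream:
        if start is None:
            start = ts  # the first window begins at the first timestamp
        while ts >= start + length:
            yield start, list(buf)
            start += slide
            while buf and buf[0][0] < start:
                buf.popleft()  # discard items that slid out of the window
        buf.append((ts, value))

def breakpoint_windows(stream, key):
    """Windows of consecutive items that share the same key(item) value."""
    for _, group in groupby(stream, key):
        yield list(group)
```

With size=4 and slide=1, tick_windows produces exactly the rolling four-trade windows used in the momentum example; each generator holds only the items in the current window, never the whole stream.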

Using the notion of windows, the aggregate specified in

(Last-price) - Sum (price * shares-traded) / Sum (shares-traded)

is straightforward for StreamBase; in fact, it is defined on the tick-based window indicated above.

Besides having the right primitives that support real-time aggregates through time windows, an SPE has two other major features. The first is that there is no requirement to store the data. Hence, an SPE processes StreamSQL on the bytes of the messages as they fly by, without loading the data into a persistent store. The overhead and latency of a DBMS are thereby avoided completely.

The other major feature is a single-process architecture. StreamBase applications can be spread over any number of computer systems. However, StreamBase runs as a single operating system process on each machine that it uses in the application. Inter-machine messages are handled by an internal transport system. Hence, there is no need for a separate messaging fabric. In addition, StreamBase supports user-defined functions, as well as user-defined aggregates. The code for these operations is dynamically linked into the StreamBase execution routine. Hence, StreamBase also includes capabilities routinely found in an application server. Lastly, many streaming applications require storing state persistently (such as yesterday’s stock momentum). To support this capability, StreamBase embeds a lightweight SQL database. The data is thereby stored in-process, and client-server architecture is not used, in contrast to traditional DBMSs. Using these techniques, StreamBase can process a distributed application with no context switches at all.

These three techniques (the right primitives in StreamSQL implemented at the bottom of the system, a single-process architecture, and no requirement to persistently store the data) account for the two orders of magnitude performance advantage enjoyed by SPEs on this style of application.

In addition, there is one final advantage to SPEs that is worth mentioning. StreamBase is routinely called on to replace a custom-coded solution, since high performance was, up to now, only available from “roll your own” implementations. Custom-coded solutions in a third-generation language like C++ or Java involve large amounts of code, thereby taking a lot of time and generating a maintenance headache. In contrast, a StreamBase application can be coded very quickly in its high-level primitives. Such applications can be built in days, not months, and are thereby easy to construct, easy to modify, and easy to maintain. The advantages in programming time and effort of a compact, easy-to-program StreamSQL notation are dramatic and complement the performance advantages of an SPE.

The Coming World

As the cost of microsensors continues to decline over the next decade, we see a world coming when everything of material significance gets sensor tagged. This will include:

You and your kids at amusement parks (your wristband will be a tag). This will mean no more lost kids. A pleasant side effect will be line management for rides.

Your car. This will enable variable tolling on highways as well as rush-hour route optimization in urban areas. In addition, your insurance company may give you a break, depending on your exact driving habits.

Your (or your company’s) valuable possessions. Plastic property stickers will be replaced by electronic tags. This will be the end of furniture clerks conducting inventory for objects like overhead projectors. It will also make theft difficult, since an object can be “paired” with an employee ID (also a tag) at the exit to a company building.

Library books: These will be tagged. If mis-shelved, they are not lost forever.

People requiring medical supervision: Vital signs monitors will send real-time health information over wireless networks to medical personnel. This will be good news for patients recently released from a hospital, as well as chronically ill patients. This is a next-generation medic-alert system.

Taxi cabs: Geo-positioning all taxi cabs, plus all patrons (through their cell phones), will allow one to hail a cab in New York City without having to wander into the street, wildly gesturing.

Cellphones (are sensors today): There will be a range of services, such as “I am hungry, alert me if I pass within 2 miles of a restaurant that serves veal scaloppini.”

Grocery store items (as opposed to pallets): This will remove “shrinkage,” as well as let your shopping cart perform automatic checkout. In addition, for better or worse, it will allow real-time coupons. “Say Mike, you just put Diet Coke in your cart. Pepsi has a special deal just for you….”

Only your imagination will limit the applications of this technology. We expect the most significant applications are ones that we have not thought of yet.

Over the next decade, SPEs will become a substantial market, as the sea change in sensor networks creates a new class of applications.

Let's close with a short note on privacy. If you geoposition cars in real time, this will allow you to know exactly where your 16-year-old son is, rather than worry whether he is okay. However, it will also let your insurance company know exactly where you drive. (Certainly, this will be great fodder for divorce attorneys assisting warring spouses!) Hence, there are socially beneficial uses for the technology, as well as ones that invade privacy.

Privacy issues are only partly a technical problem. To ensure privacy, there are currently unmet technical requirements. For example, it must be possible to deactivate sensors, if the customer wants temporary anonymity. However, privacy is much more a legal problem, defining who can do what with real-time data. This issue already exists with sensitive medical and financial data. For example, I certainly want laws that preclude my credit card company publishing my card transactions (which it owns) on the Internet. Addressing these complex legal issues will keep attorneys busy for quite a while.

Summary

There is a brave new world approaching, where fire hoses of real-time data must be processed in milliseconds. A new class of system software, stream processing engines, is required to address this market. SPEs require the right primitives for real-time analytics, as well as a single-process architecture and no requirement to actually store data persistently. Over the next decade, SPEs will become a substantial market, as the sea change in sensor networks creates a new class of applications.

About the Author

Michael Stonebraker is an adjunct professor of computer science at M.I.T., where he does research on stream processing engines and new database architectures. He is also the founder and chief technology officer of StreamBase Systems, in Lexington, Mass., which sells a high-performance SPE. He was the architect of the INGRES relational DBMS, as well as the POSTGRES object-relational engine. Both are available as open-source software systems. Stonebraker was the 1988 recipient of the ACM System Software Award, the 1992 ACM SIGMOD Innovations Award, and the 2005 IEEE John von Neumann Medal, among many other honors.

IEEE Spectrum