To hear those in the know tell it, database management systems leave a lot to be desired: They require too much hardware, and they make poor use of it. Some advocate a shift in the type of hardware used—away from the CPU toward graphics-processing units [see "Data Monster," September 2009]. But open-source database firm Ingres, of Redwood City, Calif., and start-up VectorWise, of Amsterdam, see the answer in a better use of the CPU.
Their prototype software has shown more than a 10-fold improvement in overall performance and an 80-fold improvement on some tasks.
To get such an improvement, database luminary Peter Boncz and others at the Dutch national math and computer science research institute, Centrum Wiskunde & Informatica (CWI), took a close look at how modern CPUs work and used what they found to build a database system from scratch. They formed VectorWise in 2008 and joined forces with Ingres and Intel this year.
Database systems today are written "for the machine of 20 years ago," says Bill Maimone, Ingres's chief technology officer. They can't easily take advantage of a modern processor's ability to perform a single instruction on a large set of data, and they're at the mercy of the relatively slow movement of data on and off the CPU.
To solve these problems, CWI computer scientists came up with versions of database operations that work on sets of 100 to 1000 values, or vectors, instead of on one database value at a time. As a result, some operations that take tens or hundreds of CPU clock cycles in other databases take just a handful in the VectorWise system.
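The difference between value-at-a-time and vector-at-a-time processing can be sketched roughly as follows. This is an illustration only, not VectorWise code: the function names, the salary data, and the vector size of 1024 are assumptions, and plain Python shows the structure of the idea rather than the CPU-level speedup.

```python
from array import array

VECTOR_SIZE = 1024  # assumed; the article cites vectors of 100 to 1000 values


def add_bonus_value_at_a_time(salaries, bonus):
    """Conventional style: one function call for every single value."""
    def add_one(value):
        return value + bonus
    return array("q", (add_one(v) for v in salaries))


def add_bonus_vectorized(salaries, bonus, vector_size=VECTOR_SIZE):
    """Vectorized style: one step per chunk, with a tight loop inside it."""
    out = array("q")
    for start in range(0, len(salaries), vector_size):
        chunk = salaries[start:start + vector_size]  # one vector of values
        out.extend(v + bonus for v in chunk)
    return out


salaries = array("q", range(30_000, 35_000))  # hypothetical salary column
assert add_bonus_value_at_a_time(salaries, 500) == add_bonus_vectorized(salaries, 500)
```

In a compiled engine, the inner loop over a chunk is the part that can be mapped onto instructions that process many values at once, and the per-call overhead is paid once per vector rather than once per value.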
The scientists also constructed the system so that all the work is done on data in the CPU's cache, where the processor cores can quickly get at it, instead of in main memory, from which a fetch can take hundreds of clock cycles. This required them to compress the data in some parts of the cache and come up with fast decompression algorithms so that the process of fetching data didn't bog down.
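The article does not say which compression schemes VectorWise uses. As one generic example of a lightweight scheme that is cheap to decompress, the sketch below applies frame-of-reference encoding: it stores one base value plus small offsets. The function names and sample values are hypothetical, and the input is assumed to be a non-empty list of integers.

```python
from array import array


def compress_for(values):
    """Frame-of-reference: keep one base value plus small unsigned offsets."""
    base = min(values)
    offsets = [v - base for v in values]
    widest = max(offsets)
    # Pick the narrowest unsigned type code that can hold the largest offset.
    if widest < 2**8:
        typecode = "B"
    elif widest < 2**16:
        typecode = "H"
    elif widest < 2**32:
        typecode = "L"
    else:
        typecode = "Q"
    return base, array(typecode, offsets)


def decompress_for(base, offsets):
    """Decompression is a single addition per value."""
    return [base + o for o in offsets]


salaries = [52_000, 52_150, 52_040, 52_255]  # hypothetical values
base, packed = compress_for(salaries)
assert decompress_for(base, packed) == salaries
```

Here the four salaries fit in one byte each after the base is subtracted, so more of them fit in the same amount of cache, and getting them back costs only an addition.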
Both tasks were helped by VectorWise's use of a database storage scheme called column-store. Data is sent from storage to the CPU as strings of values from the same attribute domain—for instance, a list consisting only of salaries rather than a record containing employee names, salaries, and other data, explains Daniel Abadi, an assistant professor of computer science at Yale University. Column-store makes it easier to perform vector calculations because all the needed values are stored contiguously. Column-store data is also easier to compress, he notes, because it has more inherent order to it.
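A minimal sketch of the layout difference Abadi describes, using made-up employee records and a hypothetical helper named to_columns:

```python
# Row-store: each record keeps all of one employee's attributes together.
rows = [
    ("Ada", 52_000, "Sales"),
    ("Bram", 61_500, "Support"),
    ("Cees", 58_250, "Sales"),
]


def to_columns(rows, names):
    """Column-store: regroup the same data as one list per attribute."""
    columns = {name: [] for name in names}
    for row in rows:
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


columns = to_columns(rows, ("name", "salary", "department"))

# A query over salaries now reads one contiguous list and never touches
# the names or departments.
total = sum(columns["salary"])
assert total == 171_750
```

Because each list holds values of a single type and domain, it is also a natural unit for both vector operations and compression.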
Abadi calls VectorWise "a company to watch," but its software won't be for everyone. One limitation, he points out, is that the new system is designed to run on a single machine with a database of less than 10 terabytes. That would rule out most databases used by big retail firms.