Data Monster

Why graphics processors will transform database processing

Illustration: Superbrothers

The history of technology is full of breakthroughs in one field that wound up working wonders in a related one. The 300B vacuum tube, introduced by Western Electric in 1937 to amplify telephone signals, found a far more enduring use as a high-fidelity audio amplifier. The atomic clocks first used in the 1960s by the U.S. military to track Sputnik and later to validate Albert Einstein's relativity theories are now the basis of the Global Positioning System. And of course, the magnetron, invented in the 1920s at General Electric and used in radars during World War II, later found itself repurposed as the basis for the microwave oven.

Now add another tech crossover: The graphics coprocessor, invented in the 1970s to churn through voluminous and repetitive calculations and render smooth and realistic-looking images on computer screens, can now chew on large-scale databases.

Database processing is a cornerstone of computing, and it is a market that last year generated approximately US $27 billion, according to technology analysis firm Forrester Research, in Cambridge, Mass. The firm projects that this number—which includes new database licenses, technical support, and consulting—will grow to $32 billion by 2013. Every time you bid on an eBay auction, search for a movie on Netflix, look for a Kindle title on Amazon, or do a Google search, massive database applications spring into action, delving into huge quantities of data spread across tens of thousands of machines.

This radical new task for graphics chips evolved from their role as the engine of computer games. So what does sifting enterprise-class databases have in common with rendering virtual monsters in a game? Both require handling huge amounts of data: Realistic-looking virtual monsters require generating millions of pixels every second, while searching large databases involves accessing millions of records per second. So why not take the same hardware that accelerates virtual monsters and put it to work on real-world applications, like the databases that are a large part of our daily lives—more so than pixel monsters?

For the past few years, first at the University of California, Santa Cruz, and now at Oracle, we have been looking for ways to leverage the power of these graphics processors, known as graphics processing units (GPUs). These special-purpose chips are designed to be paired with a central processing unit (CPU) for applications like games and scientific visualization, which demand high graphics performance. GPUs are the progeny of the old video cards, which did nothing but display memory contents on the screen. They ease the computational burden on the CPU by handling the calculations and other simple, highly repetitive operations necessary for rendering the lines, polygons, and surfaces of a full-motion graphics scene. For the price of a low-end computer, high-end graphics cards condense into a single PC card the processing power that just 10 years ago required a supercomputer.

A GPU can deliver hundreds of billions of operations per second—some GPUs more than a teraflop, or a trillion operations per second—while requiring only slightly more electrical power and cooling than a CPU. For the same levels of power and cooling, a GPU can deliver 20 to 30 times as much total computational power. That's a lot less power per calculation.

Take the Nvidia GeForce GTX 285 graphics card, for example. For the price of a quad-core CPU ($300) and 1.5 times the power consumption (200 watts) you get a processor that can rip through 1 trillion mathematical operations per second, about 20 times as fast as a CPU. And because you can plug up to four GPUs into a single server, the GPUs could be used to retrofit existing systems. Four GPUs in place of a single CPU would mean an 80-fold increase in performance for just six times the power consumption. So, with fewer machines, you get less heat and a lower air-conditioning bill.

Captivated by such numbers, researchers have already begun harnessing the power of server-scale GPU computing. In mid-2007, Nvidia unveiled a GPU-based supercomputer called the Tesla platform, which can crunch the numbers that riddle molecular biology, medical diagnostics, and geology. Newer Tesla systems cram up to four GPUs—nearly 1000 cores—into a single 4.45-centimeter-high rack-mount server capable of teraflop speeds. As of June 2009, the Tsubame cluster at the Tokyo Institute of Technology, made up of 170 Tesla servers, ranked 41st on the biannual list of top 500 supercomputers. It achieved a peak performance of 77 teraflops on the Linpack benchmark, which emulates scientific applications that solve systems of linear equations.

For obvious reasons, we are particularly interested in any advance that promises significant performance gains in database processing. We have developed a parallel search algorithm that exploits the unique architecture of GPUs and has performed very well so far in our experiments.

It's a start, but there's more to be done before the technique is ready for prime time. The sheer performance of these chips, coupled with widespread R&D efforts and the commercial interest in higher performance at lower cost—and that includes power, cooling, and data-center space—suggests that GPUs could soon become a major force outside the graphics arena.

Until the mid-1990s, mainstream video cards were responsible only for displaying the contents of a specific memory region on the screen. Later, the graphics processor took over more and more of the CPU's tasks. But early GPUs were not programmable and could perform only very basic drawing operations.

With the growing demand for more-realistic computer games (the major force driving GPU evolution), the numbers started getting very interesting. A $400 off-the-shelf graphics card could compete with a top-of-the-line custom image-generator board that sold for $200 000 only a few years earlier. With all this power at their fingertips, developers naturally began to wonder: What else could you do with it? What we all really wanted to do was off-load the heavy lifting of compute-intensive tasks from the CPU to the GPU.

However, CPUs and GPUs have radically different architectures, so they require radically different programming approaches. From the programmer's perspective, a GPU is a parallel machine, whereas the CPU is a serial machine. A CPU executes one instruction after another, and each instruction does one thing—for instance, adding the contents of two memory locations or switching the contents of two memory locations. A GPU, on the other hand, can add many pairs of numbers at the same time, possibly all of them together at once. That's because it has many processors that are tightly linked to one another.
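The serial-versus-parallel contrast can be sketched in a toy example. This is illustrative only: the list comprehension stands in for the parallel hardware and does not actually run concurrently in Python.

```python
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# "CPU style": one addition per step, executed in sequence.
serial = []
for x, y in zip(a, b):          # each loop iteration is one instruction step
    serial.append(x + y)

# "GPU style": one operation conceptually applied across all pairs at once.
parallel = [x + y for x, y in zip(a, b)]

print(serial, parallel)  # [11, 22, 33, 44] [11, 22, 33, 44]
```

On a real GPU, each of the pairwise additions would be assigned to its own core and all of them would complete in a single parallel step.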

Why the different architectures? GPUs were built exclusively to plow through the myriad calculations that specify which pixels get lighted on the screen, and how. In computer graphics, scenes are made up of triangles. Each of those triangles is described by three vectors. (A vector is a set of coordinates that represents a point in three-dimensional space.) Each of those vectors consists of four values: x, y, z, and w (the w is just an algebraic factor to express a point using so-called homogeneous coordinates, which are much more convenient for geometrical transformations). So one triangle, with its three vertices, will be represented by {(x0, y0, z0, w0), (x1, y1, z1, w1), (x2, y2, z2, w2)}. Add to that the fact that each pixel of the triangle must be colored. The color of a pixel is also described by a vector of four values: red, green, blue, and the percentage of transparency. Now consider that a typical graphics scene has millions of triangles and that a piece of animation displays at least 24 frames per second. Before the ascendance of GPUs, to draw a triangle in 3-D space, the CPU would have to compute exactly what the triangle looked like, where the three vertices were with respect to the viewer, what pixels were visible, what part of the background was visible, the colors of all the pixels, and so on. It took too long.
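A minimal sketch of that representation (the coordinates are made up): each vertex is a homogeneous 4-vector, and the extra w value is what lets translation, which is normally an addition, be written as a 4 x 4 matrix multiply, just like a rotation or a scaling.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix by a 4-vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(dx, dy, dz):
    """Homogeneous translation matrix: maps (x, y, z, 1) to (x+dx, y+dy, z+dz, 1)."""
    return [[1, 0, 0, dx],
            [0, 1, 0, dy],
            [0, 0, 1, dz],
            [0, 0, 0, 1]]

# One triangle: three vertices, each a homogeneous 4-vector (x, y, z, w) with w = 1.
triangle = [(0.0, 0.0, 0.0, 1.0),
            (1.0, 0.0, 0.0, 1.0),
            (0.0, 1.0, 0.0, 1.0)]

# Move the whole triangle 5 units along x with a single matrix operation.
moved = [mat_vec(translation(5, 0, 0), v) for v in triangle]
print(moved[0])  # [5.0, 0.0, 0.0, 1.0]
```

A GPU applies exactly this kind of matrix-times-vector operation to millions of vertices per frame, all in parallel.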

Although a GPU has a limited vocabulary, its level of parallelism is tremendous. It can handle tens, hundreds, even thousands of these triangles at once without breaking stride.

Early on, researchers had the bright idea of taking all that raw power and putting it to work on applications other than graphics. In 2002, Mark Harris, now a computer graphics researcher with Nvidia, coined the term GPGPU, for "general-purpose computation on graphics-processing units."

So what's the catch? Well, let's start with the obvious one. GPUs are designed to do one thing extremely well—and only one: draw triangles on computer screens. After all, they were primarily designed to cater to the video-game market. Until recently, GPU programming was considered one of the "black arts" of computer science: It was esoteric and difficult, the realm of just a few researchers and hackers. It required a two-step mapping process. First, a programmer had to take the problem's data and map it to graphics data that the GPU could handle. Second, he had to figure out a way to express the operations required to solve the problem in terms the GPU could understand. That is, he had to essentially lie to the GPU, representing the computation as a triangle-drawing operation, the only thing a GPU could understand and carry out [see sidebar, "Parallel Play"].

To do anything with those vertices, you had to use a combination of programming tools specifically built for graphics, such as OpenGL, Microsoft DirectX, Cg, or the jerry-built assembly languages of individual GPUs.

The exercise might have been fascinating for us geeks (and make no mistake, it was). But it was also time-consuming: far from commercialization, more a riveting diversion than a practical tool. Then came CUDA. CUDA changed everything.

A visionary named John Nickolls, at Nvidia, recognized the huge potential of GPGPU. So in 2007, the Compute Unified Device Architecture was unleashed on the world. It divided the history of general-purpose processing on GPUs into two epochs: BC (before CUDA) and AC (after CUDA).

With CUDA, the GPU is much easier to program for general-purpose applications. Now it can understand instructions beyond drawing and digest data beyond coordinates and colors. For example, a programmer can now store the name "Jane Doe" as a simple sequence of characters {J,a,n,e,...} instead of as floating-point numbers.

Nvidia isn't the only company with a programming environment aimed at making GPGPU more accessible. Another major GPU manufacturer, ATI Technologies (now AMD Graphics Products Group), introduced a general-purpose GPU computing technology called Stream SDK. Both Nvidia and AMD Graphics, together with IBM, Apple, Intel, and others, are now putting their bets on OpenCL, a parallel programming language that will let developers write programs for CPUs, GPUs, and other processors in a uniform way. OpenCL is the logical evolution of CUDA. 

To understand why anyone would want to put GPUs to work on databases, you've got to first grasp the demands of modern database applications. Searching large-scale databases like Google or eBay usually means finding a small group of specific entries from among tens of millions of possibilities. A processor has to do this in milliseconds at most. And it has to perform this feat not for just one user but for hundreds or thousands of them, all searching the same database for different things at the same time.

Let's say you're looking on eBay for a replacement sound reproducer for a 1906 Edison windup phonograph. The sound reproducer converts the grooves on those old wax cylinders into sound. The specific part you're looking for is a 4-minute reproducer specifically suited to the hard plastic celluloid cylinders that were manufactured later in the century.

You type your query into eBay's search bar. That begins a chain of events. First, the form—the translator between the arcane code behind an application and what you see on your screen—collects your search keywords and sends them to the server.

The server application then checks those terms—"Edison sound reproducer"—against the approximately 20 million listings in eBay's currently open auctions to match some or all of your search terms to entries in the database. Within this cacophony, it finds 18 results that match.

Next, eBay's software displays these 18 entries on your screen. For each entry, it needs to load the item name, the associated pictures, the number of bids, and how many days the auction has left. And it has to cobble together all this information in under 2 seconds from multiple sources. Once all these different elements have been assembled as a Web page, the program loads them all into your browser.

But those 2 seconds must include not only the time it takes to search the database but also the time to retrieve and sort the results, load the pictures, and account for network latency, the network bandwidth, the connection speed, and the user's possibly slow PC. That means the time it took to access the wellspring of eBay data is only a minuscule fraction of the 2 seconds you waited to find out how many sound reproducers were on offer.

In fact, most of the time you spend waiting for a request is taken up by the transfer of data from the server application across various networks to your PC. Add up all those factors and the "reasonable response time" allowed to actually search eBay's databases ends up being on the order of a few milliseconds—at most.

Consider also that at any given moment, Google, Amazon, and eBay have to service huge volumes of requests. While you're looking for your sound reproducer, 200 people are looking for a purple sweater, 3000 are looking for deals on diamond rings, 6000 are combing through used textbooks, and so on. To give you some idea of the scope of that undertaking: With every query, eBay's software must comb through up to 24 million listings, Amazon probably sells hundreds of thousands of items per day, and Google's software must sift through the world's estimated 182 million registered domain names.

Disks—such as those used by these firms' servers—are too slow to deliver millisecond responses for more than a few requests at a time. Instead of waiting for hundreds or thousands of slow disks, wouldn't it be better to just keep the entire database in the much faster main memory? That's exactly what enterprise-class databases try to do: They chop up and distribute data across multiple servers specifically so they can fit the most frequently accessed pieces into memory. This guarantees subsecond response times even in heavy traffic, such as on Black Friday, when an online retailer might sell millions of items in one day.

But that scheme carries a substantial cost: Large server farms require large amounts of energy, cooling, and space. As the volume of online information and the user base keep skyrocketing—and as a result, the servers' workloads—companies have no choice but to keep increasing the number and performance of their servers. GPUs should be able to get the same job done while using a lot less electricity.

The enterprise software field has another reason to be chasing GPUs now. Enterprise data is all the information companies need to run their businesses. For eBay, that's every listing, user profile, feedback, and comment for its nearly 100 million users worldwide. Enterprise data have been growing at a slower rate than the number of transistors on microchips (see Moore's Law), so computer memory is growing faster than the amount of enterprise data.

The implications of that fact are tremendous: It is now possible to handle huge databases in main memory, which means that the data can be pulled in microseconds, as opposed to the several milliseconds needed for disk access. The actual time required for disk access is somewhere between 4 and 10 milliseconds, which might seem good enough, but a million accesses would add up to several hours. One application that can potentially benefit from GPUs is geospatial databases, whose data sets are similar to graphics data. These were actually the first to adapt GPUs for complex operations that require combining results from multiple data sets. The bottom line: When database performance is no longer limited by the response times of sluggish disk drives, a processor like a GPU can really shine.
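A back-of-the-envelope check of those timescales, using assumed figures within the ranges quoted above:

```python
disk_ms = 5          # ~4-10 ms per random disk access (assumed midpoint)
mem_us = 1           # main-memory access on the order of a microsecond
accesses = 1_000_000

disk_hours = accesses * disk_ms / 1000 / 3600   # total disk time, in hours
mem_seconds = accesses * mem_us / 1_000_000     # total memory time, in seconds

print(round(disk_hours, 2), mem_seconds)  # ~1.39 hours vs 1.0 second
```

The same million accesses that tie up a disk for hours complete from main memory in about a second, which is why in-memory databases change what a fast processor can usefully do.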

Surprisingly, search applications have not yet leveraged the massive parallelism of GPUs. We have developed a parallel search algorithm that optimally exploits the GPU by applying a "divide and conquer" strategy to speed up individual searches. We call our baby p-ary search, where p is the number of processors. To understand how it works, first consider the way you look up numbers in a phone book. Say you're looking for the phone number of somebody named Godot. Instead of flipping through every page, you could open the book in the middle to see whether the first entry on the current page comes alphabetically before or after Godot. Based on the result, you choose the first half or the second half of the book and continue this strategy. With every iteration you cut the search range in half, until you're down to a single page. With this strategy, finding the page containing Godot in an imaginary telephone book with 1024 pages will require 10 iterations. In computing terms, this algorithm is called binary search, which is possible, of course, because the phone book data you are searching is already sorted.
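The phone-book strategy just described is classic binary search. A minimal sketch, with the 1024-page book simulated as a sorted list:

```python
def binary_search(sorted_items, target):
    """Return (index, steps): where target was found and how many halvings it took."""
    lo, hi, steps = 0, len(sorted_items) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2            # open the book in the middle
        if sorted_items[mid] == target:
            return mid, steps
        if sorted_items[mid] < target:  # target comes later: keep the second half
            lo = mid + 1
        else:                           # target comes earlier: keep the first half
            hi = mid - 1
    return -1, steps

pages = list(range(1024))               # a sorted, 1024-"page" phone book
index, steps = binary_search(pages, 417)
print(index, steps)                     # finds page 417 in at most 10 halving steps
```

Each iteration discards half the remaining range, so 1024 pages never take more than 10 steps, just as described above.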

You could speed up the process by conscripting three friends, tearing the phone book apart into four equal pieces, and assigning each person to search his or her own piece. If everyone is using binary search, this would cut the search time from 10 steps down to 8. Of course, only one of you will find Godot, while the other three waste their time searching in vain.

If you have a spare phone book, there is a faster alternative. Each person looks at the first name of his section. The person whose section's first entry (for example, Beckett) alphabetically precedes Godot holds the relevant part. The other three sections are discarded. The relevant section is again torn into four pieces and handed out. This process continues until each person ends up with a single page and someone finds Godot.

This algorithm converges significantly faster on the solution than the previous one. Each iteration takes up the same amount of time—one lookup, but performed by four people in parallel. With each iteration the search range is reduced to one-fourth instead of one-half. Now searching the 1024-page phone book requires only five steps: 1024 to 256 to 64 to 16 to 4 to 1 [see diagram, "P-ary Search"].
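The idea can be simulated on a single thread as follows. This is an illustrative sketch, not the authors' GPU implementation: each "person" becomes one probe per round, and on a GPU all p probes of a round would execute simultaneously on separate cores.

```python
def p_ary_search(sorted_items, target, p=4):
    """Narrow a sorted range by a factor of p per round; return (index, rounds)."""
    lo, hi, rounds = 0, len(sorted_items), 0
    while hi - lo > 1:
        rounds += 1
        width = max(1, (hi - lo) // p)
        # Tear the current range into p sections; record their start positions.
        boundaries = list(range(lo, hi, width)) + [hi]
        for i in range(len(boundaries) - 1):
            # Person i checks the first entry of the *next* section: if it comes
            # after the target (or there is no next section), section i holds it.
            if boundaries[i + 1] >= hi or sorted_items[boundaries[i + 1]] > target:
                lo, hi = boundaries[i], boundaries[i + 1]
                break
    return (lo if sorted_items[lo] == target else -1), rounds

pages = list(range(1024))
index, rounds = p_ary_search(pages, 417, p=4)
print(index, rounds)   # 1024 to 256 to 64 to 16 to 4 to 1: five rounds
```

With p = 32, the same function narrows 1024 entries to 1 in just two rounds (1024 to 32 to 1), since each round divides the range by 32.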

The astute reader will notice that the number of steps was reduced by only 50 percent, even with plenty of help. While this appears to be a small gain—and at considerable cost—it becomes a lifesaver when you have to search tens of millions of entries in a fraction of a second. If the resources are available—and today's high-end GPUs have more than 200 cores—you can throw even more manpower at p-ary search to improve responsiveness. Using 32 people (or processors), for example, you can search 1024 entries in two steps.

With p-ary search a GPU can outperform a similarly priced CPU by up to 200 percent for very large workloads: Thousands of searches over data sets with millions of entries are not an unusual scenario for large Web applications. Only then does the GPU reach its peak performance, pumping out more than 6 million search results per second. To do a single search of a 1024-page phone book, there is no reason to bother with a GPU.

That's because the GPU resides on a PC card. Any data, including search queries, need to be copied via the relatively slow PC-card bus before the GPU can get its hands on them. Applications with high computational intensity, such as physical simulations, do not suffer from these limitations. However, most database applications have a relatively low arithmetic intensity but require handling large amounts of data.

As graphics-card memory sizes increase at a rapid rate—up to 4 gigabytes on the newest models—using the video memory as a data cache reduces traffic on the PC-card bus, thus improving performance. Newer generations of GPUs will also enable asynchronous data transfers to and from main memory, which does not reduce memory traffic but allows the GPU to return results as they become available, thus improving response time.

More-powerful databases and more-realistic video games aside, GPUs per se do not enable anything radically different from what can already be done with today's CPUs. However, they may very well be the key to an epochal change. GPUs are democratizing supercomputing the way the PC democratized computing, making an enormous amount of computational power—previously the exclusive domain of government agencies, research institutes, and large companies—available to the masses.

For example, it is possible that somebody out there has the right algorithm to design a human-equivalent machine intelligence but could not afford enough computational power to make it work—until now. After all, what we humans call "experience" in artificial intelligence is a database.

About the Author

Andrea di Blas and Tim Kaldewey, both researchers at Oracle Corp., write about their work using graphics processors to comb through enterprise databases in "Data Monster". Di Blas, originally from Turin, Italy, was a researcher and lecturer at the University of California, Santa Cruz, when he met Kaldewey. "He was one of my most brilliant students," Di Blas says.
