Winner: Cure for the Multicore Blues

Michael McCool has the prescription for programmers paralyzed by parallel processing

Illustration: Sean McCabe; Original Photo: May Truong

Michael McCool of RapidMind

Just three weeks before the 2006 Game Developers Conference in San Jose, IBM had a problem. The company desperately needed a boffo, unforgettable piece of computer-generated imagery to demonstrate the power of the new Cell nine-core microprocessor, which Big Blue had just developed with Sony and Toshiba. The chip, produced at a cost of US $400 million, was set to debut in Sony’s new PlayStation 3 game console in November, but developers who had been tearing their hair out trying to program games for the Cell’s new architecture didn’t yet have any seriously flashy footage to present at the March show.

So IBM turned to the chicken wrangler.

Actually, he’s a 39-year-old computer science professor and software entrepreneur named Michael McCool. In just one weekend, his company, RapidMind, in Waterloo, Ont., Canada, used the programming platform that McCool has been working on for nearly a decade to create a crowd simulation of 16 000 individual chickens.

Imagine the biggest flock of virtual fowl ever assembled. Each chicken is controlled by a simple artificial intelligence program, operating according to a handful of rules. Each chicken wants to move toward the rooster but must avoid collisions with other chickens, fences, and the barn. To do so, each one must constantly check the position of its nearest neighbors and other objects in its environment and then decide how to move.

If that doesn’t sound all that impressive to you, consider this: all 16 000 of those faux chickens are doing this maneuvering at the same time on a single Cell microprocessor. It is a chore that would tax a rack full of conventional servers.
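The per-chicken rules described above can be sketched in a few lines of Python. This is an illustrative sketch only, not RapidMind's demo code: the function name `step_flock`, the weights `attract` and `repel`, and the neighbor `radius` are assumptions made for the example, and the fences and barn are left out.

```python
import math

def step_flock(positions, rooster, radius=1.0, attract=0.05, repel=0.1):
    """Advance every chicken one time step and return the new positions.

    Each chicken's update reads only the old positions, so the updates
    are independent of one another and could run in parallel.
    """
    new_positions = []
    for i, (x, y) in enumerate(positions):
        # Rule 1: move a small fraction of the way toward the rooster.
        dx = attract * (rooster[0] - x)
        dy = attract * (rooster[1] - y)
        # Rule 2: step away from any chicken closer than `radius`.
        for j, (ox, oy) in enumerate(positions):
            if i == j:
                continue
            dist = math.hypot(x - ox, y - oy)
            if 0.0 < dist < radius:
                dx += repel * (x - ox) / dist
                dy += repel * (y - oy) / dist
        new_positions.append((x + dx, y + dy))
    return new_positions
```

This naive version compares every chicken with every other chicken. A flock of 16 000 would presumably need a faster neighbor search, but the structure is the point here: the same small program runs once per chicken, on that chicken's own data.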

After viewing the virtual barnyard at the IBM booth during the game conference, one new fan gave the RapidMind team a rubber chicken. The company’s developers stashed the gag gift near an air-hockey table in the office rec room. Now, every time programmers hit a new performance benchmark, one of them grabs the chicken and squeezes until it emits an unholy scream.

The masterminds at RapidMind thoroughly abused that poor bird as they prepared for last month’s release of the RapidMind Development Platform 2.0, the first software tool to help programmers write code for microprocessor chips like the Cell as well as for graphics processors from ATI Technologies, Nvidia Corp., and other companies. What the processors have in common is that they are all multicore chips; that is, each individual chip has several or even dozens of processing units, called “cores.” By the middle of this year, RapidMind plans to release version 3.0 of the platform, designed to support multicore CPUs from Intel and Advanced Micro Devices.

RapidMind’s timing couldn’t be better. While the Moore’s Law–decreed doubling of transistors goes on unabated every 18 months, AMD, IBM, Intel, and others have determined that all those transistors can’t switch on and off much faster than they already do. Clock speeds top out at around 4 gigahertz, beyond which a microprocessor starts getting hot enough to spontaneously combust. So instead of making smaller chips that run faster, the near-term strategy is to keep chips the same size but put more processor cores in them.

What the Experts Say
NICK TREDENNICK: Efforts to extend standards-based, serial programming languages with features to describe parallel constructs are likely to fail. What is more likely to succeed are languages that raise the level of abstraction in algorithm description.

The multicore revolution started several years ago with graphics processing units (GPUs) made by ATI and Nvidia. Today, graphics chips sport dozens of cores. Now other kinds of multicore chips are establishing themselves in the mainstream: the Cell is already available in the PlayStation 3 and is moving quickly into servers, televisions, and other applications. And four-core CPUs from AMD and Intel are scheduled to ship within the next few weeks.

There’s more to come: Intel unveiled a prototype chip with 80 cores in September, part of a research project whose goal is to create a single chip capable of processing 1 trillion floating-point operations per second.

Of course, there’s a catch. The tantalizing possibilities of multicore chips (stunningly realistic and densely populated games, faster scientific computations, more accurate modeling of seismic, medical, and financial data) all depend on the ability of programmers to routinely solve programming challenges beyond those they face today. Specifically, programmers are going to have to write programs that are divided into parts that run in parallel on several processors simultaneously, a chore that has proven fiendishly difficult in the past.

“We’re in a period of pain and turbulence for application designers,” says Carl Claunch, vice president of research and advisory services at Gartner Research, in San Jose. “Trying to do more and more in parallel adds stress, and we don’t have good tools for it right now.”

Developers are accustomed to writing programs that execute functions one after another in serial fashion on one or maybe two microprocessor cores. Before the debut of the Cell chip a year ago, parallel programming was largely confined to niches in high-performance computing and academic computer science. So until now, programmers hacking out the multicore version of a game or three-dimensional simulation have been literally left to their own devices.

The results aren’t shabby, but they’re far from optimal. Developers at Insomniac Games, the Burbank, Calif., publisher of Resistance: Fall of Man for the PlayStation 3, had to create their own programming tools and teach themselves how to allocate different programming tasks to the Cell’s nine different cores [see “The Insomniacs,” IEEE Spectrum, December 2006]. Their bootstrapping methods took them only so far, however. For instance, because their software couldn’t automatically allocate tasks to whichever core was available, Insomniac programmers had to dedicate two cores to handle collisions in situations where carnage and chaos among men, monsters, and machines needed to be approximated in real time and in living color.

Hina Shah, director of IBM’s Cell Ecosystem and Solutions Development unit, has heard from customers about the new challenges the Cell presents and has a full-time job seeking solutions to ease their pain. “Today, if a developer is going to program for Cell directly, they would have to change relevant parts of their application and manage all aspects of porting it to Cell,” she says.

“The nice thing about RapidMind is that you don’t need to change your whole program,” Shah adds. “You can just pick parts of your application that should be accelerated, and instead of changing that code to program all of Cell’s cores by hand, you simply use a programming interface that handles a lot of the complications on its own.”

Theoretically, RapidMind’s platform could help programmers code their entire applications to run on multiple cores. In practice, users have fed the RapidMind platform the most computationally intensive portions of their programs. The platform accelerates these chunks by breaking them up into smaller pieces and running them in parallel on several processor cores at once.

The RapidMind platform started as a language, called “Sh,” that McCool developed for graphics processors. The language grew out of an insight McCool had years ago—that the massively parallel computing provided by a graphics processor’s multiple cores can be used for things other than rendering pixels.

Recent results bear this out. Researchers at Hewlett-Packard, in Palo Alto, Calif., reported in November that a graphics processor programmed with the RapidMind platform executed an options-pricing program called the “Black-Scholes benchmark” 32.2 times as fast as a general-purpose CPU.
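To see why options pricing suits a parallel chip, consider the textbook Black-Scholes formula for a European call option. The sketch below is not the Hewlett-Packard benchmark code, whose details are not given here; the function names and the sample inputs are assumptions chosen for illustration.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(spot, strike, rate, vol, years):
    """Price of a European call option under the Black-Scholes model."""
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * years) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)

def price_portfolio(options):
    """Price each (spot, strike, rate, vol, years) tuple independently."""
    return [black_scholes_call(*option) for option in options]
```

Each option's price depends only on its own five inputs, so a large portfolio can be split across as many cores as are available without any core waiting on another.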

McCool, who exudes an endearingly geeky bravado and who still teaches computer science at the University of Waterloo, in Ontario, says putting the RapidMind platform in the hands of the people who need it the most was the best way to realize the full potential of his research. “It wasn’t really about us making lots of money—although that’s nice,” he says. “For me it was about cool technology and using it in the real world with real customers.”

So three years ago, he asked his research assistant Stefanus Du Toit to use the Sh language to create a programming platform for multicore processors. Together, McCool and Du Toit founded the company that would become known as RapidMind.

It took Du Toit about a year, but in the end, he and McCool had something good enough to show Matthew Monteyne, a former senior product manager with Waterloo’s most famous technology company, Research in Motion (RIM), maker of the BlackBerry wireless e-mail device. Monteyne, now vice president of sales and marketing at RapidMind, recruited his former boss, Ray DePaul, director of BlackBerry product management, to come on board as president and CEO. In McCool’s prototype, they both sensed an unusual opportunity.

“There hasn’t been a revolution in processors and how you program them since maybe object-oriented programming in the early ’90s,” DePaul says.

The introduction of a disruptive technology like multicore CPUs provides a great chance for small companies to pounce. “You don’t come into mature markets,” DePaul says. “You come in when there’s this whirlwind of activity, and the big guys are too focused on the current business that they can’t go after the new opportunity.”

McCool’s goal for a commercial product was simple enough: “I wanted to build something that I could teach in about 10 minutes, that you could use without mental overhead so you can focus on the algorithms, not the details of the particular processor,” he explains.

Programmers need to focus on devising parallel algorithms because RapidMind can’t write parallel algorithms for them. No software can. While there has been a lot of research into automatically parallelizing applications for programmers, no such system has been commercially viable. “People have been working on this for 20 or 30 years, and it doesn’t look like it’s a solvable problem,” McCool says.

That means programmers accustomed to writing serial algorithms must learn how to think about parallel algorithms. One of the benefits of working with the RapidMind platform is that users become familiar with a conceptual model of a parallel machine. “It’s similar enough to a real parallel machine that you can reason about what is an efficient way to implement an algorithm,” McCool says.

To write an application using RapidMind, the programmer first identifies the components to accelerate. These tend to be the numerically intensive operations. For instance, a chip running a game might spend a lot of time computing physical interactions between hundreds of thousands of objects, computations that would speed up tremendously if done in parallel. That’s in contrast to trivial operations such as tabulating the player’s score or processing input from a game-controller button or joystick.

The RapidMind platform is designed to be incorporated into any program written in C++, one of the most widely used programming languages in the world. Programmers write their programs in C++, using their favorite C++ editing and debugging programs, of which there are hundreds. Next they select the portion of the program to be accelerated and formulate the necessary parallel algorithms. Then they write code that expresses those algorithms.

Several features make the task easier. Like any modern high-level programming language, C++ has a library of commonly needed subroutines and functions, simplifying life for programmers. When they need one of those functions—sorting a set of numbers, say—they merely insert a word in their program that calls it up. However, while working with the RapidMind platform, instead of writing code using ordinary C++ terms that refer to subroutines and functions in a C++ library, the programmer uses words from RapidMind’s vocabulary that refer to subroutines and functions stored in the RapidMind library. These words call up subroutines and functions that execute in parallel. The programmer must specify the data sets that will be operated on in parallel, but the subroutines take it from there.

Programmers don’t need to know any of the specifics of the chip their software will run on. When the program starts up, the RapidMind platform determines whether it is running on a graphics processor, a Cell, or something else and translates the code that the programmer has written into code that the particular chip understands.

At the same time, the platform breaks up arrays of data into chunks that get doled out to however many cores are available on the target chip. The more cores, the more finely the chunks are chopped. To ensure that each core is working on something all the time, the system assigns data and tasks to cores on the fly, depending on which ones signal that they are free for the next piece of work. So, for example, while one core is churning through an especially complicated operation for a long time, its fellow cores can be kept busy with lots of simpler operations.
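The chunk-and-dole-out scheme described above can be approximated with Python's standard thread pool. This is a sketch of the general idea, not RapidMind's scheduler: `split_into_chunks`, `parallel_map`, and the `chunks_per_worker` parameter are hypothetical names introduced for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(data, n_chunks):
    """Split `data` into at most `n_chunks` contiguous, near-equal pieces."""
    n_chunks = max(1, min(n_chunks, len(data)))
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        end = start + size + (1 if k < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def parallel_map(func, data, workers=4, chunks_per_worker=4):
    """Apply `func` to every element, handing chunks to workers on the fly."""
    # More workers means more, smaller chunks.
    chunks = split_into_chunks(data, workers * chunks_per_worker)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Pending chunks wait in a shared queue; a worker that finishes
        # early simply takes the next chunk instead of sitting idle.
        results = pool.map(lambda chunk: [func(x) for x in chunk], chunks)
        return [y for chunk_result in results for y in chunk_result]
```

For example, `parallel_map(lambda x: x * x, list(range(10)), workers=3)` returns the squares in their original order. In standard CPython, threads do not speed up pure-Python arithmetic, so the sketch shows the scheduling pattern rather than a real performance gain.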

What the Experts Say
GORDON BELL: Computer scientists haven’t been interested in programming clusters. If putting the cluster on a chip is what excites them, fine. It will still have to run Fortran!

Without such dynamic load balancing, computationally intensive applications, including real-time ray tracing, are extraordinarily difficult to pull off. Real-time ray tracing is a technique that models the paths and effects of light as it interacts with various surfaces. Typically, millions of rays hit dozens or hundreds of objects, where the rays can be absorbed, reflected, or refracted. Of course, most of the rays miss the objects and keep going—events that McCool calls cheap operations because the path they trace remains the same. The expensive calculations, the ones that must be performed to determine the trajectory of a ray when it hits a drop of water, say, can require 100 times as much work on the processor’s part.

Because the RapidMind platform can dynamically allocate both cheap and expensive tasks, the ray-tracing application can take full advantage of the power of parallel processing to execute in real time. That’s because there are so many pixels whose color and shading need to be determined at any one time that all the processors can be occupied with computational tasks. Compare that to, say, an Intel Xeon dual-core chip running an operating system, a Web browser, and some desktop applications. Its two processors might sit idle half the time waiting for something to do. The RapidMind platform strives to ensure that no core—or clock cycle, for that matter—goes to waste.
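A back-of-the-envelope simulation shows how much the assignment policy matters when a few tasks cost 100 times as much as the rest. The sketch below compares a fixed, up-front split with on-the-fly assignment; the function names and the sample workload are assumptions for illustration, and the 100-to-1 cost ratio is the one quoted above.

```python
import heapq

def static_makespan(costs, cores):
    """Finish time when tasks are pre-assigned in equal contiguous blocks."""
    if not costs:
        return 0
    block = -(-len(costs) // cores)  # ceiling division
    return max(sum(costs[i:i + block]) for i in range(0, len(costs), block))

def dynamic_makespan(costs, cores):
    """Finish time when each task goes to whichever core frees up first."""
    free_at = [0] * cores
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)          # earliest available core
        heapq.heappush(free_at, start + cost)   # busy until start + cost
    return max(free_at)

# Four expensive rays (cost 100) clustered together, then 396 cheap ones.
costs = [100] * 4 + [1] * 396
print(static_makespan(costs, 4), dynamic_makespan(costs, 4))
```

With four cores, the fixed split takes 496 time units because one core receives all four expensive rays, while on-the-fly assignment finishes in 199, which for this workload equals the total work divided evenly among the cores.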


Photo: RapidMind
RTT’S RAY TRACER exploits parallel processing to render a stunningly photo-realistic headlamp.

Real-time ray tracing has been a holy grail for makers of high-end visualization software for the automotive and aerospace industries. So when Munich-based Realtime Technology (RTT), whose customers include BMW, DaimlerChrysler, Lamborghini, Porsche, and Volkswagen, caught wind of McCool’s research three years ago, the company approached him to help it develop real-time ray tracing for its flagship DeltaGen visualization program.

RTT’s customers have specific needs that only real-time ray tracing can adequately address. They need to be able to tweak their designs—for example, for a car body—and instantly see, in a 3-D virtual-reality setup, how it affects the car’s design aesthetics. And they need to be able to see the effects of a tweak from multiple perspectives. Traditional rendering techniques don’t provide the realism necessary to visualize front and rear lamps, headlights, windshields, or even paint jobs.

Seven months ago, RTT customers got their hands on the world’s first real-time ray-tracing visualization program that runs on a single workstation, as opposed to a cluster of computers. The RTT system uses an Nvidia Quadro FX5500 graphics processor and the RapidMind platform to accelerate the ray-tracing portion of the software, and the resulting virtual-reality models are stunningly photo-realistic.

At the 2006 Siggraph computer graphics conference in Boston last summer, RTT demonstrated its real-time ray tracer on the Cell. The migration from the GPU to the Cell took just a few days, mainly thanks to the RapidMind platform. RTT cofounder Ludwig Fuchs adds that his company is now enjoying a competitive advantage. “We know that other guys have tried to do a migration of real-time ray-tracing stuff on the Cell,” Fuchs declares, “and they did not succeed.”

With satisfied customers and IBM endorsing its approach, RapidMind has staked out a solid position on the frontier of multicore programming. But like any pioneer hunting for riches in unexplored territory, RapidMind will soon have company. PeakStream, based in Redwood Shores, Calif., was started in 2005 by Stanford University professor Pat Hanrahan and former Nvidia GPU architect Matthew Papakipos. The company has attracted $17 million in funding from Kleiner Perkins Caufield & Byers, Foundation Capital, and Sequoia Capital.

PeakStream is likely to need all that help as it chases RapidMind. At press time, PeakStream’s platform was still in beta and supported only one graphics processor, the ATI Radeon R580. And while PeakStream works to support other chips and the Cell, RapidMind is signing up more paying customers.

RapidMind’s BlackBerry veterans may have a sense of déjà vu: once again they’re at a sharply focused start-up in Waterloo with a novel technology and a wide-open market.

“The biggest challenge at RIM was: How do you make this extremely complex device easy for a Fortune 1000 company to buy?” recalls DePaul. “And that’s the model here. You can’t tell customers, ‘Don’t worry about C++ anymore, we’ve got something better.’ We’re making it extremely easy for them to take advantage of the multicore revolution without forcing them to learn something completely new.”

Easy for customers. Hard on the rubber chicken.

To Probe Further

For a deep read into the complexities of the Sh programming language, which is the basis of the RapidMind Development Platform, pick up: “Metaprogramming GPUs with Sh” by Michael McCool and Stefanus Du Toit, published by A.K. Peters Ltd., Wellesley, MA (2004) (

To read about how Hewlett-Packard researchers used RapidMind to accelerate various applications, go to

To learn more about RapidMind, go to its Web site:
