Winner: Multicore Made Simple
Intel's Larrabee is a chip every designer already knows how to program
This is part of IEEE Spectrum's SPECIAL REPORT: WINNERS & LOSERS 2009, The Year's Best and Worst of Technology.
Corrected 14 January 2009
Chips based on Intel’s Larrabee architecture aren’t on the shelves yet, but the design is already a hot property because it promises to beat the chips that game designers now use to model graphics.
And the market for such graphics capability goes beyond video games. Consider Monsters vs. Aliens, which DreamWorks Animation plans to release later this year. In classic B-movie style, the film will be in three dimensions, but this won’t be your daddy’s 3-D. Because the digital stereoscopic system delivers a different perspective to each eye so deftly, it won’t strain your eyes or turn your stomach. That’s why the technology can show you giants battling not just for minutes at a time but for the entire hour, and why it’ll dominate the market as its ’50s-era forerunner never could.
Graphics like these don’t happen with commercial off-the-shelf graphics chips, and that’s where Intel’s Larrabee comes in. It’s an architecture, now in late development, for a multicore general-purpose graphics-processing unit (GPGPU), one of many in a new category that’s sometimes called a hybrid because it combines the functions of a multicore central-processing unit (CPU) with those of a graphics-processing unit (GPU). The idea is to do the jobs exclusive to each kind without wasting time on interchip chatter. But Larrabee has critical advantages over the other designs. First, Intel claims that it’ll provide greater speed at a lower cost. Second, Larrabee is based on Intel’s x86 architecture, which millions of developers know like the backs of their hands, and it can be programmed in C and C++, languages they know like the roofs of their mouths. Finally, Larrabee is backed by the full weight of Intel.
Though graphics applications are what Larrabee’s intended for, they won’t be all it ends up doing. Chips based on the Larrabee architecture may one day allow your computer to watch a sporting event, identify the players, pick out the most significant, and generate highlight reels automatically. They may enable your laptop to sift through thousands of digital photos, correctly identify the people in each one, and label them so you can find exactly the snapshots you want. They may even be able to help researchers manage and then visualize vast amounts of data in such fields as genetics, geophysics, finance, and computational neuroscience.
To understand what’s at stake requires a bit of history. During the PC’s first 15 years or so, its every task was handled by a CPU, the increasing sophistication of which can be seen in the evolution of computer graphics. In the early years, computer characters and images were constructed out of giant pixels—think of the first several iterations of the game character Mario. In the 1980s, Mario was a vaguely humanoid clump made of squares; by 1996, Super Mario 64 had become a 3-D character limned by smooth curves, drawn in perspective. Computer animation had gained more detail; color palettes had become richer. Virtually all CPUs and software have evolved to the point where it’s no longer imperative to spend US $200 on a separate graphics card; without one, you can still get pretty good 3-D graphics.
The improvements came mainly by force: CPUs just got faster. They could handle the job because CPUs can, in principle, do almost any job. Rendering graphics was just one more thing they could handle—right up until computer graphics got a third dimension. At last CPUs had come up against a task for which their general-purpose design was poorly suited, and rendering slowed to a crawl.
The problem was that CPUs are designed to perform tasks one after the other, through the repetition of four main steps: the CPU fetches an instruction from the program memory, decodes it, executes it, and returns the results to memory for access later on. But generating the triangular building blocks for 3-D graphics, and updating them every time the screen refreshes, puts a great strain on computer resources. Accessing the memory from which the next instruction must be plucked is often slow, and that means the CPU sits idle as it waits for the instruction to be returned.
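The four-step cycle described above can be sketched as a toy interpreter. This is a minimal illustration, not a model of any real processor: the instruction names (LOAD, ADD, STORE, HALT), the single accumulator, and the function name run are all assumptions made up for the example.

```python
# Toy machine for illustration only: the instruction names (LOAD, ADD, STORE,
# HALT) and the single accumulator are assumptions, not any real CPU's design.

def run(program, memory):
    """Execute `program` one instruction at a time against `memory`."""
    pc = 0    # program counter: index of the next instruction
    acc = 0   # accumulator: the toy machine's only register
    while pc < len(program):
        instruction = program[pc]      # 1. fetch from program memory
        opcode, address = instruction  # 2. decode into operation and operand
        if opcode == "LOAD":           # 3. execute
            acc = memory[address]
        elif opcode == "ADD":
            acc += memory[address]
        elif opcode == "STORE":
            memory[address] = acc      # 4. write the result back to memory
        elif opcode == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        pc += 1                        # nothing overlaps; move on to the next
    return memory
```

Every pass through the loop finishes before the next one starts, so any delay in the fetch step stalls everything behind it. That serial dependence is the bottleneck the paragraph above describes.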
The computer industry solved the problem a number of years ago by off-loading most of the repetitive stuff to a companion processor optimized to handle such tasks—the GPU. It has two main advantages: parallelism and hardwired instructions.
Parallelism allows a GPU to retrieve many instructions from memory at the same time. Say you want to look for every mention of the word supercalifragilisticexpialidocious in a 400-page book. Using the CPU method, you’d be doing the equivalent of reading every single page yourself, one after another. With the GPU method, you’d rip the book into 400 pages and hand them out, four pages each, to 100 friends.
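The book analogy can be made concrete with a short sketch. The book contents, the page numbers that contain the word, and all function names here are hypothetical. Python threads stand in for the 100 friends only to show how the work is divided; they are not a claim about how a GPU schedules work, and in standard Python they will not make this particular search faster.

```python
from concurrent.futures import ThreadPoolExecutor

WORD = "supercalifragilisticexpialidocious"

def make_book(num_pages=400, hits=(7, 123, 399)):
    """Build a made-up book; only the pages listed in `hits` mention WORD."""
    return [
        f"page {i} filler text" + (f" {WORD}" if i in hits else "")
        for i in range(num_pages)
    ]

def search_sequential(book):
    """The 'CPU method': one reader checks every page in order."""
    return [i for i, page in enumerate(book) if WORD in page]

def search_chunk(start, pages):
    """One friend's job: check a few pages and report their page numbers."""
    return [start + offset for offset, page in enumerate(pages) if WORD in page]

def search_parallel(book, friends=100):
    """The 'GPU method': split the book into equal chunks, one per friend."""
    per_friend = max(1, -(-len(book) // friends))  # ceiling division
    chunks = [
        (start, book[start:start + per_friend])
        for start in range(0, len(book), per_friend)
    ]
    with ThreadPoolExecutor(max_workers=friends) as pool:
        results = pool.map(lambda chunk: search_chunk(*chunk), chunks)
    return sorted(page for found in results for page in found)
```

The split works because no page depends on any other: each friend can finish without waiting for anyone else, and the results are simply merged at the end.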
Hardwiring speeds things up by implementing frequently used instructions in dedicated circuitry rather than in software. It’s like signing your name with a rubber stamp instead of a pen. Hardwired instructions made GPUs faster at processing graphics, and GPUs became every gamer’s object of desire. But you still had to have that CPU, because GPUs are chips of much brawn but very little brain. They need a CPU to tell them what to do.
2 TERAFLOPS: Performance of Larrabee’s single graphics card, due to be released in 2009
Another way to think of it is to describe the GPU as the specialist and the CPU as the generalist. The specialist will get the task done much more quickly than the generalist—provided you give him the right task. For a GPU, the right task is any book that can be ripped into 400 independent pages. Graphics is one such task, and for its sake—particularly in game applications—the GPU became common.
Meanwhile, as game designers cried out for new capabilities and vendors answered by hardwiring them in, GPUs began to groan under the weight of the accreted wiring, much of it a pointless legacy from forgotten applications. In the end, GPU vendors simply could not provide the desired functions without fabricating a new GPU version every few months. Something had to give.
So GPUs needed to become programmable—in other words, more like CPUs. Thus was born the general-purpose graphics-processing unit: the GPGPU.
Nvidia and archrival AMD pounced on the task. AMD had acquired graphics legend ATI in 2006 to compete in the GPU market, and a few months later it released its Stream Processor line of GPUs. Meanwhile, Nvidia had already begun to augment its GPU chips with features normally associated with CPUs, including cache memory. But you can’t treat a GPU like a CPU, because GPUs are constructed differently. So in November 2006, Nvidia announced its proprietary software developer kit, CUDA (Compute Unified Device Architecture), which allows application developers to do nongraphics work on a graphics chip. CUDA was the companion language for Nvidia’s high-performance GPGPU line, Tesla, which the company introduced in June 2007. Tesla was targeted at the high-performance computing market: simulations for the oil, gas, and finance industries—anything, in other words, that required high computational power. AMD’s competing kit, Close to Metal, accompanied the Stream Processor chips, but it was no match for CUDA. Thus, thanks to its head start, Nvidia is now the leader in the GPU market, right ahead of AMD.
“Three years from now, the GPU and the CPU will be a single chip,” says Nathan Brookwood, an industry analyst with Insight 64, a semiconductor market research group. “The question is, what is that chip going to look like?”
There are two schools of thought. Nvidia’s approach is to retrofit a GPU to give it some CPU functionality. The problem is the same problem GPUs have always faced, says David Kanter, the guru behind Real World Technologies, a leading semiconductor and technology analysis site: “When you get into situations where you do need control logic, your performance drops off a cliff.”
Intel’s approach is to make a more parallel CPU, one that can play traffic cop to a huge horde of programmable GPUs. This means building the thing pretty much from scratch, except for the foundation in the x86 architecture.
To call Larrabee a multicore chip is a bit of a misnomer; it’s a “many-core” chip. Many-core, to oversimplify, is to a GPU what multicore is to a CPU. These are not split hairs: Larrabee architects laid out their design differently from that of traditional GPU structures. Where most GPUs organize their processor cores as discrete blocks, Larrabee’s architecture has the core processors situated in line (rather than in parallel, as is typical with GPUs) and connected in a ring shape, with all those cores sharing access to the cache memory.
That approach sidesteps a typical GPU problem, in which one core processor modifies a block of memory without the other cores being aware of it, which slows down the work flow as they wait for that block of memory to be freed up.
And Larrabee’s been designed on a clean slate; many of the old hardwired instructions are gone. For example, it used to be that every time some new graphics technique was hardwired into a graphics chip, every single game suddenly had that one feature. Take “lens flare”: typically, it’s an undesirable photographic artifact, but in the early ’90s it was modeled as one of the first special effects for computer graphics. Before long everyone was using it, and that’s how 1995 became the year of lens flare.
But these new tricks had a price; anything hardwired into the circuitry of a GPU could be done only in that one way. That’s not the case with Larrabee—all the instructions can be carried out by any software a developer can throw at it. Of course, Nvidia and ATI GPUs have also been highly programmable for at least five years. The difference is that a Larrabee chip will also be able to handle the nongraphics aspects of the software, like the complex physics of natural hair or cloud movement.
“In many ways, Larrabee is like the Cell processor,” says Insight 64’s Brookwood. But because of its arcane structure, the Cell processor is very hard to program. “It drives game developers nuts,” he says. “Larrabee will have the same capabilities, but it’ll be easier for a programmer to get his head around.”
And Real World’s Kanter questions Nvidia’s interpretation of the “GP,” or general-purpose, part of a GPGPU. To be sure, he says, “Nvidia packs a ton of execution resources onto a single chip to go after the graphics market.” But while the 240 cores of a recently released Nvidia Tesla GPGPU sure are brawny, they may not have the brains to compete with Larrabee. “From a microprocessor-engineering perspective, a core is something that can get its own instructions, figure out what the instructions mean, and then do it,” Kanter says. “What Nvidia is calling a core is really just the ‘do it’ part, so their 240-core chips are not competitive with what Larrabee will offer.”
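One way to read Kanter’s distinction is sketched below. It is an interpretation for illustration only and does not model any actual Nvidia or Intel chip: the function name run_shared_stream, the two opcodes, and the dictionary-based lanes are all assumptions. A single fetch-and-decode step serves every lane, and each lane contributes only the execution step.

```python
# Hypothetical sketch: one instruction stream shared by many execution lanes.
# Opcodes and data layout are invented for this example.

def run_shared_stream(program, lanes):
    """Apply each instruction in `program` to every lane's accumulator."""
    for opcode, operand in program:   # fetch and decode: done once, shared
        for lane in lanes:            # each lane only performs the 'do it' part
            if opcode == "ADD":
                lane["acc"] += operand
            elif opcode == "MUL":
                lane["acc"] *= operand
            else:
                raise ValueError(f"unknown opcode: {opcode}")
    return [lane["acc"] for lane in lanes]
```

In this arrangement no lane can follow a different sequence of instructions from its neighbors, which is the limitation the quote points at. A core in Kanter’s sense would own the outer loop as well as the inner step.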
Expert Call: “It’s a Moore’s Law solution looking for a billion-transistor problem.”
--T.J. Rodgers, founder and CEO, Cypress Semiconductor Corp.
But most of all, Brookwood says, Larrabee stands to benefit from the existential problems of CUDA. At this point, proprietary approaches are frowned upon, and CUDA is a proprietary language for proprietary hardware. “Software developers write stuff they want to be able to run on a lot of different platforms,” he says. “If I educate all my programmers in CUDA, I can’t jump ship to a better solution later, because I’m locked into Nvidia.” Prior efforts to create an open GPU computing language resulted in Open Computing Language (OpenCL), which is being pushed heavily by Apple, and AMD’s high-level Brook language. But Brook is obscure and OpenCL isn’t quite ready for prime time.
Intel can fill that void with Larrabee for one very important reason: it’s based on the x86 architecture most modern programmers cut their teeth on.
To be sure, there are lingering questions. Larrabee isn’t yet a chip or even a prototype, just a planned prototype. The first test chips came out only a few weeks ago, and the product won’t reach store shelves until late 2009.
“This is the elephant that’s not in the room,” scoffs Andy Keane, general manager of Nvidia’s GPU Computing business unit. He says that by the time Larrabee enters into mass production, it will already be obsolete. Nvidia has already sold more than 100 million of its CUDA-enabled GPGPUs, and “by the time Intel starts selling their Larrabee chips,” Keane says, “we will have millions of our chips in laptops.” In an effort to make CUDA as familiar to programmers as C++, Nvidia launched the CUDA Center of Excellence program, which provides high-end equipment and support to universities in exchange for CUDA integration into the curriculum. To that end, last July Nvidia donated $500 000 and an $800 000 64-GPU CUDA technology cluster to the University of Illinois at Urbana-Champaign for computational biophysics research.
And it’s still tough to benchmark Larrabee, because Intel has not released key specifications, including clock speeds and even the number of cores. “It’s too soon to declare it a winner, in my opinion,” Microprocessor Report senior analyst Tom Halfhill says. And Brookwood stresses that few applications have been written yet that can take advantage of many-core architectures.
But there is an elephant in the room, and its trumpeting can already be heard: the x86 cores that will populate the Larrabee-based chips. This is the architecture today’s developers grew up with; it’s what they learned to code on, using the most ubiquitous programming languages, and Intel says none of the competing products will offer that capability. A chip as programmable as an x86 CPU could tilt the field to Intel’s advantage.
Better yet, it means application programmers will be able to avoid CUDA. “CUDA is proprietary,” says an IT researcher at a major financial institution, who asked that his name be withheld. “[Larrabee] is regular x86; now you don’t have to write into a special language to take advantage of it,” he says. “Their proposition is that you can run a GPGPU without having to adopt a new programming paradigm.”
So who will win: Nvidia, whose Tesla chip was first to market and is now present in hundreds of thousands of laptops, or Intel, whose Larrabee has so many admirable attributes that won’t be real for another 12 months?
IEEE Spectrum has spoken to people inside and outside Intel, on and off the record, and we believe that Larrabee is a winner—because of the one feature that it alone offers: C++ programmability. That really handicaps the race. After all, once Larrabee and its software development environment are out, the tools will be ones that programmers are already familiar with. That means they’ll write software to run on x86, which will enable lots of third-party organizations to write more software and sell it to other developers. Lather, rinse, repeat.
Intel hopes that all the advantages of Larrabee chips—low price, high performance, ease of programming, greater flexibility—will lure in enough game developers to establish some momentum in the market, attracting a good portion of the many other x86 developers who are out there.
“Some people think that’s our secret agenda,” says Stephen Junkins, lead software architect for Larrabee’s software tools and technologies. “It’s not much of a secret. We want to get it to smart people and let them pound stuff out on it.”
About the Author
BRIAN SANTO, the editor of CED magazine and a former Spectrum staffer, coauthored “Multicore Made Simple.” “Whenever Intel chooses to enter a market, there’s enormous potential for them to create fundamental changes in that market,” Santo says, explaining why Intel’s Larrabee chip was named this year’s semiconductor winner.
Snapshot: Cores Aplenty
Goal: To dominate the general-purpose graphics-processing unit (GPGPU) market.
Why it’s a winner: It uses the x86 architecture, which every software developer can already program using C and C++, instead of relying on some arcane proprietary language.
Who: Intel Corp.
Where: Teams are distributed, but the base is in Hillsboro, Ore.
Staff: “Loads,” according to a secretive spokesman
Budget: “Enough,” according to the same secretive spokesman
When: Prototypes and test chips available late 2008; scheduled for commercialization by late 2009 or early 2010