Expressway To Your Skull
PlayStation 3’s ability to blast data between chips is one of the secrets to a mind-bending gaming experience
If you believe the prerelease hype, Sony’s PlayStation 3 is the machine that is going to change the way we experience games. This past May, gamers got a taste of some much-awaited PS3 titles during the E3 conference in Los Angeles. In Resistance: Fall of Man, a first-person shooter set in a devastated England overrun by creepy creatures, bullets zip and thunk with stunning clarity, and blood sprays with gruesome realism. In Heavenly Sword, tables and bodies fly as in a martial arts movie while you face enemy squads controlled by artificial intelligence algorithms. And in Gran Turismo HD, a dozen racing cars speed and skid through the streets of Tokyo or on a dusty rally circuit that has the Grand Canyon as a backdrop.
Sony Corp., in Tokyo, has a lot staked on the success of the PS3—hundreds of millions of dollars, at least, and maybe even its future as the preeminent maker of consumer electronics. “Gamers are expecting a great deal from the PS3, because Sony has promised a lot,” says Brian O’Rourke, an analyst at market research firm In-Stat, in Scottsdale, Ariz. “More realism, good online experience, new and innovative games are probably the primary expectations from gamers.” The console, after one big delay, is supposed to go on sale in Japan on 11 November, and in the United States and Europe on 17 November.
Given the stakes involved, the press has lavished considerable coverage on many of the PS3’s cutting-edge technologies, including the console’s main brains, the Cell microprocessor, which Sony developed with Toshiba and IBM; the Blu-ray high-definition DVD system; and the game machine’s graphics processor chip, from Nvidia, which will be responsible for the promised photorealistic graphics.
But one crucial technology set to debut in the new console has received scant attention: the data-transfer connections, or buses, that link the Cell processor to both the console’s main memory and the graphics processor. Chip-to-chip connections may not seem like the most glamorous technology, but they are every bit as important as the PS3’s other advances, because without them the console’s chips would slow to a crawl.
In fact, the immersive experience Sony is aiming for depends on data flowing to and from the Cell processor at speeds way beyond anything achieved in a home-electronics system. The bus between the Cell and the PS3’s memory will achieve a peak data-transfer rate, or bandwidth, of 25.6 gigabytes per second. That’s about five standard DVDs per second—more than double what a high-end PC equipped with today’s fastest memory system can deliver. Meanwhile, the bus connecting the Cell to the graphics chip will move data at 35 GB/s, or about five to 10 times what you can get with today’s best PC-bus technology.
The console’s connections were developed by Rambus Inc., in Los Altos, Calif. Rambus has only recently begun to position itself as a chip-to-chip and board-to-board connection company. Most people still think that DRAM, or dynamic random-access memory—the most widely used type of memory—is the company’s main focus. But despite some interesting technology and initial support from Intel, Rambus’s DRAM didn’t win the PC market. It did, however, make it into two gaming systems: the Nintendo 64, released in 1996, and the PlayStation 2, which has already sold over 100 million units since its launch in 2000.
“Since the performance of the gaming sector is pushing the envelope [of computing], and Rambus is about pushing the envelope in those chip-to-chip connections, I think the gaming sector is a very good match for them,” says Michael Cohen, director of research at Pacific American Securities LLC, in San Diego. (Cohen personally owns shares of Rambus, but Pacific American has no financial stake in the company.)
System designers have long been warning that the performance of next-generation computer systems could be limited by the bandwidth among their key chips. So the need for speedier buses is acute. They could benefit not just future game systems and PCs but workstations, servers, high-definition TVs, and even supercomputers. Rambus and others see these interface technologies as a potential cash cow.
But creating superfast buses “is harder than most people realize,” says Steven Woo, a senior principal engineer at Rambus. “Basically, faster data rates require higher frequencies, and as the frequencies go up, it becomes increasingly difficult to maintain the integrity of the electrical signals and keep them synchronized.” To deal with those issues, the PS3’s buses rely on a signaling technique to minimize interference and attenuation, while a timing mechanism compensates for time-control discrepancies. Those features, Rambus says, will keep the bits flying in the PS3.
Switch on a game console and chunks of data—say, the first level of the game, with its scenarios, characters, weapons, and so on—flow from their permanent storage, normally a CD or a DVD, into the system’s main memory.
Based on what it sees in the memory, and by continually evaluating how characters and objects interact, the central processing unit then orchestrates how the game evolves. The CPU is constantly performing a massive number of calculations required to generate basic actions, like moving characters and changing the scenery, as well as advanced effects, like simulating rigid body dynamics, gravity, and other physical events.
The realism of today’s games, though, demands far more number crunching than the CPU alone can deliver. That’s where the graphics processing unit, or GPU, comes in. Every time an object has to be rendered on screen, the CPU sends information about that object to the GPU, which then performs more specialized types of calculations to create the volume, motion, and lighting of the object.
But despite churning through billions of floating-point math operations per second, or flops, today’s gaming systems and PCs still can’t deliver the realism that game developers seek. CPUs, memories, and GPUs just aren’t powerful enough—or can’t exchange data fast enough—to handle the complexity and richness of the games designers want to create. In other words, hardware constraints force them to reduce the number of objects in scenes, their motion and texture, and the quality of special effects.
The need for speed becomes even more critical in the PS3, whose Cell processor is actually nine processors on a single silicon die. In the Cell, one processor, or core, divides up work for the other eight cores, which were designed to stream through computation-intensive workloads like video processing, content decryption, and physics calculations. [For more on the Cell chip, see "Multimedia Monster," IEEE Spectrum, January 2006.] Using all its cores, the 3.2-gigahertz Cell processor can deliver a whopping 192 billion single-precision flops. Without a speedy connection to the PS3’s memory, the Cell starves for data.
To speed up data transfers between the Cell processor and its memory chips, the PS3’s designers adopted a novel memory system architecture that, Rambus says, addresses some of the limitations of current DRAMs [see “How the PlayStation 3 Shuttles Bits”]. To understand how these limitations came about, consider first the co-evolution of microprocessors and memory.
Moore’s Law tells us that transistor densities on chips are doubling every 18 months or so. This evolution has been accompanied by a doubling, on a similar time scale, in the clock rates of processor chips, basically because smaller transistors can toggle on and off faster. But memory clock rates, which serve as an indicator of memory data-transfer rates, are doubling much more slowly, about every 10 years. The result is that memory can’t feed data to the processor fast enough, a bandwidth bottleneck that has grown steadily tighter over the past few decades.
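To get a feel for how quickly the two curves diverge, here is a rough back-of-the-envelope sketch. It assumes clean exponential growth with the doubling periods quoted above (18 months for processor clocks, 10 years for memory clocks); the function name and the chosen time spans are illustrative only, and real products never followed such tidy curves.

```python
def growth_factor(years, doubling_period_years):
    """Multiplicative growth after `years`, given a fixed doubling period."""
    return 2.0 ** (years / doubling_period_years)

# Doubling periods taken from the text; treating them as exact is an assumption.
PROCESSOR_DOUBLING_YEARS = 1.5   # "every 18 months or so"
MEMORY_DOUBLING_YEARS = 10.0     # "about every 10 years"

for years in (5, 10, 20):
    cpu = growth_factor(years, PROCESSOR_DOUBLING_YEARS)
    mem = growth_factor(years, MEMORY_DOUBLING_YEARS)
    print(f"{years:2d} years: processor x{cpu:,.0f}, "
          f"memory x{mem:.1f}, gap x{cpu / mem:,.0f}")
```

Under these assumptions, a single decade is enough to open a gap of roughly fiftyfold between processor and memory clock rates.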
The bandwidth gap is just part of the problem. The other part is related to latency, the time it takes the memory to produce a chunk of data requested by the processor. This delay can vary from tens to hundreds of nanoseconds. That may not seem like much, but in a mere 50 nanoseconds a 3.8-GHz processor can go through 190 clock cycles. “You don’t want the processor waiting for that long,” says Brian T. Davis, a professor of electrical and computer engineering at Michigan Technological University, in Houghton. The latency problem prompted chip makers years ago to embed some fast memory caches directly onto CPU chips, as well as to concoct some processing tricks to keep the wait for data as short as possible. Despite these improvements, modern CPUs can spend more than half their time (and often much more, Davis notes) just waiting for data to come from memory.
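The cycle count above is simple arithmetic: nanoseconds multiplied by gigahertz gives clock cycles. A minimal sketch, using the 50-nanosecond and 3.8-GHz figures from the text and, for comparison, the Cell's 3.2-GHz clock (the function name is made up for illustration):

```python
def stalled_cycles(latency_ns, clock_ghz):
    """Clock cycles that elapse while the processor waits `latency_ns`.

    1 GHz is one cycle per nanosecond, so the product is a cycle count.
    """
    return latency_ns * clock_ghz

print(stalled_cycles(50, 3.8))   # the 3.8-GHz example from the text
print(stalled_cycles(50, 3.2))   # the same wait on a 3.2-GHz Cell
```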
The growing gap between processor and memory performance isn’t anything you’d notice while reading e-mail, typing a report, or listening to music on your PC. But in high-performance systems like servers or demanding applications like three-dimensional games, this bottleneck becomes the system’s main performance limiter. To tackle the problem, one key design shift has been to integrate the memory controller into the processor. In most PCs today, processor and memory communicate via an intermediate chip—the memory controller—an additional step that adds latency. The PS3 has its memory controller in the Cell. (AMD’s Opteron was one of the first general-purpose processors to include an on-die memory controller.) The integration can cut memory latency roughly in half, and it lets the processor take advantage of the memory’s full speed.
But achieving any further reductions in latency would be difficult, requiring redesigns of the DRAM’s bit-storage and bit-retrieving mechanisms. So engineers are instead concentrating on maximizing memory bandwidth. “The way that the memory interfaces have changed, it really has been about bursting wider and wider chunks of data across a channel,” says Graham Allan, a director of marketing at Mosaid Technologies Inc., in Kanata, Ont., Canada, which develops memory controllers and chip-to-chip interfaces.
Two ways of boosting memory bandwidth are to widen the data bus between processor and memory and to increase the rate at which bits are conveyed. It’s like adding lanes to a superhighway while also raising the speed limit. But widening the data bus requires adding more pins to the chips and more wires—the copper traces on a printed circuit board—to connect the chips. That extra circuitry pushes up costs, a highly undesirable outcome in today’s price-sensitive market for PCs and game consoles.
So Rambus focused instead on increasing bandwidth by transferring data at a higher rate. The result is a kind of turbocharged DRAM that Rambus named XDR, or eXtreme Data Rate, DRAM. Its core, where the bits are stored, looks like that of an ordinary DRAM: a matrix of transistors and capacitors. Its clock, however, runs much faster than those of conventional DRAMs. XDR chips do that by taking a 400-megahertz clock—used by some circuits in the PS3—and multiplying its speed by a factor of four, to 1.6 GHz.
In addition, XDR memory transfers data on both clock edges—the instant a bit can be sent during a clock cycle—which effectively doubles the frequency to 3.2 GHz. Given that each XDR unit has a 2-byte-wide bus, each memory chip can thus exchange data at 6.4 GB/s with the Cell processor. The PS3 has four XDR chips, bringing the console’s total bandwidth to 25.6 GB/s. By comparison, the PlayStation 2, with two DRAMs, achieves 3.2 GB/s.
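The bandwidth figures follow directly from the numbers quoted above. The sketch below simply multiplies them out; the constant names are illustrative, and the 4.7-gigabyte capacity used for the DVD comparison is an assumption (a single-layer disc), not a figure from Rambus or Sony.

```python
BASE_CLOCK_HZ = 400_000_000   # 400-MHz reference clock
CLOCK_MULTIPLIER = 4          # multiplied up to 1.6 GHz
TRANSFERS_PER_CYCLE = 2       # data moves on both clock edges
BUS_WIDTH_BYTES = 2           # each XDR chip has a 2-byte-wide bus
NUM_CHIPS = 4                 # the PS3 carries four XDR chips

# Effective transfer rate per pin pair: 3.2 billion transfers per second.
transfers_per_s = BASE_CLOCK_HZ * CLOCK_MULTIPLIER * TRANSFERS_PER_CYCLE

per_chip_bytes_per_s = transfers_per_s * BUS_WIDTH_BYTES
total_bytes_per_s = per_chip_bytes_per_s * NUM_CHIPS

DVD_BYTES = 4.7e9  # assumed single-layer DVD capacity
dvds_per_second = total_bytes_per_s / DVD_BYTES

print(f"per chip: {per_chip_bytes_per_s / 1e9:.1f} GB/s")
print(f"total:    {total_bytes_per_s / 1e9:.1f} GB/s")
print(f"roughly {dvds_per_second:.1f} DVDs per second")
```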
To boost bus performance to such blistering speeds, however, you have to overcome a fundamental engineering challenge: making sure the bits arrive in good shape at the other end.
“It basically comes down to signal integrity,” says Jason D. Bakos, a computer science and engineering professor at the University of South Carolina, in Columbia. He says that at frequencies of 1 GHz and higher, a number of problems crop up as your signals whiz along the copper traces on a standard epoxy-based printed circuit board—the kind of board used in most PCs and game consoles. Crosstalk from neighboring traces distorts signals, and electrical variations along the interconnects degrade the transmission even more. In addition, the capacitance of the traces transforms them into low-pass filters, which means that low-frequency components of the signals go through while high-frequency components—the ones you need to convey your high-frequency signal—are attenuated.
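The low-pass effect can be illustrated with the simplest possible model, a first-order RC filter. This is only a sketch: a real trace behaves as a transmission line with additional loss mechanisms, and the 50-ohm and 2-picofarad values below are assumed purely for illustration, not measured from any PS3 board.

```python
import math

def cutoff_hz(resistance_ohm, capacitance_f):
    """Cutoff frequency of a first-order RC low-pass filter."""
    return 1.0 / (2.0 * math.pi * resistance_ohm * capacitance_f)

def gain(frequency_hz, fc_hz):
    """Fraction of the input amplitude that survives at `frequency_hz`."""
    return 1.0 / math.sqrt(1.0 + (frequency_hz / fc_hz) ** 2)

# Assumed, illustrative values for the driver resistance and trace capacitance.
fc = cutoff_hz(resistance_ohm=50.0, capacitance_f=2e-12)

for f in (100e6, 1e9, 3.2e9):
    print(f"{f / 1e9:4.1f} GHz: {gain(f, fc):.2f} of the original amplitude")
```

With these assumed values, a 100-MHz component passes almost untouched while a 3.2-GHz component loses more than half its amplitude, which is the behavior Bakos describes.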
To solve these problems, Rambus adopted a data-transmission technique known as differential signaling, to send bits between the memory and the processor. Differential signaling had been used before in the control signaling of DRAMs—as well as in other common data-transfer interfaces such as FireWire—but this is the first time the technique is being used for data signaling in DRAMs.
In conventional DRAM designs, one wire is used for each bit of data sent. If you’ve got a 64-bit-wide memory bus, you need 64 wires. To send a bit representing a 1, the memory drives the wire to a predetermined high voltage, say 1.8 volts. To send a 0, the voltage gets dropped down to a predetermined low voltage, say 0 volts. This is called single-ended signaling.
With differential signaling, two wires in parallel send one bit of data. To represent a 1, one wire is driven to a high voltage while the other is driven to the low voltage. To represent a 0, the voltages on the wires are reversed. To know which bit is being transmitted, all you need to do is compare the two signals at the receiver. XDR uses a high voltage of 1.2 V and a low voltage of 1.0 V, a difference of just 200 millivolts. The reduced voltage difference means less power dissipation, which in turn means less noise. Differential signaling strengthens the communication even further because it’s more immune to crosstalk and spurious noise than single-ended signaling. “Because noise affects both wires similarly, the voltage difference between the wires is maintained, and so is data integrity,” says Rambus’s Woo.
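Woo's point about noise can be shown with a toy simulation. The sketch below uses the 1.2-V and 1.0-V levels from the text and makes two simplifying assumptions: the noise hitting both wires is exactly identical, and the single-ended receiver compares one wire against a fixed midpoint threshold. The function and variable names are hypothetical, and the noise amplitude is deliberately exaggerated.

```python
import random

V_HIGH = 1.2                      # volts, from the text
V_LOW = 1.0                       # volts, from the text
THRESHOLD = (V_HIGH + V_LOW) / 2  # assumed single-ended decision level

def simulate(bits, noise_amplitude, seed=0):
    """Send `bits` over a noisy link; return (single_ended, differential)."""
    rng = random.Random(seed)
    single_ended, differential = [], []
    for bit in bits:
        # Assumption: the same noise voltage couples onto both wires.
        noise = rng.uniform(-noise_amplitude, noise_amplitude)
        wire_a = (V_HIGH if bit else V_LOW) + noise
        wire_b = (V_LOW if bit else V_HIGH) + noise
        # Single-ended receiver: one wire against a fixed threshold.
        single_ended.append(1 if wire_a > THRESHOLD else 0)
        # Differential receiver: compare the two wires with each other.
        differential.append(1 if wire_a > wire_b else 0)
    return single_ended, differential

bit_source = random.Random(1)
bits = [bit_source.randint(0, 1) for _ in range(1000)]
single, diff = simulate(bits, noise_amplitude=0.5)

print("single-ended errors:", sum(a != b for a, b in zip(single, bits)))
print("differential errors:", sum(a != b for a, b in zip(diff, bits)))
```

Because the receiver looks only at the difference between the wires, noise common to both cancels out, while the single-ended receiver misreads any bit whose noise pushes it across the threshold.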
So why is it that using two wires for each bit doesn’t mean doubling the number of pins in the XDR chips and the number of traces on the board? Because the data rate of a pair of XDR pins is more than double that of a single pin in a conventional PC memory: 3.2 gigabits per second per pin pair compared with 533 to 667 megabits per second per pin. In other words, the data rate acceleration more than compensates for the use of pairs of pins.
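A quick count makes the trade-off concrete. The sketch below asks how many data wires each approach would need to reach the PS3's 25.6 GB/s, using the per-pin rates quoted above. It counts data wires only (no clock, address, or command lines), and the helper name is invented for illustration.

```python
TARGET_MBIT_PER_S = 25_600 * 8   # 25.6 GB/s expressed in megabits per second

def wires_needed(rate_mbit_per_s, wires_per_bit_lane):
    """Data wires required to reach the target at a given per-lane rate."""
    lanes = -(-TARGET_MBIT_PER_S // rate_mbit_per_s)  # integer ceiling
    return lanes * wires_per_bit_lane

xdr_wires = wires_needed(3_200, wires_per_bit_lane=2)      # differential pairs
fast_pc_wires = wires_needed(667, wires_per_bit_lane=1)    # single-ended
slow_pc_wires = wires_needed(533, wires_per_bit_lane=1)    # single-ended

print("XDR (differential):    ", xdr_wires)
print("conventional, 667 Mb/s:", fast_pc_wires)
print("conventional, 533 Mb/s:", slow_pc_wires)
```

Even though every XDR bit lane costs two wires, the total comes to 128 data wires, well under half of what the single-ended alternatives would need at these rates.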
After solving the noise problem, Rambus had to deal with yet another issue that becomes critical when you break the multigigabit-per-second barrier: synchronizing the signals.
In a standard circuit board, signals travel through traces at around half the speed of light. At such speeds, differences of millimeters in the length of traces mean that signals sent at the same time will not show up together at the receiver, says Tony Chan Carusone, a professor of electrical and computer engineering at the University of Toronto. Such a timing discrepancy, he says, becomes a major problem if signals carrying not only data but also clock, address, and command information are not all in sync.
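To see why millimeters matter, compare the extra flight time of a longer trace with the time one bit occupies on the wire. The sketch assumes a propagation speed of exactly half the speed of light and the 3.2-gigabit-per-second rate of the XDR interface; the mismatch lengths are arbitrary examples.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458
TRACE_SPEED_M_PER_S = 0.5 * SPEED_OF_LIGHT_M_PER_S  # "around half", assumed exact
BIT_RATE_PER_S = 3.2e9                              # XDR rate per pin pair

def skew_ps(length_mismatch_mm):
    """Extra flight time, in picoseconds, caused by a longer trace."""
    return (length_mismatch_mm * 1e-3) / TRACE_SPEED_M_PER_S * 1e12

bit_time_ps = 1e12 / BIT_RATE_PER_S

for mm in (1, 5, 10):
    skew = skew_ps(mm)
    print(f"{mm:2d} mm mismatch: {skew:5.1f} ps, "
          f"{100 * skew / bit_time_ps:4.1f}% of a bit time")
```

At this data rate a bit lasts only 312.5 picoseconds, so under these assumptions a 10-millimeter mismatch already eats about a fifth of it.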
In traditional buses, traces are matched in length by adding zigzags to some of them; the zigzags precisely delay signals so that all of them arrive at the same time. The main drawback with this approach, called trace-length matching, is that it can make the board designer’s job painstakingly complex, and even with traces of the same length, manufacturing variations can create unexpected delays.
To solve the timing problem, Rambus put some “smarts” on the memory interface embedded on the Cell processor. The smarts consist of a bunch of programmable delays. With each wire, you can actually delay when it sends a bit out to the DRAM and when the data comes back from the DRAM. Rambus dubbed this technique FlexPhase. The company is not forthcoming with the nitty-gritty of how it works, but this much is known: when the PS3 is powered up, a special patch of circuitry inside the Cell’s memory interface transmits data across the bus, measuring the flight time for each bit to determine the length of each trace. This information in turn determines each wire’s delays. The technique compensates for trace length differences and also for manufacturing, temperature, and voltage variations that may affect signal propagation time. It’s a particularly useful trick in a mass-produced system that needs to be low-cost and will work in places that are not exactly the air-conditioned environment of a server room. Plus, it saves on the real estate of the board, possibly reducing its cost, because you don’t have traces squiggling all over the place.
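Since Rambus has not published the details, the following is only a guess at the general idea, not the actual FlexPhase algorithm. It assumes the calibration step yields one measured flight time per wire and that each wire's delay can be programmed in fixed steps; the step size, the sample measurements, and the function name are all hypothetical.

```python
def compute_delays(flight_times_ps, step_ps=5):
    """Pick a per-wire delay so every bit arrives with the slowest wire.

    Delays are rounded to the nearest multiple of `step_ps`, standing in
    for the finite resolution of a programmable delay element (assumed).
    """
    slowest = max(flight_times_ps)
    return [round((slowest - t) / step_ps) * step_ps for t in flight_times_ps]

# Hypothetical flight times measured at power-up, one per wire.
measured_ps = [412, 430, 405, 421]

delays_ps = compute_delays(measured_ps)
arrivals_ps = [t + d for t, d in zip(measured_ps, delays_ps)]

print("delays:   ", delays_ps)
print("arrivals: ", arrivals_ps)
print("residual skew:", max(arrivals_ps) - min(arrivals_ps), "ps")
```

In this toy case a 25-picosecond spread between the fastest and slowest wires shrinks to 2 picoseconds, with no zigzag traces required. Rerunning the measurement would also absorb drift from temperature or voltage, which fixed trace lengths cannot do.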
Some DRAM experts, however, say that FlexPhase is a nice feature, but it may not be necessary after all. Standard DRAMs, they say, use a timing reference signal sent with each byte of data to keep everything in sync, and this method should work just fine as DRAM chips enter the multigigahertz range. Whether that will be the case or not, it’s too early to know, but one thing is for sure: timing issues will play a key role in future memory systems.
The same signaling tricks used in the PS3’s DRAM—namely, differential signaling and programmable delays—are also used to speed up the bus linking the Cell processor and the GPU. Developed by Nvidia Corp., in Santa Clara, Calif., the PS3’s graphics processor, called RSX, is a 300-million-transistor, 550-MHz chip capable of performing 2 trillion flops. It has its own 256 megabytes of GDDR3, a graphics-specific type of DRAM.
Recall that the GPU takes orders from the CPU to render images; that’s why it is so important that the communications bandwidth between them be as big as possible. The GPU receives information about the different objects and then maps colors and textures onto them. In 3-D games, the surface of each object is composed of a huge set of tiny triangles. The more triangles, the finer the resolution and the more real the image looks. Some games now have up to 10 million triangles in a given frame.
In the PS3, the Cell and the RSX are connected by a Rambus interface technology, which, sure enough, Rambus has given a catchy name—FlexIO. The total bus width between the two chips is 7 bytes: 4 bytes to move data from the Cell to the RSX, 3 to move data in the opposite direction. This setting gives a bandwidth of 20 GB/s outbound from the Cell and 15 GB/s inbound—almost 10 times as fast as PCI Express, an interface standard popular in today’s PCs. Thanks to FlexIO, the Cell processor can fling an incredibly large number of triangles to the RSX, and the result is more details and more objects in 3-D games.
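The FlexIO numbers hang together the same way the memory numbers do. Dividing the quoted 20 GB/s by the 4 outbound byte lanes gives 5 GB/s per lane; that per-lane rate is derived here from the figures in the text rather than taken from a Rambus specification, and the names are illustrative.

```python
OUTBOUND_LANES = 4   # byte lanes from the Cell to the RSX
INBOUND_LANES = 3    # byte lanes from the RSX back to the Cell

# Derived from the text: 20 GB/s over 4 byte lanes.
LANE_RATE_GB_PER_S = 20 / OUTBOUND_LANES

outbound_gb_per_s = OUTBOUND_LANES * LANE_RATE_GB_PER_S
inbound_gb_per_s = INBOUND_LANES * LANE_RATE_GB_PER_S
total_gb_per_s = outbound_gb_per_s + inbound_gb_per_s

print(f"per byte lane: {LANE_RATE_GB_PER_S:.0f} GB/s")
print(f"outbound: {outbound_gb_per_s:.0f} GB/s, inbound: {inbound_gb_per_s:.0f} GB/s")
print(f"total:    {total_gb_per_s:.0f} GB/s")
```

The same per-lane rate reproduces the 15 GB/s inbound figure, and the two directions together give the 35 GB/s total cited for the Cell-to-GPU bus.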
The FlexIO interface could also be used to connect two or more Cell processors, a configuration that could be used in high-performance computers and other advanced systems. In fact, FlexIO joins a zoo of other interface technologies aimed at chip-to-chip and board-to-board connections, including the AMD-backed HyperTransport and RapidIO, supported by Freescale Semiconductor, Lucent, EMC, and others. Demand for such interfaces should increase as systems manufacturers add multiple processors to future electronic devices.
Future gaming consoles will continue to demand ever-faster buses, but how much bandwidth is enough will vary from system to system. For instance, one of PlayStation 3’s main competitors, Microsoft’s Xbox 360, released last year, relies on a CPU-GPU bus with a bandwidth of 21.6 GB/s, half in each direction. It’s a proprietary interface developed by IBM that runs at 5.4 GHz and relies on differential signaling to maintain signal integrity. It may not be as fast as PS3’s, but Xbox 360 owners don’t seem disappointed.
In fact, just because PS3 has more powerful data pipes, that doesn’t mean its games will easily get the full benefit from them. As in any other computer system, software, not just hardware, matters. Game developers will have to design their code carefully to make sure that the Cell is getting the types of workloads for which it works best, and that data are streaming smoothly between processor, memory, and GPU.
Assuming that the PS3’s software matches the promise of its hardware, the much faster data flows in the PS3 are a significant leap toward the holy grail of immersive photorealistic environments, bringing us a bit closer to the day that games become so absorbing they will evolve into portals where the boundaries of human and machine realities blur into a finger-twitching frenzy. But that’s years, if not decades, away. In the meantime, the best way of grasping the cutting edge of data-bus design will be to grab a joystick and blow those slimy monsters to kingdom come.
To Probe Further
For more on memory bandwidth and trends, see https://www.cs.virginia.edu/stream and https://www.research.ibm.com/ismm04/slides/woo.pdf.
For more on the architecture of the Microsoft Xbox 360, see https://www-128.ibm.com/developerworks/power/library/pa-fpfxbox.