Chinese Chip Wins Energy-Efficiency Crown

Though slower than competitors, the energy-saving Godson-3B is destined for the next Chinese supercomputer

The Dawning 6000 supercomputer, which Chinese researchers expect to unveil in the third quarter of 2011, will have something quite different under its hood. Unlike its forerunners, which employed American-born chips, this machine will harness the country's homegrown high-end processor, the Godson-3B. With a peak frequency of 1.05 gigahertz, the Godson is slower than its competitors' wares, at least one of which operates at more than 5 GHz, but the chip still turns heads with its record-breaking energy efficiency. It can execute 128 billion floating-point operations per second using just 40 watts—double or more the performance per watt of competitors.
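The efficiency claim reduces to simple arithmetic on the two figures quoted above. The short sketch below works it out; the variable names are illustrative, and the "implied ceiling" is only what "double or more" would mean if taken literally, not a measured figure for any competing chip.

```python
# Figures quoted in the article for the Godson-3B.
peak_gflops = 128.0   # billion floating-point operations per second
power_watts = 40.0    # power drawn at that rate

gflops_per_watt = peak_gflops / power_watts
print(f"Godson-3B: {gflops_per_watt:.1f} gigaflops per watt")

# ASSUMPTION: reading "double or more" literally, competitors would sit at
# or below half of the Godson figure. This is an inference, not a measurement.
implied_competitor_ceiling = gflops_per_watt / 2
print(f"Implied competitor ceiling: {implied_competitor_ceiling:.1f} gigaflops per watt")
```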

The Godson has an eccentric interconnect structure—for relaying messages among multiple processor cores—that also garners attention. While Intel and IBM are commercializing chips that will shuttle communications between cores merry-go-round style on a "ring interconnect," the Godson connects cores using a modified version of the gridlike interconnect system called a mesh network. The processor's designers, led by Weiwu Hu at the Chinese Academy of Sciences, in Beijing, seem to be placing their bets on a new kind of layout for future high-end computer processors.

A mesh design goes hand in hand with saving energy, says Matthew Mattina, chief architect at the San Jose, Calif.–based Tilera Corp., a chipmaker now shipping 36- and 64-core processors using on-chip mesh interconnects.

Imagine a ring interconnect as a traffic roundabout. Getting to some exits requires you to drive almost all the way around the circle. Traveling away from your destination before getting there, says Mattina, requires more transistor switching and therefore consumes more energy. A mesh network is more like a city's crisscrossed streets. "In a mesh, you always traverse the minimum amount of wire—you're never going the wrong way," he says.
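The roundabout analogy can be made concrete by counting hops between stops. The sketch below is a simplified model, not a description of any shipping chip: it assumes cores are numbered consecutively, treats every link as equal in length, and uses hop count as a rough stand-in for wire traversed. The function names are invented for illustration.

```python
def ring_hops_one_way(src, dst, n):
    """Hops on a ring of n stops where traffic flows in one direction only."""
    return (dst - src) % n

def ring_hops_two_way(src, dst, n):
    """Hops on a bidirectional ring: take the shorter way around."""
    d = (dst - src) % n
    return min(d, n - d)

def mesh_hops(src, dst, width):
    """Hops on a grid `width` cores wide, with cores numbered row by row.

    A message moves along rows and columns only, so the count is the
    Manhattan distance between the two grid positions.
    """
    src_row, src_col = divmod(src, width)
    dst_row, dst_col = divmod(dst, width)
    return abs(src_row - dst_row) + abs(src_col - dst_col)

n = 16  # hypothetical core count, chosen only for the example
print(ring_hops_one_way(1, 0, n))   # the "previous" exit on a one-way roundabout
print(ring_hops_two_way(0, 8, n))   # farthest stop on a two-way ring
print(mesh_hops(0, 15, 4))          # corner to opposite corner of a 4 x 4 grid
```

In this toy model, reaching the stop just behind you on a one-way 16-stop ring takes 15 hops, while the longest trip on a 4 x 4 grid takes 6.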

On the 8-core Godson chip, 4 cores form a tightly bound unit—each core sits on a corner of a square of interconnects, as in a usual mesh. Godson researchers have also connected each corner to its opposite, using a pair of diagonal interconnects to form an X through the square's center. A "crossbar" interconnect then serves as an overpass, linking this 4-core neighborhood to a similar 4-core setup nearby.
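One way to see what the diagonals buy is to model the layout as a graph and count hops. The sketch below follows the description above, with one loud assumption: the crossbar is represented as a single extra node, here called "xbar", that every core in both clusters can reach in one hop. The article does not say how the crossbar is wired, so that part is a guess made for illustration, and the core numbering is arbitrary.

```python
from collections import deque
from itertools import combinations

def build_godson_like_links():
    """Return the set of links for an 8-core layout like the one described."""
    links = set()
    clusters = [[0, 1, 2, 3], [4, 5, 6, 7]]
    for cluster in clusters:
        # Four sides of the square plus two diagonals: six links, so every
        # pair of cores inside a cluster is directly connected.
        links.update(combinations(cluster, 2))
    # ASSUMPTION: the crossbar is one shared node attached to every core.
    for cluster in clusters:
        for core in cluster:
            links.add((core, "xbar"))
    return links

def hops(links, src, dst):
    """Breadth-first search for the minimum number of links from src to dst."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # unreachable

links = build_godson_like_links()
print(hops(links, 0, 3))  # two cores in the same cluster
print(hops(links, 0, 7))  # cores in different clusters, via the crossbar node
```

Under these assumptions, any two cores in the same cluster are one hop apart, and any two cores in different clusters are two hops apart. A plain square without the diagonals would need two hops to reach the opposite corner.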

Godson developers believe that their modified mesh's scalability will prove a key advantage, as chip designers cram more cores onto future chips. Yunji Chen, a Godson architect, says that competitors' ring interconnects may have trouble squeezing in more than 32 cores.

Indeed, one of the ring's benefits could prove its future liability. Linking new cores to a ring is fairly easy, says K.C. Smith, an emeritus professor of electrical and computer engineering at the University of Toronto. After all, there's only one path to send information—or two in a bidirectional ring. But sharing a common communication path also means that each additional core adds to the length of wire that messages must travel and increases the demand for that path. With a large number of cores, "the timing around this ring just gets out of hand," Smith says. "You can't get service when you need it."

Of course, adding more cores to a mesh also stresses the system. Even with a grid of paths providing multiple communication channels, more cores increase the demand on the network, and more demand makes traveling long distances difficult: Try driving across New York City at rush hour. Still, the bandwidth scaling of a mesh interconnect is superior to that of a ring, Tilera's Mattina says. He notes that the total bandwidth available with a mesh interconnect increases as you add cores, whereas with a ring interconnect it remains constant even as the core count increases. Latency, the time it takes to get a message from one core to another, is also more favorable in a mesh design, Chen says. In a ring interconnect, latency increases linearly with the core count, he says, while in a mesh design it increases with the square root of the number of cores.
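The scaling argument can be sketched with two textbook quantities: the worst-case hop count, as a proxy for latency, and the number of links cut when the network is split into two equal halves, as a proxy for bandwidth. These proxies are assumptions of this sketch, not necessarily the measures Mattina or Chen had in mind. The model assumes a bidirectional ring and a square mesh with an even number of cores per side.

```python
import math

def ring_stats(n):
    """Worst-case hops and bisection links for a bidirectional ring of n cores."""
    return {
        "worst_hops": n // 2,    # farthest stop is halfway around
        "bisection_links": 2,    # cutting a ring in half always severs 2 links
    }

def mesh_stats(n):
    """Same quantities for a square k x k mesh, where n = k * k and k is even."""
    k = math.isqrt(n)
    if k * k != n or k % 2 != 0:
        raise ValueError("this sketch only handles square meshes with an even side")
    return {
        "worst_hops": 2 * (k - 1),  # corner to opposite corner
        "bisection_links": k,       # one link per row (or column) crosses the cut
    }

for n in (16, 64, 256):
    print(n, "ring:", ring_stats(n), "mesh:", mesh_stats(n))
```

In this model, going from 16 to 256 cores raises the ring's worst-case trip from 8 hops to 128 while its bisection stays at 2 links. The mesh's worst case grows from 6 hops to 30, and its bisection grows from 4 links to 16.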

Reid Riedlinger, a principal engineer at Intel, points out that a ring interconnect has its own scalability benefits. Intel's recently unveiled 8-core Poulson design employs a ring not only to add more cores but also to add easy-to-access on-chip memory, or cache. As long as the chip has the power and the space, Riedlinger says, a ring makes it easy to add each core and cache as a module—a move that would require more complicated validity studies and logic modification in a mesh. "Adding the additional ring stop has a very small impact on latency, and the additional cache capacity will provide performance benefits for many applications," he says.

For those who are not building a national supercomputer, Riedlinger also points out that a ring setup is more easily scalable in a different direction. "You might start with an 8-core design," he says, "and then, to suit a different market segment, you might chop 4 cores out of the middle and sell it as a different product."

This article originally appeared in print as "China's Godson Gamble".
