Building high-performance computers used to be all about maximizing flops, or floating-point operations per second. But the engineers designing today’s high-performance systems are keeping a close eye not just on the number of flops but also on flops per watt. Judged by that energy-efficiency metric, some digital signal processing (DSP) chips, the sophisticated signal conditioners that run our wireless networks among other things, might make promising building blocks for future supercomputers, recent research suggests.
The DSPs that might make the jump to supercomputing come from Texas Instruments, which originally designed them for relatively modest applications. “We hadn’t even thought to look at high-performance computing or supercomputers,” says Arnon Friedmann, multicore business manager for TI. “It wasn’t on our radar.” TI’s DSP chips are typically used in embedded systems, most prominently cellular base stations. For such applications, power efficiency is vital, but until recently these systems didn’t require floating-point calculations, making do instead with just integer arithmetic. The advent of 4G cellular networks, however, increased the computing burden within base stations, making floating-point calculations essential.
TI engineers added floating-point hardware to the TMS320C66 family of multicore DSPs late in 2010 without appreciably slowing these processors down or increasing the power consumed. But it was only after the new chips came out that some forward thinkers at TI realized that the eight-core C6678 DSP, which can perform as many as 12.8 gigaflops per watt running flat out, might be useful for general-purpose high-performance computing.
“The question was whether we’d be able to extract that potential in the real world,” says Francisco D. Igual, now a postdoctoral researcher at Universidad Complutense de Madrid. He was working at the University of Texas at Austin with engineering professor Robert A. van de Geijn when TI approached them for help. Collaborating with TI, they wrote code for the new DSP to perform general matrix-matrix multiplication, something they felt would be representative of the kind of numerical weight lifting that high-performance systems are often asked to do.
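For readers unfamiliar with the benchmark, general matrix-matrix multiplication can be sketched in a few lines of plain Python. This is only an illustration of the operation, not the team's DSP code, which is not reproduced here; the function names `gemm` and `gemm_flops` are hypothetical. The flop count follows the usual convention of one multiply and one add per inner-loop step.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a x b) + beta * c for matrices stored as lists of rows.

    Illustrative only: a tuned implementation blocks the loops to fit the
    processor's memory hierarchy, which this sketch does not attempt.
    """
    m, k, n = len(a), len(b), len(b[0])
    # Start from the scaled copy of c, then accumulate the product into it.
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = alpha * a[i][p]
            for j in range(n):
                out[i][j] += a_ip * b[p][j]
    return out


def gemm_flops(m, n, k):
    """Conventional operation count for an (m x k) by (k x n) product."""
    return 2 * m * n * k
```

Dividing `gemm_flops(m, n, k)` by the measured run time gives flops, and dividing that by the power drawn gives the flops-per-watt figure used to compare processors.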
With that code in hand, the team compared the new DSP chip against some common supercomputer architectures. The DSP came out on top, at 7.4 gigaflops per watt. “We were happy, but we were not that surprised,” says Igual.
But not everyone is swayed by those results. “It’s very impressive, but not a fair comparison,” says John Shalf of the National Energy Research Scientific Computing Center at Lawrence Berkeley National Laboratory, in California. Shalf points out that the comparison was done using single-precision (32-bit) arithmetic across the board. That’s because this DSP chip can complete a single-precision operation in one clock cycle, whereas double-precision calculations take four clock cycles and thus use four times as much energy. So the number-crunching circuits in each of the competing systems tested, which are configured for efficient double-precision arithmetic, were at a disadvantage.
Texas Instruments hopes to improve the efficiency of double-precision operations on its multicore DSPs. But the energy used for double-precision calculations would, at the very least, still be double what researchers found in their single-precision tests, meaning these chips would at best be able to perform in the range of 3 to 4 gigaflops (double precision) per watt.
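The arithmetic behind that range can be written out as a short sketch. It assumes, as a simplification, that efficiency scales inversely with the energy spent per operation; the 7.4 figure is the measured single-precision result, and the factors of 2 and 4 are the best-case and four-clock-cycle penalties described above. The function name is hypothetical.

```python
MEASURED_SP_GFLOPS_PER_WATT = 7.4  # single-precision result reported by the team


def dp_gflops_per_watt(sp_gflops_per_watt, energy_ratio):
    """Estimate double-precision efficiency from single-precision efficiency.

    energy_ratio is how many times more energy a double-precision operation
    uses than a single-precision one (an assumed, simplified model).
    """
    return sp_gflops_per_watt / energy_ratio


# Best case: double precision costs only twice the energy.
best_case = dp_gflops_per_watt(MEASURED_SP_GFLOPS_PER_WATT, 2.0)

# Current chip: four clock cycles, so roughly four times the energy.
four_cycle_case = dp_gflops_per_watt(MEASURED_SP_GFLOPS_PER_WATT, 4.0)

print(f"best case: {best_case:.2f} gigaflops per watt")
print(f"four-cycle case: {four_cycle_case:.2f} gigaflops per watt")
```

Under this model the best case comes to 3.7 gigaflops per watt, which is where the 3-to-4 range comes from, while the four-cycle penalty would put the chip closer to 1.85.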
And individual chips do not a supercomputer make, Shalf stresses. Combining many of them in a high-performance computer, with its various electronic subsystems and cooling apparatus, would significantly lower the machine’s overall energy efficiency.
How, then, would it compare with today’s best supercomputers? The current world champion in flops is the Sequoia supercomputer at Lawrence Livermore National Laboratory, in California. With its IBM BlueGene/Q architecture, it performed just over 2 gigaflops (double precision) per watt in recent benchmark tests. That is similar to what you’d expect from a double-precision DSP-based machine, judging by the research results of Igual and his colleagues.
But perhaps this shouldn’t be so surprising, given BlueGene’s ancestry. Two decades ago, TI produced a line of DSPs that, like the company’s new multicore family, contained floating-point hardware. In 1998, physicists from Columbia University, in New York City, ganged thousands of them together to construct a special-purpose supercomputer for performing calculations in quantum chromodynamics, dubbing it QCDSP, for Quantum Chromodynamics on Digital Signal Processors. Texas Instruments later dropped that DSP line, but Alan Gara, one of the three physicists who pioneered the design of the QCDSP, retained some of its lessons when he moved to IBM, where he became the chief architect for the BlueGene supercomputers.
You might guess that this groundbreaking DSP-based supercomputer is now enshrined in a computer museum. But in July, when IEEE Spectrum caught up with Columbia physicist Robert Mawhinney, who worked alongside Gara on that project, he told us that it met a sadder fate: “We threw it out last week, literally.”