This story was corrected on 19 February 2008.
8 February 2008—A method for catching and correcting timing errors in microprocessors could lead to a boost in processor performance or improved efficiency, say two teams of researchers that presented their work at the IEEE International Solid State Circuits Conference (ISSCC), in San Francisco, this week.
Two groups of engineers, one from Intel, the other from the University of Michigan and UK-based ARM, described similar methods for dealing with timing errors in processors. Such errors can result from variations in the manufacturing process, from differences in temperature across a chip, and from fluctuations in the supply voltage. Traditionally, chips are run at a higher voltage and a lower clock frequency than needed, reducing the likelihood of such errors and providing a safety margin. The trouble with this approach is that the errors engineers are trying to avoid are fairly rare, and that margin could be used for more calculations or to do the same number of calculations using less power. “We’re giving up a lot of performance for a rare occurrence,” says Randy Mooney, director of I/O research at Intel.
David Blaauw, a professor of electrical engineering and computer science at Michigan and one of the presenters at ISSCC, estimates that as much as half of the potential performance of a chip is given over to the margin of error. “We’re trying to design today as if the chip were always working in the worst-case condition.” In fact, he says, the error rate can be as low as roughly one error per 100 million instructions run.
The researchers’ response to this problem was to run the processors at lower and lower voltages until they started getting a significant number of errors. Blaauw likens the idea to a car on a racetrack: you don’t know you’re going around the curves as fast as possible until you’re actually scraping against the guardrails. Both the Michigan-ARM group and the Intel team took a similar approach, with some different decisions about trade-offs in the system. They both used some space on the chips to place extra circuitry that could catch the errors and then reran those instructions where an error took place.
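The search described above can be sketched as a simple loop. This is a minimal illustration, not either team's controller: the function name `find_operating_voltage`, the millivolt step, the error-rate threshold, and the toy error model `toy_error_rate` are all assumptions made for the example.

```python
def find_operating_voltage(measure_error_rate, start_mv, min_mv, step_mv, max_error_rate):
    """Step the supply voltage down until the error rate becomes too high.

    measure_error_rate is a hypothetical callback that returns the fraction
    of instructions that hit a timing error at a given voltage (millivolts).
    Returns the lowest tested voltage whose error rate stayed acceptable.
    """
    voltage = start_mv
    while voltage - step_mv >= min_mv:
        candidate = voltage - step_mv
        if measure_error_rate(candidate) > max_error_rate:
            break  # scraping the guardrail: keep the last good voltage
        voltage = candidate
    return voltage


def toy_error_rate(mv):
    """Made-up model: no errors at 900 mV or above, then a linear rise."""
    return 0.0 if mv >= 900 else (900 - mv) * 1e-4


if __name__ == "__main__":
    chosen = find_operating_voltage(toy_error_rate, 1200, 700, 50, 1e-3)
    print(f"operating voltage: {chosen} mV")
```

Raising `max_error_rate` lets the loop settle at a lower voltage, at the cost of more instructions that have to be rerun.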
On the chip, next to certain ubiquitous circuits called flip-flops, they placed similar circuits called latches. Though both the flip-flops and the latches had the same input, the data reached the latches a quarter or a half cycle later. Data is supposed to come into the flip-flop at a particular phase of the clock signal, but if there’s an error, it comes in late and the latch catches it instead. If the bits measured by the flip-flop and the latch don’t match, a controller knows there was an error and tells the processor to rerun whatever instruction was affected.
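A toy model may make the comparison concrete. It is a sketch under stated assumptions, not the circuit either group built: time is measured in clock cycles, the flip-flop samples at the end of the cycle, the latch samples half a cycle later, and the names `sample` and `count_replays` are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    flip_flop_bit: int
    latch_bit: int

    @property
    def error(self) -> bool:
        # A mismatch means the data arrived after the flip-flop sampled it.
        return self.flip_flop_bit != self.latch_bit


def sample(old_bit, new_bit, arrival_time, clock_period=1.0, latch_delay=0.5):
    """Model one flip-flop and its delayed latch watching the same input.

    arrival_time is when new_bit becomes valid, measured from the start of
    the cycle. latch_delay is the extra wait, as a fraction of a cycle
    (the article mentions a quarter or a half).
    """
    flip_flop_edge = clock_period
    latch_edge = clock_period * (1.0 + latch_delay)
    flip_flop_bit = new_bit if arrival_time <= flip_flop_edge else old_bit
    latch_bit = new_bit if arrival_time <= latch_edge else old_bit
    return SampleResult(flip_flop_bit, latch_bit)


def count_replays(arrival_times, old_bit=0, new_bit=1):
    """Count how many instructions the controller would ask to rerun."""
    return sum(sample(old_bit, new_bit, t).error for t in arrival_times)


if __name__ == "__main__":
    print(sample(0, 1, 0.9).error)   # on time: both circuits agree
    print(sample(0, 1, 1.2).error)   # late: only the latch sees the new bit
    print(count_replays([0.8, 0.95, 1.2, 1.0]))
```

In this simplified model, data that arrives even later than the latch's sampling point goes unnoticed, because both circuits hold the old bit and therefore agree.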
Of course it takes time to detect the error and rerun the instruction. But that minor drop in performance is more than compensated for by the performance gained when all the error-free circuits do their work so much faster. Blaauw says in his setup, which is known as Razor II and is a simplified version of a similar scheme he presented three years ago, detecting and correcting errors costs about 2.5 percent of the chip’s total power consumption. But by using Razor II, he can run the chip at a voltage so much lower that it draws 35 percent less power overall while delivering the same performance. Intel, on the other hand, kept the power consumption the same and reported a gain in performance of between 25 percent and 32 percent. “We work hard for a few percentage points of improvement, so getting this much is a lot,” Mooney says.
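One way to put the two results on a common footing is to express each as performance per unit of power relative to a conventional design. The arithmetic below uses only the percentages quoted above; the helper name `performance_per_watt_gain` is invented for the example, and because the figures come from different chips and test conditions, the comparison is rough at best.

```python
def performance_per_watt_gain(relative_performance, relative_power):
    """Ratio of performance to power, with a conventional chip as 1.0."""
    return relative_performance / relative_power


# Michigan-ARM: same performance, 35 percent less power.
michigan = performance_per_watt_gain(1.0, 1.0 - 0.35)

# Intel: same power, 25 to 32 percent more performance.
intel_low = performance_per_watt_gain(1.25, 1.0)
intel_high = performance_per_watt_gain(1.32, 1.0)

if __name__ == "__main__":
    print(f"Michigan-ARM: {michigan:.2f}x")
    print(f"Intel: {intel_low:.2f}x to {intel_high:.2f}x")
```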
It could take about three years to develop the setup for commercial use, although Mooney says that Intel has no plans for a product based on the technology at the moment. Intel researchers are trying to implement the technology using fewer, smaller transistors and to reduce the clock power, making the whole thing more efficient.
The Michigan group, meanwhile, has already reduced the number of transistors as it moved to the second generation of its scheme. The researchers are in discussions with ARM about using Razor II. Blaauw says that their prototype was done with transistors built using 130-nanometer features, though commercial chips are already down to 65 nm. If anything, he says, the benefits of the system become greater at smaller sizes, where it’s even harder to get error-free circuits. “The general trend is that these variations are getting worse,” he says.