This story was
corrected on 19 February 2008.
8 February 2008—A method for catching and correcting
timing errors in microprocessors could lead to a boost
in processor performance or improved efficiency, say two
teams of researchers that presented their work at the
IEEE International Solid State Circuits Conference
(ISSCC), in San Francisco, this week.
Two groups of engineers, one from Intel, the other
from the University of Michigan and UK-based ARM,
described similar methods for dealing
with timing errors in processors. Such errors can result
from manufacturing-process differences when the chips
are being made, from variations in temperature across a
chip, and from fluctuations in voltage from the power
supply. Traditionally, chips are run at a higher voltage
and a slower clock frequency than needed, which reduces the
likelihood of such errors and provides a safety margin.
The trouble with this approach is that the errors
engineers are trying to avoid are fairly rare, and that
margin could be used for more calculations or to do the
same number of calculations using less power. “We’re
giving up a lot of performance for a rare occurrence,”
says Randy Mooney, director of I/O research at Intel.
David Blaauw, a professor of electrical engineering
and computer science at Michigan and one of the
presenters at ISSCC, estimates that as much as half of
the potential performance of a chip is given over to the
margin of error. “We’re trying to design today as if the
chip were always working in the worst-case condition.”
In fact, he says, the error rate can be as low as
roughly one error per 100 million instructions run.
The researchers’ response to this problem was to run
the processors at lower and lower voltages until they
started getting a significant number of errors. Blaauw
likens the idea to a car on a racetrack: you don’t know
you’re going around the curves as fast as possible until
you’re actually scraping against the guardrails. Both
the Michigan-ARM group and the Intel team took a similar
approach, with some different decisions about trade-offs
in the system. Both used some space on the chips for
extra circuitry that catches the errors, and both then
reran the instructions in which an error took place.
On the chip, next to certain ubiquitous circuits
called flip-flops, they placed similar circuits called
latches. Though both the flip-flops and the latches had
the same input, the data reached the latches a quarter
or a half cycle later. Data is supposed to come into the
flip-flop at a particular phase of the clock signal, but
if there’s an error, it comes in late and the latch
catches it instead. If the bits measured by the
flip-flop and the latch don’t match, a controller knows
there was an error and tells the processor to rerun
whatever instruction was affected.
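The detection step described above can be sketched in a few lines of Python. This is a simplified behavioral model, not the circuitry of either research team: the function names, the half-cycle shadow delay, and the assumption that a rerun always meets timing are all illustrative choices made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    main_value: int    # bit captured by the main flip-flop at the clock edge
    shadow_value: int  # bit captured later by the shadow latch
    error: bool        # True when the two captures disagree


def sample_stage(old_bit, new_bit, arrival_time,
                 clock_edge=1.0, shadow_delay=0.5):
    """Model one flip-flop paired with a delayed latch (hypothetical model).

    arrival_time is when new_bit becomes valid at the shared input,
    measured in clock cycles. Data arriving after the clock edge is
    missed by the flip-flop, which keeps seeing old_bit.
    """
    main = new_bit if arrival_time <= clock_edge else old_bit
    # The latch samples the same input a fraction of a cycle later.
    # Data that arrives even later than this window is missed by both
    # circuits and goes undetected in this simple model.
    shadow = new_bit if arrival_time <= clock_edge + shadow_delay else old_bit
    return StageResult(main, shadow, main != shadow)


def execute(stream):
    """Run (old_bit, new_bit, arrival_time) tuples, rerunning on mismatch."""
    results, replays = [], 0
    for old_bit, new_bit, arrival in stream:
        outcome = sample_stage(old_bit, new_bit, arrival)
        if outcome.error:
            replays += 1
            # Assumption: the rerun meets timing, so the data arrives early.
            outcome = sample_stage(old_bit, new_bit, arrival_time=0.0)
        results.append(outcome.main_value)
    return results, replays
```

In this model, a bit that arrives at 1.3 cycles is missed by the flip-flop but caught by the latch, so the mismatch triggers one rerun. A bit that arrives at 0.8 cycles passes through without any penalty.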
Of course it takes time to detect the error and rerun
the instruction. But that minor drop in performance is
more than compensated for by the performance gained when
all the error-free circuits do their work so much
faster. Blaauw says in his setup, which is known as
Razor II and is a simplified version of a similar scheme
he presented three years ago, detecting and correcting
errors costs about 2.5 percent of the chip’s total power
consumption. But with Razor II, he can run the chip at a
voltage so much lower that it draws 35 percent less
power overall while delivering the same
performance. Intel, on the other hand, kept the power
consumption the same and reported a gain in performance
of between 25 percent and 32 percent. “We work hard for
a few percentage points of improvement, so getting this
much is a lot,” Mooney says.
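A back-of-the-envelope model shows why rare reruns barely dent the gain. It is a rough sketch rather than either team's analysis: it assumes every detected error costs a fixed number of extra cycles, and the 10-cycle penalty used in the test values is a hypothetical figure, not one reported by the researchers.

```python
def net_speedup(clock_gain, error_rate, replay_penalty_cycles):
    """Throughput relative to a conventionally margined chip.

    clock_gain: fractional speed increase from trimming the margin
                (0.25 means 25 percent faster).
    error_rate: detected timing errors per instruction.
    replay_penalty_cycles: extra cycles per rerun (assumed constant).
    """
    average_cycles_per_instruction = 1 + error_rate * replay_penalty_cycles
    return (1 + clock_gain) / average_cycles_per_instruction


def break_even_error_rate(clock_gain, replay_penalty_cycles):
    """Error rate at which rerun overhead cancels the speed gain."""
    return clock_gain / replay_penalty_cycles
```

Under these assumptions, a 25 percent clock gain with a 10-cycle penalty is cancelled out only when about one instruction in 40 has to be rerun. That is many orders of magnitude above the one-in-100-million rate Blaauw cites.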
It could take about three years to develop the setup
for commercial use, although Mooney says that Intel has
no plans for a product based on the technology at the
moment. Intel researchers are trying to implement the
technology using fewer, smaller transistors and reducing
the clock power to make the whole thing more efficient.
The Michigan group, meanwhile, has already reduced the
number of transistors as it moved to the second
generation of its scheme. The group is in discussions with
ARM about using Razor II. Blaauw says
that the prototype was built from
transistors with 130-nanometer features, though
commercial chips are already down to 65 nm. If anything,
he says, the benefits of the system become greater at
smaller sizes, where it’s even harder to get error-free
circuits. “The general trend is that these variations
are getting worse,” he says.