The Era of Error-Tolerant Computing
Errors will abound in future processors...and that's okay
Illustration: Serge Bloch
The computer's perfectionist streak is coming to an end. Speaking at the International Symposium on Low Power Electronics and Design, experts said power consumption concerns are driving computing toward a design philosophy in which errors are either allowed to happen and ignored, or corrected only where necessary. Probabilistic outcomes will replace the deterministic form of data processing that has prevailed for the last half century.
Naresh Shanbhag, a professor in the department of electrical and computer engineering at the University of Illinois at Urbana-Champaign, refers to error-resilient computing (also called probabilistic computing) by the more formal name of stochastic processing. Whatever the name, the approach, Shanbhag says, is not to automatically circle back and correct errors once they are identified, because that consumes power. "If the application is such that small errors can be tolerated, we let them happen," he says. "Depending on the application, we keep error rates under a threshold, using algorithmic or circuit techniques." For many applications such as graphics processing or drawing inferences from huge amounts of data, errors in reasonable numbers do not materially impact the quality of the results. After all, your eye wouldn't even notice the presence of a single bad pixel in most images.
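The bad-pixel intuition is easy to quantify. The following toy sketch (my own illustration, not Shanbhag's method) flips one random bit in a small fraction of 8-bit pixels and measures how little the image changes on average:

```python
import random

def inject_bit_errors(pixels, error_rate, seed=0):
    """Flip one random bit in roughly `error_rate` of the 8-bit pixels."""
    rng = random.Random(seed)
    out = []
    for p in pixels:
        if rng.random() < error_rate:
            p ^= 1 << rng.randrange(8)  # flip one of the 8 bits
        out.append(p)
    return out

# A flat mid-gray "image": 256 x 256 pixels, all value 128.
image = [128] * (256 * 256)
noisy = inject_bit_errors(image, error_rate=1e-3)

corrupted = sum(1 for a, b in zip(image, noisy) if a != b)
mean_err = sum(abs(a - b) for a, b in zip(image, noisy)) / len(image)
print(f"{corrupted} of {len(image)} pixels changed; mean error {mean_err:.3f}")
```

At a one-in-a-thousand error rate, only a few dozen of the 65,536 pixels are touched and the mean per-pixel error is a tiny fraction of one gray level; this is the kind of quality argument that lets the error threshold, rather than perfection, drive the design.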
The newfound attitude toward errors is, in a way, simply a pragmatic nod to a new reality. As transistor dimensions shrink, the minute variations in circuit patterns and in the composition of the silicon itself have a more noticeable effect. In particular, the number of dopant atoms in a transistor—key to its ability to conduct current—is now so small that a few more or fewer make a measurable difference. And the natural roughness of transistor structures is now big enough, compared to the size of the transistors, to impede their function. According to Kevin Nowka, faculty research program manager of the IBM Austin Center for Advanced Studies, designers of future chips will even have to contend with variations in such seemingly uncontrollable characteristics as the size of metal grains in the transistors' gates.
The result of all this variability is that the threshold voltage—the voltage at which the transistor switches from off to on—varies from device to device. And at the low voltages and high clock frequencies needed in today's low-power processors, that kind of variability means mistakes.
Several demonstration processors are in the works. Shanbhag's team is applying stochastic processing to a wireless receiver. They've created an algorithm and stochastic circuitry to implement a filter that consumes far less power than conventional filters and does so at similar error levels.
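One way to picture a filter that trades precision for power is to let arithmetic go wrong only in the low-order bits, where mistakes barely affect the output. The sketch below (a crude stand-in for, not a description of, Shanbhag's circuitry) compares an exact integer FIR filter with a version whose accumulator simply drops its lowest bits:

```python
import random

def fir(signal, taps):
    """Exact integer FIR filter (direct form)."""
    return [sum(t * signal[n - k] for k, t in enumerate(taps) if n - k >= 0)
            for n in range(len(signal))]

def fir_truncated(signal, taps, drop_bits=2):
    """FIR filter whose accumulator discards `drop_bits` low-order bits,
    a toy model of letting low-significance arithmetic err to save power."""
    mask = ~((1 << drop_bits) - 1)
    out = []
    for n in range(len(signal)):
        acc = sum(t * signal[n - k] for k, t in enumerate(taps) if n - k >= 0)
        out.append(acc & mask)  # low bits lost
    return out

rng = random.Random(1)
signal = [rng.randrange(256) for _ in range(64)]  # non-negative 8-bit samples
taps = [1, 2, 1]                                  # simple low-pass kernel
exact = fir(signal, taps)
approx = fir_truncated(signal, taps, drop_bits=2)
worst = max(abs(a - b) for a, b in zip(exact, approx))
print("worst-case output error:", worst)
```

Dropping two bits bounds the per-sample error at 3 counts out of outputs that reach roughly 1,000, so the filter stays within a known error threshold while the hardware that would have computed those bits exactly can be simplified or run at lower voltage.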
A team at Urbana-Champaign led by assistant professor Rakesh Kumar and a group at Stanford headed by assistant professor Subhasish Mitra are developing error-resilient processor architectures. Kumar says his project, called the Variation-Aware Stochastic Computing Organization (VASCO), manages errors through architectural and design techniques. In VASCO, the processor consists of many cores, and each core's reliability can be dynamically configured. The number and nature of the errors allowed by each core is chosen to match the error tolerance characteristics of the application being executed. Overall, the scheme reduces power consumption by allowing "relaxed correctness," he says.
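The matching of cores to applications can be sketched as a scheduling policy. Everything below—the mode table, the numbers, the function name—is a hypothetical illustration of the idea, not VASCO's actual mechanism: each core offers a menu of reliability settings, and the scheduler picks the cheapest one the task can tolerate.

```python
# Hypothetical (error_rate, relative_power) operating modes for one core;
# running at lower voltage saves power but raises the error rate.
CORE_MODES = [
    {"error_rate": 1e-9, "power": 1.00},  # fully reliable
    {"error_rate": 1e-6, "power": 0.75},
    {"error_rate": 1e-3, "power": 0.50},  # "relaxed correctness"
]

def pick_mode(tolerated_error_rate):
    """Choose the lowest-power mode whose error rate the task tolerates;
    fall back to the most reliable mode if none qualifies."""
    ok = [m for m in CORE_MODES if m["error_rate"] <= tolerated_error_rate]
    if ok:
        return min(ok, key=lambda m: m["power"])
    return min(CORE_MODES, key=lambda m: m["error_rate"])

# A pixel-shading task tolerates frequent small errors and gets the
# half-power mode; an address calculation tolerates essentially none.
shading = pick_mode(1e-2)
address = pick_mode(0)
print(shading["power"], address["power"])
```

The design point worth noting is the fallback: a core never silently runs a task at an error rate the task cannot absorb, so correctness-critical work pays full power while tolerant work reaps the savings.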
Jan Rabaey, a researcher at the multiuniversity Gigascale Systems Research Center, based at the University of California, Berkeley, says error-resilient computers are perhaps 6 to 10 years away from being commonplace. But as variability worsens and power consumption continues to dominate the industry's concerns, Rabaey predicts that error-resilient systems will make steady inroads, ranging from exascale supercomputers down to smartphones.
Probabilistic computing is "unavoidable," Rabaey says. "We have to take a holistic look at how to handle errors, particularly as we scale chip dimensions down to levels where variability takes over." It is the only way he knows of to keep Moore's Law going, he says.
This story was corrected on 3 November 2010.