In January’s special Top Tech 2017 issue, I wrote about various efforts to produce custom hardware tailored for performing deep-learning calculations. Prime among those is Google’s Tensor Processing Unit, or TPU, which Google has deployed in its data centers since early in 2015.
In that article, I speculated that the TPU was likely designed for performing what are called “inference” calculations. That is, it’s designed to quickly and efficiently calculate whatever it is that the neural network it’s running was created to do. But that neural network would also have to be “trained,” meaning that its many parameters would be tuned to carry out the desired task. Training a neural network normally takes a different set of computational skills: In particular, training often requires the use of higher-precision arithmetic than does inference.
Yesterday, Google released a fairly detailed description of the TPU and its performance relative to CPUs and GPUs. I was happy to see that the surmise I had made in January was correct: The TPU is built for doing inference, having hardware that operates on 8-bit integers rather than higher-precision floating-point numbers.
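To make the precision trade-off concrete, here is a minimal sketch of how a single neural-network dot product might be carried out with 8-bit integers instead of floating-point numbers. It is an illustration only, not a description of how the TPU works internally: the symmetric scaling scheme, the function names, and the sample values are all assumptions chosen for this example.

```python
# Illustrative sketch only: symmetric 8-bit quantization of one dot product.
# Function names and sample values are hypothetical, not from Google's report.

def quantize(values, num_bits=8):
    """Map floats to signed integers in [-127, 127] using one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 when num_bits is 8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def int8_dot(weights, inputs):
    """Dot product done on 8-bit integers, rescaled to a float at the end."""
    qw, w_scale = quantize(weights)
    qx, x_scale = quantize(inputs)
    acc = sum(w * x for w, x in zip(qw, qx))  # integer-only accumulation
    return acc * w_scale * x_scale


def float_dot(weights, inputs):
    """Reference dot product in ordinary floating point."""
    return sum(w * x for w, x in zip(weights, inputs))


weights = [0.42, -1.30, 0.07, 0.88]  # made-up "trained" parameters
inputs = [1.5, 0.25, -2.0, 0.6]      # made-up activations

exact = float_dot(weights, inputs)
approx = int8_dot(weights, inputs)
print(f"float: {exact:.4f}  int8: {approx:.4f}  error: {abs(exact - approx):.4f}")
```

The integer result lands close to the floating-point one, which is the general reason low-precision arithmetic can be acceptable for inference. Training, by contrast, depends on many small parameter updates that coarse rounding of this kind could swamp.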
Yesterday afternoon, David Patterson, an emeritus professor of computer science at the University of California, Berkeley, and one of the co-authors of the report, presented these findings at a regional seminar of the National Academy of Engineering, held at the Computer History Museum in Mountain View, Calif. The abstract for his talk summed up the main point nicely. It reads in part: “The TPU is an order of magnitude faster than contemporary CPUs and GPUs and its relative performance per watt is even larger.”
Google’s blog post about the release of the report shows how much of a difference in relative performance there can be, particularly in regard to energy efficiency. For example, compared with a contemporary GPU, the TPU is said to offer 83 times the performance per watt. That might be something of an exaggeration, because the report itself claims only a range of 41 to 83 times. And that’s for a quantity the authors call incremental performance. The range of improvement for total performance is smaller: from 14 to 16 times better for the TPU than for a GPU.
The benchmark tests used to reach these conclusions are based on a half dozen of the actual kinds of neural-network programs that people are running at Google data centers. So it’s unlikely that anyone would critique these results on the basis of the tests not reflecting real-world circumstances. But it struck me that a different critique might well be in order.
The problem is this: These researchers are comparing their 8-bit TPU with higher-precision GPUs and CPUs, which are just not well suited to inference calculations. The GPU exemplar Google used in its report is Nvidia’s K80 board, which performs both single-precision (32-bit) and double-precision (64-bit) calculations. While they’re often important for training neural networks, such levels of precision aren’t typically needed for inference.
In my January story, I noted that Nvidia’s newer Pascal family of GPUs can perform “half-precision” (16-bit) operations and speculated that the company may soon produce units fully capable of 8-bit operations, in which case they might be much more efficient when carrying out inference calculations for neural-network programs.
The report’s authors anticipated such a criticism in the final section of their paper; there they considered the assertion (which they label a fallacy) that “CPU and GPU results would be comparable to the TPU if we used them more efficiently or compared to newer versions.” In discussing this point, they say they had tested only one CPU that could support 8-bit calculations, and the TPU was 3.5 times better. But they don’t really address the question of how GPUs tailored for 8-bit calculations would fare, an important question if such GPUs soon become widely available.
Should that come to pass, I hope that these Googlers will re-run their benchmarks and let us know how TPUs and 8-bit-capable GPUs compare.
David Schneider is a senior editor at IEEE Spectrum. His beat focuses on computing, and he contributes frequently to Spectrum's Hands On column. He holds a bachelor's degree in geology from Yale, a master's in engineering from UC Berkeley, and a doctorate in geology from Columbia.