AI Training Is Outpacing Moore’s Law

The new set of MLPerf results proves it

[Image: Nvidia systems were on top again in the MLPerf training benchmarks. Credit: Nvidia]

The days and sometimes weeks it took to train AIs only a few years ago were a big reason behind the launch of billions of dollars’ worth of new computing startups over the last few years, including Cerebras Systems, Graphcore, Habana Labs, and SambaNova Systems. Google, Intel, Nvidia, and other established companies, meanwhile, made internal investments (and sometimes acquisitions) on a similar scale. With the newest edition of the MLPerf training benchmark results, there’s clear evidence that the money was worth it.

The gains to AI training performance since MLPerf benchmarks began “managed to dramatically outstrip Moore’s Law,” says David Kanter, executive director of the MLPerf parent organization MLCommons. The increase in transistor density alone would account for a little more than a doubling of performance between the early version of the MLPerf benchmarks and those from June 2021. But improvements to software as well as to processor and computer architecture produced a 6.8- to 11-fold speedup for the best benchmark results. In the newest tests, called version 1.1, the best results improved by up to 2.3 times over those from June.
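
To put numbers on that comparison, here’s a minimal back-of-the-envelope sketch in Python. The 24-month doubling cadence and the roughly 2.5-year window are assumptions for illustration, not MLCommons figures; the 6.8- and 11-fold values are the measured gains cited above.

```python
# Back-of-the-envelope comparison of Moore's-Law scaling with the measured
# MLPerf training speedups. The elapsed time and the 24-month doubling
# period are illustrative assumptions, not MLCommons figures.

years_elapsed = 2.5            # roughly the first MLPerf rounds (late 2018) to June 2021
doubling_period_years = 2.0    # classic Moore's-Law cadence, assumed here

# Gain you'd expect if training speed only tracked transistor density.
moores_law_gain = 2 ** (years_elapsed / doubling_period_years)

# Best measured MLPerf improvements over the same window, per the article.
measured_gains = (6.8, 11.0)

print(f"Density scaling alone: ~{moores_law_gain:.1f}x")
for gain in measured_gains:
    print(f"A measured {gain}x gain is ~{gain / moores_law_gain:.1f}x beyond that")
```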

For the first time, Microsoft entered its Azure cloud AI offerings into MLPerf, muscling through all eight of the test networks using a variety of resources. These ranged in scale from 2 AMD Epyc CPUs and 8 Nvidia A100 GPUs up to 512 CPUs and 2048 GPUs. Scale clearly mattered: the largest configurations trained AIs in less than a minute, while the two-CPU, eight-GPU combination often needed 20 minutes or more.
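
As a quick sketch of what that scaling implies, the arithmetic below uses the article’s round numbers (8 versus 2048 GPUs, about 20 minutes versus roughly a minute); the efficiency figure it prints is illustrative only, not an official MLPerf metric.

```python
# Rough scaling arithmetic for the Azure entries described above. The times
# are the article's round numbers, so the efficiency figure is illustrative only.

small = {"gpus": 8,    "minutes": 20.0}   # 2 Epyc CPUs + 8 A100 GPUs, ~20 minutes
large = {"gpus": 2048, "minutes": 1.0}    # 512 CPUs + 2048 GPUs, under a minute

speedup = small["minutes"] / large["minutes"]
scale_up = large["gpus"] / small["gpus"]
efficiency = speedup / scale_up

print(f"{scale_up:.0f}x more GPUs gave roughly {speedup:.0f}x faster training "
      f"(~{efficiency:.0%} parallel efficiency on these round numbers)")
```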

[Chart: Moore's Law can only do so much. Software and other progress have made the difference in training AIs. Credit: MLCommons]

Nvidia worked closely with Microsoft on the benchmark tests, and, as in previous MLPerf lists, Nvidia GPUs were the AI accelerators behind most of the entries, including those from Dell, Inspur, and Supermicro. Nvidia itself topped all the results for commercially available systems, relying on the unmatched scale of its Selene AI supercomputer. Selene is made up of commercially available modular DGX SuperPod systems. In its most massive effort, Selene brought to bear 1080 AMD Epyc CPUs and 4320 A100 GPUs to train the natural language processor BERT in less than 16 seconds, a feat that took most smaller systems about 20 minutes.

According to Nvidia the performance of systems using A100 GPUs has increased more than 5-fold in the last 18 months and 20-fold since the first MLPerf benchmarks three years ago. That’s thanks to software innovation and improved networks, the company says. (For more, see Nvidia's blog.)

Given Nvidia’s pedigree and performance on these AI benchmarks, it’s natural for new competitors to compare themselves to it. That’s what UK-based Graphcore is doing when it notes that its base computing unit, the Pod16 (1 CPU and 16 IPU accelerators), beats Nvidia’s base unit, the DGX A100 (2 CPUs and 8 GPUs), by nearly a minute.

[Image: Graphcore brought its bigger systems out to play. Credit: Graphcore]

For this edition of MLPerf, Graphcore debuted image classification and natural language processing benchmarks for its combinations of those base units, the Pod64, Pod128, and (you saw this coming, right?) Pod256. The latter, made up of 32 CPUs and 256 IPUs, was the fourth fastest system behind Nvidia’s Selene and Intel’s Habana Gaudi to finish ResNet image classification training, at 3:48. For natural language processing, the Pod256 and Pod128 were third and fourth on the list, again behind Selene, finishing in 6:54 and 10:36. (For more, see Graphcore's blog.)

You might have noticed that the CPU-to-accelerator chip ratios are quite different between Nvidia-based offerings—about 1 to 4—and Graphcore’s systems—as low as 1 to 32. That’s by design, say Graphcore engineers. The IPU is designed to depend less on a CPU’s control when operating neural networks.

You can see the opposite with Habana Labs, which Intel purchased for about US $2 billion in 2019. For example, for its high-ranking training on image classification, Intel used 64 Xeon CPUs and 128 Habana Gaudi accelerators to train ResNet in less than 5:30. It used 32 CPUs and 64 accelerators to train the BERT natural language neural net in 11:52. (For more, see Habana's blog.)

Google’s contribution to this batch of benchmark scores was a bit unusual. Rather than demonstrate commercial or cloud systems with the company’s TPU v4 processor technology, Google engineers submitted results for two hugely outsized natural language processing neural nets.

Using its publicly available TPU v4 cloud, the company ran a version of Lingvo, an NLP model with a whopping 480 billion parameters, compared with BERT’s 110 million. The cloud platform used 1024 AMD Epyc CPUs and 2048 TPUs to complete the training task in just under 20 hours. Using a research system consisting of 512 AMD Rome CPUs and 1024 TPUs, Google trained a 200-billion-parameter version of Lingvo in 13.5 hours. (It took 55 hours and 44 hours, respectively, to do the whole process end-to-end, including steps needed to get the training started, Google reports.)
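
For a sense of scale, here’s a minimal sketch comparing those model sizes and reported training times; the parameter counts and hours are the approximate figures quoted above.

```python
# How much larger Google's Lingvo runs were than the BERT benchmark model,
# using the approximate parameter counts and wall-clock times quoted in the article.

bert_params = 110e6   # BERT: 110 million parameters

lingvo_runs = {
    "Lingvo, 480B parameters (cloud, 2048 TPUs)":    {"params": 480e9, "hours": 20.0},
    "Lingvo, 200B parameters (research, 1024 TPUs)": {"params": 200e9, "hours": 13.5},
}

for name, run in lingvo_runs.items():
    ratio = run["params"] / bert_params
    print(f"{name}: ~{ratio:,.0f}x the parameters of BERT, "
          f"trained in about {run['hours']} hours")
```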

Structurally, Lingvo is similar enough to BERT to fit into that category, but it also resembles the other really large conversational AIs, such as LaMDA and GPT-3, that computing giants have been working on. Google thinks that huge-model training should eventually become part of future MLPerf commercial benchmarks. (For more, see Google's blog.)

However, MLCommons’ Kanter points out that the expense of training such systems is high enough to exclude many participants.

The Conversation (1)

James Brady, 09 Dec 2021

The comparison to Moore's Law is a little weak, as the 1965 form of Moore's Law required doubling each year and improved only by scaling, whereas MLPerf(tm) gets improvement in many dimensions, including scaling, parallelism, algorithms, and software path-length reductions.

Otherwise it was interesting to see what was happening in this arena.