Cerebras’s Giant Chip Will Smash Deep Learning’s Speed Barrier
Computers using Cerebras’s chip will train these AI systems in hours instead of weeks
The problem, as Feldman and his fellow Cerebras founders see it, is that today's artificial neural networks are too time-consuming and compute-intensive to train. For, say, a self-driving car to recognize all the important objects it will encounter on the road, the car's neural network has to be shown many, many images of all those things. That process happens in a data center where computers consuming tens or sometimes hundreds of kilowatts are dedicated to what is too often a weeks-long task. Assuming the resulting network can carry out the task with the needed accuracy, the many coefficients that define the strength of connections in the network are then downloaded to the car's computer, which performs the other half of deep learning, called inference.
Cerebras's customers—and it already has some, despite emerging from stealth mode only this past summer—complain that training runs for big neural networks on today's computers can take as long as six weeks. At that rate, they are able to train only maybe six neural networks in a year. “The idea is to test more ideas,” says Feldman. “If you can [train a network] instead in 2 or 3 hours, you can run thousands of ideas.”
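The arithmetic behind that claim is easy to check. The sketch below is a back-of-the-envelope calculation, not anything Cerebras has published: it assumes a single machine running training jobs back to back around the clock, and the helper name `runs_per_year` is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def runs_per_year(hours_per_run: float) -> int:
    """Whole back-to-back training runs that fit in one year on one machine."""
    return int(HOURS_PER_YEAR // hours_per_run)


# Six weeks per run, the figure customers complain about.
six_week_runs = runs_per_year(6 * 7 * 24)

# Two or three hours per run, the figure Feldman is aiming for.
three_hour_runs = runs_per_year(3)
two_hour_runs = runs_per_year(2)

print(six_week_runs, three_hour_runs, two_hour_runs)
```

Under these idealized assumptions, a six-week run fits into a year only a handful of times, while a run of 2 or 3 hours fits thousands of times. Real schedules include setup, failed runs, and idle time, so actual counts would be lower.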
When IEEE Spectrum visited Cerebras's headquarters in Los Altos, Calif., those customers and some potential new ones were already pouring their training data into four CS-1 computers through orange-jacketed fiber-optic cables. These 64-centimeter-tall machines churned away, while the heat exhaust of the 20 kilowatts being consumed by each blew out into the Silicon Valley streets through a hole cut into the wall.
The CS-1 computers themselves weren't much to look at from the outside. Indeed, about three-quarters of each chassis is taken up with the cooling system. What's inside that last quarter is the real revolution: a hugely powerful computer made up almost entirely of a single chip. But that one chip extends over 46,225 square millimeters—more than 50 times the size of any other processor chip you can buy. With 1.2 trillion transistors, 400,000 processor cores, 18 gigabytes of SRAM, and interconnects capable of moving 100 million billion bits per second, Cerebras's Wafer Scale Engine (WSE) defies easy comparison with other systems.
The statistics Cerebras quotes are pretty astounding. According to the company, a 10-rack TPU2 cluster—the second of what are now three generations of Google AI computers—consumes five times as much power and takes up 30 times as much space to deliver just one-third of the performance of a single computer with the WSE. Whether a single massive chip is really the answer the AI community has been waiting for should start to become clear this year. “The [neural-network] models are becoming more complex,” says Mike Demler, a senior analyst with the Linley Group, in Mountain View, Calif. “Being able to quickly train or retrain is really important.”
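Taken at face value, the company's three ratios combine into efficiency figures. The sketch below only rearranges the numbers Cerebras quotes and adds no data; it normalizes one CS-1 to 1 in performance, power, and space, and the variable names are illustrative.

```python
from fractions import Fraction

# Cerebras's quoted comparison, with a single CS-1 normalized to 1 in every dimension.
tpu_performance = Fraction(1, 3)  # one-third of the CS-1's performance
tpu_power = Fraction(5)           # five times the power
tpu_space = Fraction(30)          # 30 times the space

# Performance per unit of power and per unit of space for the TPU2 cluster.
tpu_perf_per_watt = tpu_performance / tpu_power
tpu_perf_per_space = tpu_performance / tpu_space

# The CS-1 scores 1 on both measures, so its advantage is the reciprocal.
perf_per_watt_advantage = 1 / tpu_perf_per_watt
perf_per_space_advantage = 1 / tpu_perf_per_space

print(perf_per_watt_advantage, perf_per_space_advantage)
```

If the quoted figures hold, the claim amounts to roughly 15 times the performance per watt and 90 times the performance per unit of floor space. These are vendor numbers, not independently measured ones.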
Customers such as supercomputing giant Argonne National Laboratory, near Chicago, already have the machines on their premises, and if Cerebras's conjecture is true, the number of neural networks doing amazing things will explode.
When the founders of Cerebras—veterans of SeaMicro, a server business acquired by AMD—began meeting in 2015, they wanted to build a computer that perfectly fit the nature of modern AI workloads, explains Feldman. Those workloads are defined by a few things: They need to move a lot of data quickly, they need memory that is close to the processing core, and those cores don't need to work on data that other cores are crunching.
This suggested a few things immediately to the company's veteran computer architects, including Gary Lauterbach, its chief technical officer. First, they could use thousands and thousands of small cores designed to do the relevant neural-network computations, as opposed to fewer more general-purpose cores. Second, those cores should be linked together with an interconnect scheme that moves data quickly and at low energy. And finally, all the needed data should be on the processor chip, not in separate memory chips.
The need to move data to and from these cores was, in large part, what led to the WSE's uniqueness. The fastest, lowest-energy way to move data between two cores is to have them on the same silicon substrate. The moment data has to travel from one chip to another, there's a huge cost in speed and power because distances are longer and the “wires” that carry the signals must be wider and less densely packed.
The drive to keep all communications on silicon, coupled with the desire for small cores and local memory, all pointed to making as big a chip as possible, maybe one as big as a whole silicon wafer. “It wasn't obvious we could do that, that's for sure,” says Feldman. But “it was fairly obvious that there were big benefits.”
The Software Side of Cerebras
- Cerebras unveiled some details of the software side of the system in Denver at the supercomputing conference SC19 in November. The CS-1's software allows users to write their machine learning models using standard frameworks such as PyTorch and TensorFlow. It then sets about devoting variously sized portions of Cerebras's Wafer Scale Engine chip to layers of the neural network. It does this by solving an optimization problem that ensures that the layers all complete their work at roughly the same pace and are contiguous with their neighbors, so information can flow through the network without any holdups. The software can solve that optimization problem across multiple computers, allowing a cluster of computers to act as one big machine. Cerebras has linked as many as 16 CS-1s together to get a roughly 16-fold performance increase. This contrasts with how clusters based on graphics processing units behave, says Feldman. “Today, when you cluster GPUs, you don't get the behavior of one big machine. You get the behavior of lots of little machines.” —S.K.M.
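To see why giving layers differently sized portions of the chip keeps them in step, consider a toy version of the problem. The sketch below is not Cerebras's software, and it ignores the contiguity constraint entirely. It simply assumes that a layer's run time is its work divided by its core count, and splits a fixed pool of cores in proportion to work. The layer names, the work figures, and the function `allocate_cores` are all hypothetical.

```python
def allocate_cores(layer_work: dict[str, float], total_cores: int) -> dict[str, int]:
    """Split cores across layers in proportion to each layer's work.

    Uses largest-remainder rounding so the allocations sum to total_cores.
    """
    total_work = sum(layer_work.values())
    exact = {name: total_cores * work / total_work for name, work in layer_work.items()}
    alloc = {name: int(share) for name, share in exact.items()}

    # Hand any cores lost to rounding to the layers with the largest fractional parts.
    leftover = total_cores - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - alloc[name], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


# Hypothetical relative work per layer (arbitrary units).
layer_work = {"conv1": 1.0, "conv2": 4.0, "conv3": 3.0, "fc": 2.0}

alloc = allocate_cores(layer_work, 400_000)

# Under the stated assumption, time per layer is work divided by cores.
time_per_layer = {name: layer_work[name] / alloc[name] for name in layer_work}

print(alloc)
print(time_per_layer)
```

With a proportional split, every layer finishes in about the same time, so no layer sits idle waiting on a slower neighbor. The real optimizer must also place each block of cores next to its neighbors on the wafer, which this sketch does not attempt.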
For decades, engineers had assumed that a wafer-scale chip was a dead end. After all, no less a luminary than the late Gene Amdahl, chief architect of the IBM System/360 mainframe, had tried and failed spectacularly at it with a company called Trilogy Systems. But Lauterbach and Feldman say that any comparison with Amdahl's attempt is laughably out-of-date. The wafers Amdahl was working with were one-tenth the size of today's, and features that made up devices on those wafers were 30 times the size of today's.
More important, Trilogy had no way of handling the inevitable errors that arise in chip manufacturing. Everything else being equal, the likelihood of there being a defect increases as the chip gets larger. If your chip is nearly the size of a sheet of letter-size paper, then you're pretty much asking for it to have defects.
But Lauterbach saw an architectural solution: Because the workload they were targeting favors having thousands of small, identical cores, it was possible to fit in enough redundant cores to account for the defect-induced failure of even 1 percent of them and still have a very powerful, very large chip.
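A rough calculation shows why redundancy matters more than perfection at this scale. The sketch below uses the standard Poisson yield model, in which the chance of a defect-free die is exp(-D x A). Everything numeric here is an assumption for illustration: the defect density is hypothetical (the article gives none), the 800-square-millimeter die stands in for a large conventional chip, the WSE's area is rounded to 46,000 square millimeters, and each defect is assumed to disable at most one core.

```python
import math


def defect_free_probability(area_mm2: float, defects_per_cm2: float) -> float:
    """Poisson yield model: probability that a die of the given area has zero defects."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-defects_per_cm2 * area_cm2)


DEFECTS_PER_CM2 = 0.1      # hypothetical defect density, not a foundry figure
CONVENTIONAL_MM2 = 800.0   # hypothetical large conventional die
WAFER_SCALE_MM2 = 46_000.0 # WSE area, rounded

p_conventional = defect_free_probability(CONVENTIONAL_MM2, DEFECTS_PER_CM2)
p_wafer_scale = defect_free_probability(WAFER_SCALE_MM2, DEFECTS_PER_CM2)

# Expected number of defects on the wafer-scale chip under the same assumption.
expected_defects = DEFECTS_PER_CM2 * WAFER_SCALE_MM2 / 100.0

# The article's redundancy budget: failure of 1 percent of 400,000 cores.
tolerable_core_failures = 400_000 // 100

print(p_conventional, p_wafer_scale, expected_defects, tolerable_core_failures)
```

Under these assumptions a defect-free wafer-scale chip is essentially impossible, yet the expected few dozen defects are tiny next to a budget of 4,000 spare cores. That is the sense in which the problem moves from manufacturing to architecture.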
Of course, Cerebras still had to solve a host of manufacturing issues to build its defect-tolerant giganto chip. For example, photolithography tools are designed to cast their feature-defining patterns onto relatively small rectangles, and to do that over and over. That limitation alone would keep a lot of systems from being built on a single wafer, because of the cost and difficulty of casting different patterns in different places on the wafer.
But the WSE doesn't require that. It resembles a typical wafer full of the exact same chips, just as you'd ordinarily manufacture. The big challenge was finding a way to link those pseudochips together. Chipmakers leave narrow edges of blank silicon called scribe lines around each chip. The wafer is typically diced up along those lines. Cerebras worked with Taiwan Semiconductor Manufacturing Co. (TSMC) to develop a way to build interconnects across the scribe lines so that the cores in each pseudochip could communicate.
With all communications and memory now on a single slice of silicon, data could zip around unimpeded, producing a core-to-core bandwidth of 1,000 petabits per second and an SRAM-to-core bandwidth of 9 petabytes per second. “It's not just a little more,” says Feldman. “It's four orders of magnitude greater bandwidth, because we stay on silicon.”
Scribe-line-crossing interconnects weren't the only invention needed. Chip-manufacturing hardware had to be modified. Even the software for electronic design automation had to be customized for working on such a big chip. “Every rule and every tool and every manufacturing device was designed to pick up a normal-sized chocolate chip cookie, and [we] delivered something the size of the whole cookie sheet,” says Feldman. “Every single step of the way, we have to invent.”
Wafer-scale integration “has been dismissed for the last 40 years, but of course, it was going to happen sometime,” he says. Now that Cerebras has done it, the door may be open to others. “We think others will seek to partner with us to solve problems outside of AI.”
Cerebras Inside: The cooling system takes up most of the CS-1. The WSE chip is in the back left corner. Photo: Cerebras Systems
Indeed, engineers at the University of Illinois and the University of California, Los Angeles, see Cerebras's chip as a boost to their own wafer-scale computing efforts using a technology called silicon-interconnect fabric [see “Goodbye, Motherboard. Hello, Silicon-Interconnect Fabric,” IEEE Spectrum, October 2019]. “This is a huge validation of the research we've been doing,” says the University of Illinois's Rakesh Kumar. “We like the fact that there is commercial interest in something like this.”
The CS-1 is more than just the WSE chip, of course, but it's not much more. That's both by design and necessity. What passes for the motherboard is a power-delivery system that sits above the chip and a water-cooled cold plate below it. Surprisingly enough, it was the power-delivery system that was the biggest challenge in the computer's development.
The WSE's 1.2 trillion transistors are designed to operate at about 0.8 volts, pretty standard for a processor. There are so many of them, though, that in all they need 20,000 amperes of current. “Getting 20,000 amps into the wafer without significant voltage drop is quite an engineering challenge—much harder than cooling it or addressing the yield problems,” says Lauterbach.
Power can't be delivered from the edge of the WSE, because the resistance in the interconnects would drop the voltage to zero long before it reached the middle of the chip. The answer was to deliver it vertically from above. Cerebras designed a fiberglass circuit board holding hundreds of special-purpose chips for power control. One million copper posts bridge the millimeter or so from the fiberglass board to points on the WSE.
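The figures quoted for the power system hang together under basic circuit arithmetic. The sketch below applies P = V x I and V = I x R to the article's numbers. The 10-micro-ohm series resistance is a hypothetical value chosen only to show how sensitive a 0.8-volt supply is at this current; it is not a measured property of the CS-1.

```python
SUPPLY_VOLTS = 0.8          # operating voltage quoted for the WSE
TOTAL_AMPS = 20_000         # total current quoted for the WSE
COPPER_POSTS = 1_000_000    # vertical power-delivery posts

# Power drawn by the chip itself: P = V * I.
chip_power_watts = SUPPLY_VOLTS * TOTAL_AMPS

# Average current carried by each post if the load is shared evenly.
amps_per_post = TOTAL_AMPS / COPPER_POSTS

# Hypothetical series resistance in a single shared delivery path (an assumption).
PATH_RESISTANCE_OHMS = 10e-6

# Voltage lost across that resistance at full current: V = I * R.
voltage_drop = TOTAL_AMPS * PATH_RESISTANCE_OHMS
drop_fraction = voltage_drop / SUPPLY_VOLTS

print(chip_power_watts, amps_per_post, voltage_drop, drop_fraction)
```

The chip alone accounts for about 16 kilowatts of the roughly 20 kilowatts each CS-1 consumes, and each post carries on the order of 20 milliamperes. In this hypothetical case, even ten millionths of an ohm in a shared path would cost a quarter of the supply voltage, which is why spreading the current over many short vertical paths is attractive.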
Delivering power in this way might seem straightforward, but it isn't. In operation, the chip, the circuit board, and the cold plate all warm up to the same temperature, but they expand when doing so by different amounts. Copper expands the most, silicon the least, and the fiberglass somewhere in between. Mismatches like this are a headache in normal-size chips because the change can be enough to shear away their connection to a printed circuit board or produce enough stress to break the chip. For a chip the size of the WSE, even a small percentage change in size translates to millimeters.
“The challenge of [coefficient of thermal expansion] mismatch with the motherboard was a brutal problem,” says Lauterbach. Cerebras searched for a material with the right intermediate coefficient of thermal expansion, something between those of silicon and fiberglass. Only that would keep the million power-delivery posts connected. But in the end, the engineers had to invent one themselves, an endeavor that took a year and a half to accomplish.
The WSE is obviously bigger than competing chips commonly used for neural-network calculations, like the Nvidia Tesla V100 graphics processing unit or Google's Tensor Processing Unit. But is it better?
In 2018, Google, Baidu, and some top academic groups began working on benchmarks that would allow apples-to-apples comparisons among systems. The result, MLPerf, released training benchmarks in May 2018.
According to those benchmarks, the technology for training neural networks has made some huge strides in the last few years. On the ResNet-50 image-classification problem, the Nvidia DGX SuperPOD—essentially a 1,500-GPU supercomputer—finished in 80 seconds. It took 8 hours on Nvidia's DGX-1 machine (circa 2017) and 25 days using the company's K80 from 2015.
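Converting those three ResNet-50 times to a common unit makes the size of the strides concrete. The sketch below uses only the times quoted above. The systems differ enormously in size and cost, so the ratios describe time to train and nothing about efficiency per chip.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

# ResNet-50 training times quoted in the article, in seconds.
superpod_seconds = 80
dgx1_seconds = 8 * SECONDS_PER_HOUR
k80_seconds = 25 * SECONDS_PER_DAY

# How many times faster the DGX SuperPOD finished than each older system.
speedup_vs_dgx1 = dgx1_seconds // superpod_seconds
speedup_vs_k80 = k80_seconds // superpod_seconds

print(speedup_vs_dgx1, speedup_vs_k80)
```

By this measure, the SuperPOD result is 360 times as fast as the DGX-1 figure and 27,000 times as fast as the K80 figure.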
Cerebras hasn't released MLPerf results or any other independently verifiable apples-to-apples comparisons. Instead the company prefers to let customers try out the CS-1 using their own neural networks and data.
This approach is not unusual, according to analysts. “Everybody runs their own models that they developed for their own business,” says Karl Freund, an AI analyst at Moor Insights. “That's the only thing that matters to buyers.”
Early customer Argonne National Laboratory, for one, has some pretty intense needs. In training a neural network to recognize, in real time, different types of gravitational-wave events, scientists recently used one-quarter of the resources of Argonne's megawatt-consuming Theta supercomputer, the 28th most powerful system in the world.
Cutting power consumption down to mere kilowatts seems like a key benefit in supercomputing. Unfortunately, Lauterbach doubts that this feature will be much of a selling point in data centers. “While a lot of data centers talk about [conserving] power, when it comes down to it...they don't care,” he says. “They want performance.” And that's something a processor nearly the size of a dinner plate can certainly provide.
This article appears in the January 2020 print issue as “Huge Chip Smashes Deep Learning's Speed Barrier.”