Recent developments in AI have been astounding, but so have the costs of training neural networks to perform their astounding feats. The biggest models, such as the language model
GPT-3 and the art generator DALL-E 2, take several months to train on a cluster of high-performance GPUs, costing millions of dollars and consuming millions of billions of billions of basic computations.

The training capabilities of processing units have been
growing rapidly, as much as doubling in the last year. To keep the trend going, researchers are digging down into the most basic building blocks of computation, the way computers represent numbers.

“We got a thousand times improvement [in training performance per chip] over the last 10 years, and a lot of it has been due to number representation,” Bill Dally, chief scientist and senior vice president of research at Nvidia, said in his keynote talk at the recent IEEE Symposium on Computer Arithmetic.

Among the first casualties in the march toward more efficient AI training is the 32-bit floating-point number representation, colloquially known as “single precision.” Seeking speed, energy efficiency, and better use of chip area and memory, machine-learning researchers have been trying to achieve the same level of training using numbers represented by fewer bits. The field is still wide open for competitors trying to take the place of the 32-bit format, both in the number representation itself and in the way basic arithmetic is done using those numbers. Here are some recent advances and top contenders unveiled last month at the conference:

- Nvidia’s per-vector scaling scheme
- Posits, a new kind of number
- Taking the risk out of RISC-V’s math
- BF16 is all you need

## Nvidia’s Per-Vector Scaling Scheme

Image creator DALL-E was trained on clusters of Nvidia’s A100 GPUs with a combination of the standard 32-bit numbers and the lower-precision 16-bit numbers. Nvidia’s recently announced Hopper GPU includes support for even smaller, 8-bit floating-point numbers. Now Nvidia has developed a chip prototype that takes this trend a step further by using a combination of 8-bit and 4-bit numbers.

The chip managed to maintain computational accuracy despite using less-precise numbers, at least for one part of the process, called inference. Inference is performed on fully trained models to get an output, but it’s also done repeatedly during training.

“We end up getting 8-bit results from 4-bit precision,” Nvidia’s Dally told engineers and computer scientists.

Nvidia was able to cut down the size of numbers without significant accuracy loss thanks to a technique the company calls per-vector scaled quantization (VSQ). The basic idea goes something like this: A 4-bit number can represent only 16 values exactly. So, every number gets rounded to one of those 16 values. The loss of accuracy due to this rounding is called quantization error. However, you can add a scale factor to uniformly squeeze the 16 values closer together on the number line or pull them further apart, decreasing or increasing the quantization error.

Nvidia's per-vector scaled quantization (VSQ) scheme better represents the numbers needed in machine learning than standard formats such as INT4. Nvidia

The trick is to squeeze or expand those 16 values so they optimally match the range of numbers you actually need to represent in a neural network. This scaling is different for different sets of data. By fine-tuning this scaling parameter for every set of 64 numbers inside the neural-network model, Nvidia researchers were able to minimize quantization error. The overhead from calculating the scaling factors is negligible, they found. But the energy efficiency doubled as the 8-bit representation was reduced to 4-bit.
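The per-block scaling idea can be sketched in a few lines of NumPy. This is an illustration of the principle, not Nvidia's implementation; the only details carried over from the description above are the block size of 64 and the symmetric 4-bit integer range:

```python
import numpy as np

def quantize_vsq(x, block=64, bits=4):
    """Quantize x to signed `bits`-bit integers with one scale per block."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for symmetric 4-bit ints
    x = x.reshape(-1, block)                    # one row per 64-number vector
    scales = np.abs(x).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                   # guard all-zero blocks
    q = np.round(x / scales).astype(np.int8)    # 4-bit payload held in int8
    return q, scales

def dequantize_vsq(q, scales):
    return (q * scales).reshape(-1)

# Blocks with very different magnitudes, as happens across a network's layers.
rng = np.random.default_rng(0)
x = (rng.normal(size=(4, 64)) *
     np.array([[10.0], [1.0], [0.1], [0.01]])).astype(np.float32).reshape(-1)

q, scales = quantize_vsq(x)
err_vsq = np.abs(dequantize_vsq(q, scales) - x).mean()

# Baseline: one global scale for all 256 numbers (plain 4-bit quantization).
gscale = np.abs(x).max() / 7
err_global = np.abs(np.round(x / gscale) * gscale - x).mean()
```

Because each block of 64 numbers gets a scale matched to its own range, the mean quantization error comes out far smaller than with a single global scale.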

The experimental chip is still under development, and Nvidia engineers are working on how to use these principles during the full training process, not just in inferencing. If successful, a chip combining 4-bit computations, VSQ, and other efficiency improvements could achieve 10 times as many operations per watt as the Hopper GPU, said Dally.

## Posits, a New Kind of Number

Back in 2017, two researchers, John Gustafson and Isaac Yonemoto, developed an entirely new way of representing numbers: posits.

Now, a team of researchers at the Complutense University of Madrid has developed the first processor core implementing the posit standard in hardware and shown that, bit for bit, the accuracy of a basic computational task increased by up to four orders of magnitude, as compared to computing using standard floating-point numbers.

The advantage of posits comes from the way the numbers they represent exactly are distributed along the number line. In the middle of the number line, around 1 and –1, there are more posit representations than floating point. And at the wings, going out to large negative and positive numbers, posit accuracy falls off more gracefully than floating point.

“It’s a better match for the natural distribution of numbers in a calculation,” says Gustafson. “It’s the right dynamic range, and it’s the right accuracy where you need more accuracy. There’s an awful lot of bit patterns in floating-point arithmetic no one ever uses. And that’s a waste.”

Posits accomplish this improved accuracy around 1 and –1 thanks to an extra component in their representation. Floats are made up of three parts: a sign bit (0 for positive, 1 for negative), several “mantissa” (fraction) bits denoting what comes after the binary version of a decimal point, and the remaining bits defining the exponent (2^exp).

Posits keep all the components of a float but add an extra “regime” section, an exponent of an exponent. The beauty of the regime is that it can vary in bit length. For small numbers, it can take as few as two bits, leaving more precision for the mantissa. This allows for the higher accuracy of posits in their sweet spot around 1 and –1.
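To make the layout concrete, here is a minimal Python decoder for the posit format described above, assuming the 2022 Posit Standard's two exponent bits (es = 2). The Madrid team implemented this logic in hardware; this sketch only mirrors the encoding:

```python
def decode_posit(bits, nbits=8, es=2):
    """Decode an unsigned integer holding an nbits-wide posit into a float.

    Layout after the sign bit: a variable-length "regime" (a run of
    identical bits ended by its complement), then up to `es` exponent
    bits, then fraction bits.
    """
    if bits == 0:
        return 0.0
    if bits == 1 << (nbits - 1):          # 1000...0 encodes NaR ("not a real")
        return float("nan")

    sign = bits >> (nbits - 1)
    if sign:                              # negative posits are two's complements
        bits = (-bits) & ((1 << nbits) - 1)

    # Collect the bits after the sign, most significant first.
    rest = [(bits >> i) & 1 for i in range(nbits - 2, -1, -1)]

    # Regime: the length of the leading run sets the scale factor k.
    run = 1
    while run < len(rest) and rest[run] == rest[0]:
        run += 1
    k = run - 1 if rest[0] == 1 else -run
    rest = rest[run + 1:]                 # drop the regime and its terminator

    # Exponent: up to `es` bits; missing bits count as zero.
    exp_bits = rest[:es] + [0] * (es - len(rest[:es]))
    e = int("".join(map(str, exp_bits)), 2) if exp_bits else 0

    # Fraction: whatever is left, with an implicit leading 1.
    frac = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(rest[es:]))

    value = (1.0 + frac) * 2.0 ** (k * 2 ** es + e)
    return -value if sign else value
```

For example, `decode_posit(0b01000000)` gives 1.0: a one-bit regime (k = 0) leaves five bits for exponent and fraction, which is where the format's extra precision near 1 comes from.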

Posits differ from floating-point numbers by the addition of an extra variable-length regime section. This gives posits a better accuracy around zero, where the majority of numbers used in neural networks reside. Complutense University of Madrid/IEEE

With their new hardware implementation, which was synthesized on a field-programmable gate array (FPGA), the Complutense team was able to compare computations done using 32-bit floats and 32-bit posits side by side. They assessed their accuracy by comparing them to results using the much more accurate but computationally costly 64-bit floating-point format. Posits showed an astounding four-order-of-magnitude improvement in the accuracy of matrix multiplication, a crucial calculation in neural networks. They also found that the improved accuracy didn’t come at the cost of computation time, only a somewhat increased chip area and power consumption.

## Making RISC-V’s Math Less Risky

Computer scientists in Switzerland and Italy have worked out a bit-reducing scheme for processors using the open-source RISC-V instruction-set architecture, which is gaining ground among new processor developers.

The team’s extension to the RISC-V instruction set includes an efficient version of computations that mix lower- and higher-precision number representations. With their improved mixed-precision math, they obtain a factor-of-two speedup in basic computations involved in training neural networks.

Lowering precision has knock-on effects during basic operations that go beyond just the loss of accuracy due to the shorter number. Multiplying two low-precision numbers can result in a number that is too small or too large to represent given the bit length—phenomena called underflow and overflow, respectively. A separate phenomenon, called swamping, happens when you add a large low-precision number and a small low-precision number. The result can be the complete loss of the smaller number.
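All three failure modes are easy to reproduce with NumPy's 16-bit float type:

```python
import numpy as np

with np.errstate(over="ignore", under="ignore"):
    # Overflow: the product exceeds float16's maximum (~65504) and becomes inf.
    overflowed = np.float16(300.0) * np.float16(300.0)

    # Underflow: the product falls below float16's smallest subnormal (~6e-8)
    # and is rounded all the way down to zero.
    vanished = np.float16(1e-4) * np.float16(1e-4)

# Swamping: at 2048 the gap between adjacent float16 values is 2,
# so adding 1 changes nothing at all.
swamped = np.float16(2048.0) + np.float16(1.0)
```

In the last line the smaller number is completely lost: the sum is still exactly 2048.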

Mixed precision comes to the rescue in mitigating overflow, underflow, and swamping. Computations are performed with low-precision inputs and result in higher-precision outputs, getting a batch of math done before rounding back to lower precision.

The Europe-based team focused on a basic component of AI computations: the dot product. It’s usually implemented in hardware with a series of components called fused multiply-add units (FMAs). These perform the operation *d* = *a* × *b* + *c* in one go, only rounding at the end. To reap the benefits of mixed precision, inputs *a* and *b* are low precision (say, 8 bits), while *c* and the output *d* are high precision (say, 16 bits).
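A simple NumPy experiment shows why the higher-precision accumulator matters. This mimics only the numerical idea, not the team's RISC-V hardware, and uses 16-bit inputs with a 32-bit accumulator rather than the 8-and-16-bit pairing above:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4096).astype(np.float16)
y = rng.normal(size=4096).astype(np.float16)

# Reference: the same float16 inputs accumulated in float64.
ref = np.dot(x.astype(np.float64), y.astype(np.float64))

# Pure low precision: products and the running sum stay in float16,
# so rounding (and swamping) error piles up at every step.
acc16 = np.float16(0.0)
for a, b in zip(x, y):
    acc16 = np.float16(acc16 + a * b)

# Mixed precision: float16 inputs, float32 accumulator. The product of
# two 11-bit significands fits exactly in float32, as inside an FMA.
acc32 = np.float32(0.0)
for a, b in zip(x, y):
    acc32 += np.float32(a) * np.float32(b)

err16 = abs(float(acc16) - ref)
err32 = abs(float(acc32) - ref)
```

The mixed-precision sum lands orders of magnitude closer to the reference than the all-float16 version.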

Luca Benini, chair of digital circuits and systems at ETH Zurich and professor at the University of Bologna, and his team had a simple insight: Instead of doing one FMA operation at a time, why not do two in parallel and add them together at the end? Not only does this prevent losses due to rounding between the two FMAs, but it also allows for better utilization of memory, because no memory registers are left waiting for the previous FMA to complete.

The group designed and simulated the parallel-mixed precision dot-product unit and found the computation time was nearly cut in half and the output accuracy improved, especially for dot products of large vectors. They are currently building their new architecture in silicon to prove the simulation’s predictions.

## Brain Float 16 Can Do It All

Engineers at the Barcelona Supercomputing Center and Intel claim that “Brain Float 16 is all you need” for deep neural-network training. And no, they are not Beatles fans. They are fans of the number format Brain Float—Google Brain’s brainchild.

Brain Float (BF16) is a pared-down version of the standard 32-bit floating-point representation (FP32). FP32 represents numbers with one sign bit, 8 exponent bits, and 23 mantissa bits. A BF16 number is just an FP32 number with 16 mantissa bits chopped off. This way, BF16 sacrifices precision but takes up half the space in memory and preserves the dynamic range of FP32.
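The chop is easy to demonstrate in a few lines of NumPy. This sketch uses plain truncation of the low 16 bits; hardware converters typically round to nearest even instead:

```python
import numpy as np

def fp32_to_bf16(x):
    """Truncate a float32 to BF16 by zeroing its low 16 mantissa bits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # Keep the sign bit, all 8 exponent bits, and the top 7 mantissa bits.
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

pi = np.float32(3.14159265)
pi_bf16 = fp32_to_bf16(pi)       # 3.140625: same range, 7 bits of fraction
```

Because the exponent field is untouched, the BF16 result covers exactly the same dynamic range as the original float32; only the fraction is coarser.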

Brain Floats are already in wide use in AI training, including the training of the image creator DALL-E. However, they’ve always been used in combination with higher-precision floating-point numbers, converting back and forth between the two for different parts of the computation. That’s because while some parts of neural-network training don’t suffer much from the reduced precision of BF16, others do. In particular, small-magnitude values can get lost during fused multiply-add computations.

The Barcelona and Intel team has found a way around these issues and showed that BF16 can be used exclusively to train state-of-the-art AI models. What’s more, in simulation, the training used one-third of the chip area and energy FP32 would need.

To perform AI training entirely using BF16 without having to worry about losing small values, the Barcelona team developed both an extended-number format and a bespoke FMA unit. The number format, called BF16-N, combines several BF16 numbers to represent one real number. Far from defeating the point of using fewer-bit formats, combining BF16 numbers allows for much more efficient FMA operations without sacrificing significant precision. The key to that counterintuitive result lies in the silicon area (and therefore power) requirements of FMA units, which grow with the square of the significand width. An FMA for FP32 numbers, with their 24-bit significands (23 stored mantissa bits plus an implicit leading 1), would require 24² = 576 units of area. But a clever FMA yielding comparable results with BF16-N where N is 2 takes 192 units.
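A rough Python sketch shows the BF16-N idea for N = 2, in the spirit of classic double-double arithmetic: the first BF16 holds the coarse value, and a second BF16 holds the residual. This is an illustration of the splitting principle only, not the Barcelona team's format:

```python
import numpy as np

def fp32_to_bf16(x):
    """Truncate a float32 to BF16 by zeroing its low 16 mantissa bits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def split_bf16_2(x):
    """Represent x as a sum of two BF16 components (BF16-N with N = 2)."""
    hi = fp32_to_bf16(x)                    # coarse part: top 7 mantissa bits
    lo = fp32_to_bf16(np.float32(x) - hi)   # residual captures the next bits
    return hi, lo

x = np.float32(3.14159265)
hi, lo = split_bf16_2(x)
recovered = np.float32(hi + lo)
```

Summing the two components recovers the original float32 value far more closely than the first BF16 alone, which is what lets the narrow FMA units work on wide values.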

Three 16-bit Brain Float numbers can carry the same precision as 32-bit floats while saving chip area during computations. John Osorio/Barcelona Supercomputing Center

Eliminating the need for mixed precision and focusing exclusively on BF16 operations opens the possibility for much cheaper hardware, says Marc Casas, senior researcher at the Barcelona Supercomputing Center and leader of this effort. For now, they have emulated their solution only in software. Now they are working on implementing the Brain-Float-only approach on an FPGA to see if the performance improvements hold.

*Portions of this article already appeared in a previous post.*