The New Math of Machine Learning
New number formats and fundamental computations emerge to accelerate AI training
Recent advances in AI have been astounding, but so have the costs of training neural networks to perform such feats. The most complicated ones, like the language model GPT-3 and the art generator DALL-E 2, take months to train on a cluster of high-performance GPUs, cost millions of dollars, and use millions of billions of basic computations.
The training capability of processing units has been expanding rapidly, nearly doubling in the past year. To keep up with that trend, researchers are rethinking the way computers represent numbers, one of the most basic building blocks of computation.
In his keynote address at the recent IEEE Symposium on Computer Arithmetic, Bill Dally, Nvidia's chief scientist and senior vice president of research, said, "We got a thousand times improvement [in training performance per chip] over the last ten years, and a lot of it has been due to number representation."
The 32-bit floating-point representation, commonly known as single precision or FP32, was among the first casualties in the march toward more efficient AI training. Machine-learning researchers have been trying to achieve the same quality of training with numbers represented by fewer bits, in order to improve speed and energy efficiency and to make better use of chip area and memory. For competitors vying to replace the 32-bit format, the field is still wide open, both in the choice of number representation and in how basic arithmetic is performed with those numbers. Here are some of the recent advances and top contenders revealed at the conference last month:
Nvidia's contribution is an experimental chip that performs key calculations with numbers just 4 bits long. Despite the less precise numbers, the chip was able to maintain computational accuracy, at least for the part of the training process known as inference. Inference is performed on fully trained models to get an output, but it is also done many times over during training.
"We end up with 8-bit results from 4-bit precision," Dally explained to engineers and computer scientists at Nvidia.
Nvidia was able to shrink the numbers without sacrificing much accuracy by using a technique known as per-vector scaled quantization (VSQ). The fundamental concept is as follows: A 4-bit number can represent only 16 distinct values, so every number gets rounded to one of those 16 values, and the accuracy lost to that rounding is known as quantization error. But a scale factor can push or pull the 16 values along the number line, stretching or squeezing them to fit the numbers at hand and making the quantization error smaller or larger. Per-vector scaling assigns a separate scale factor to each small vector of values, so the 16 levels can be fitted snugly to each vector's range.
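Here is a minimal Python sketch of that idea, our own illustration rather than Nvidia's implementation; the function names and the choice of one scale factor per row are assumptions made for clarity.

```python
import numpy as np

def quantize_per_vector(x, bits=4):
    """Quantize each row of x to signed integers, giving each row its own scale.

    An illustrative sketch of per-vector scaled quantization: the scale factor
    stretches or shrinks the 16 representable 4-bit levels to fit each vector.
    """
    qmax = 2 ** (bits - 1) - 1                       # 7: largest positive 4-bit value
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # guard against all-zero vectors
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map the integers back to real values; the mismatch is the quantization error."""
    return q.astype(np.float32) * scale

# Two vectors with very different ranges each get the 16 levels fit to their own spread,
# so both incur only modest relative error.
x = np.array([[0.011, -0.018, 0.032, 0.005],
              [11.0, -18.0, 32.0, 5.0]], dtype=np.float32)
q, s = quantize_per_vector(x)
print(q)
print(dequantize(q, s))
```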
The experimental chip is still in development, and Nvidia engineers are figuring out how to apply these principles throughout the entire training process, not just inference. If it works, Dally says, a chip that combines 4-bit computation, VSQ, and other efficiency techniques could perform 10 times as many operations per watt as Nvidia's Hopper GPU.
Another contender is the posit, a number format designed to wring more accuracy out of a given number of bits. Researchers at the Complutense University of Madrid have now created the first processor core that implements the posit standard in hardware, demonstrating that the accuracy of a basic computational task can improve by up to four orders of magnitude compared with computing using standard 32-bit floating-point numbers.
Posits have an advantage because of how the numbers they represent are distributed along the number line. In the middle of the number line, between -1 and 1, there are more posit representations than floating-point ones. And for the large negative and positive numbers out in the wings, posit accuracy falls off more gracefully than floating point's.
Posits achieve this improved accuracy between -1 and 1 by including an extra component in their representation. Floats are made up of three parts: a sign bit (0 for positive, 1 for negative), several "mantissa" (fraction) bits that specify what comes after the binary equivalent of a decimal point, and the remaining bits, which define the exponent (2^exp).
Posits retain all of the components of a float but add a "regime" section: an exponent of an exponent. The regime's beauty is that its bit length can vary. For numbers near 1, it can take as few as two bits, leaving more bits for the mantissa. That is what gives posits their extra accuracy in the sweet spot between -1 and 1.
[Chart: binary accuracy, in bits, as a function of exponent for the IEEE 32-bit float and the 32-bit posit with a 2-bit exponent field, Posit(32,2).]
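To make the layout concrete, here is a small Python sketch, our own rather than the Madrid team's hardware design, that decodes a tiny 8-bit posit with no separate exponent bits; the function name and defaults are illustrative.

```python
def decode_posit(bits, n=8, es=0):
    """Decode an n-bit posit (with es exponent bits) into a Python float.

    Layout after the sign bit: a variable-length "regime" (a run of identical
    bits ended by the opposite bit), then up to es exponent bits, then the
    fraction. A short regime leaves more bits for the fraction.
    """
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):                  # the single "not a real" pattern
        return float("nan")

    sign = -1 if bits >> (n - 1) else 1
    if sign < 0:                              # negatives are stored in two's complement
        bits = (-bits) & ((1 << n) - 1)

    body = bits & ((1 << (n - 1)) - 1)        # everything after the sign bit
    body_bits = [(body >> i) & 1 for i in range(n - 2, -1, -1)]

    first = body_bits[0]                      # measure the regime's run length
    run = 1
    while run < len(body_bits) and body_bits[run] == first:
        run += 1
    k = run - 1 if first == 1 else -run       # the regime value

    rest = body_bits[run + 1:]                # skip the bit that ended the run
    exp_bits, frac_bits = rest[:es], rest[es:]

    e = 0
    for b in exp_bits:
        e = (e << 1) | b
    e <<= es - len(exp_bits)                  # missing exponent bits count as zero

    f = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(frac_bits))

    useed = 2 ** (2 ** es)
    return sign * (useed ** k) * (2 ** e) * (1 + f)

print(decode_posit(0b01000000))   # 1.0  (two-bit regime, all fraction bits free)
print(decode_posit(0b01001000))   # 1.25
print(decode_posit(0b00000001))   # 0.015625, the smallest positive posit in this configuration
```

With larger n and es, the same decoding rules describe formats like the Posit(32,2) in the comparison above.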
A team from ETH Zurich and the University of Bologna has developed a RISC-V instruction-set extension that includes efficient computations combining lower- and higher-precision number representations. With this improved mixed-precision math, the researchers can speed up basic calculations used in training neural networks by nearly a factor of two.
Lowering precision has consequences during basic operations that go beyond the loss of accuracy from the shorter numbers themselves. Multiply two low-precision numbers and the result can be too small or too large to represent in that bit length, phenomena known as underflow and overflow. Add a large low-precision number to a small one and a separate phenomenon, known as "swamping," can wipe out the smaller number entirely.
In all three cases, overflow, underflow, and swamping, mixed precision comes to the rescue. Computations start with low-precision inputs and end with higher-precision outputs, letting a batch of math be done before the result is rounded back down.
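The swamping effect is easy to reproduce with NumPy's 16-bit floats, which stand in here for the low-precision formats under discussion; this is an illustrative sketch, not any of the researchers' code.

```python
import numpy as np

# Swamping: near 2048, consecutive float16 values are 2 apart, so adding 1 does nothing.
print(np.float16(2048) + np.float16(1))       # 2048.0 -- the small addend vanishes

# Overflow: float16 maxes out at 65504, so doubling it produces inf (NumPy will warn).
print(np.float16(65504) * np.float16(2))      # inf

# The same swamping happens gradually when many tiny terms are summed in low precision.
tiny = np.float16(1e-4)
acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for _ in range(10_000):
    acc16 = np.float16(acc16 + tiny)              # rounded back to float16 every step
    acc32 = np.float32(acc32 + np.float32(tiny))  # same inputs, wider accumulator
print(acc16, acc32)   # the float16 sum stalls around 0.25, far short of the expected ~1.0
```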
The team from Europe concentrated on a fundamental component of AI computations: the dot product. In hardware it is typically implemented as a series of fused multiply-add (FMA) units, which perform the operation d = a*b + c in a single step, rounding only at the end. To take advantage of mixed precision, the inputs a and b should be low precision (say, 8 bits), while c and the output d should be higher precision (say, 16 bits).
Luca Benini, chair of digital circuits and systems at ETH Zurich and a professor at the University of Bologna, and his team had a simple idea: Why not perform two FMA operations in parallel and add their results together at the end? That avoids the data lost to rounding between two back-to-back FMAs, and it makes better use of memory, since no registers sit empty while the second FMA finishes.
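A rough software analogue of that structure, using float16 inputs and a float32 accumulator in place of the team's RISC-V hardware (the function name and vector sizes are our own), looks like this:

```python
import numpy as np

def mixed_precision_dot(a, b):
    """Dot product in the style described above: low-precision inputs (float16 here),
    products combined in pairs, accumulated in a higher-precision float32 register."""
    assert len(a) == len(b) and len(a) % 2 == 0
    acc = np.float32(0.0)                  # the high-precision c / d of the FMA
    for i in range(0, len(a), 2):
        # Two low-precision products computed "in parallel" and combined before
        # being folded into the accumulator -- the sum-of-two-products idea.
        pair = np.float32(a[i]) * np.float32(b[i]) + np.float32(a[i + 1]) * np.float32(b[i + 1])
        acc = np.float32(acc + pair)
    return acc

a = np.arange(8, dtype=np.float16) / 8     # low-precision inputs
b = np.ones(8, dtype=np.float16)
print(mixed_precision_dot(a, b))           # 3.5, accumulated without losing small terms
```

In the actual hardware, the two products and the running sum are combined in one wide datapath and rounded once, which is what preserves accuracy; the parallelism and better register use are what save time.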
The group designed and simulated the paired mixed-precision dot-product unit and found that computation time was nearly cut in half while output accuracy improved, especially for dot products of large vectors. They are now building the new architecture in silicon to see whether the simulation's predictions hold up.
Engineers at the Barcelona Supercomputing Center and Intel claim that for deep neural-network training, "Brain Float 16 is all you need." And, no, they are not fans of the Beatles. They like the number format Brain Float, which was created by Google Brain.
The Brain Float (BF16) format is a simplified version of the standard 32-bit floating-point representation (FP32). FP32 has one sign bit, eight exponent bits, and 23 mantissa bits to represent numbers. A BF16 number is simply an FP32 number with 16 mantissa bits removed. In this way, BF16 gives up precision but uses only half as much memory as FP32 and keeps the same range of values.
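Because BF16 shares FP32's sign and exponent bits, the conversion amounts to rounding away the bottom 16 bits of the float32 bit pattern. Here is a short Python sketch of that, our own illustration rather than any library's routine; the helper name is made up, and NaN/infinity handling is omitted.

```python
import numpy as np

def to_bf16(x):
    """Round float32 values to bfloat16 precision (returned in float32 containers).

    BF16 keeps FP32's sign bit and 8 exponent bits but only the top 7 mantissa
    bits, so we emulate it by rounding off the low 16 bits of each float32.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    rounded = bits + 0x7FFF + ((bits >> 16) & 1)     # round to nearest, ties to even
    return (rounded & np.uint32(0xFFFF0000)).view(np.float32)

x = np.float32(3.14159265)
print(to_bf16(x))   # ~3.140625 -- only two to three decimal digits of precision survive
```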
Brain floats are already widely used in AI training, including for the DALL-E image generator. They have always been used, however, in combination with higher-precision floating-point numbers, with conversions between the two formats for different parts of the computation. That's because, while some aspects of neural-network training are unaffected by BF16's reduced precision, others suffer: small-magnitude values, in particular, can be lost during fused multiply-add computations.
The Barcelona and Intel teams worked around these issues and demonstrated that BF16 can be used exclusively to train cutting-edge AI models. In simulations, that training required only a third of the chip area and a third of the energy that FP32 needed.
To perform AI training entirely in BF16 without fear of losing small values, the Barcelona team created an extended number format along with a custom FMA unit. The extended format, BF16-N, combines several BF16 numbers to represent a single real number. Rather than defeating the purpose of using fewer-bit formats, combining BF16s enables much more efficient FMA operations without sacrificing significant precision. The key to that counterintuitive result is the silicon area (and therefore power) an FMA unit requires, which grows with the square of the number of significand bits (the mantissa plus one implicit leading bit). With its 24 significand bits, an FMA for FP32 numbers requires 576 units of area. However, a clever FMA that produces comparable results using BF16-2 (that is, with N equal to 2) requires only 192.
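As a rough illustration of the combining idea (our own sketch, not the Barcelona team's definition of BF16-N), a single float32 value can be carried as two BF16 numbers, a leading term plus a correction, that together recover most of the precision either one alone would lose:

```python
import numpy as np

def to_bf16(x):
    """Round float32 values to bfloat16 precision (same helper as in the earlier sketch)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    rounded = bits + 0x7FFF + ((bits >> 16) & 1)     # round to nearest, ties to even
    return (rounded & np.uint32(0xFFFF0000)).view(np.float32)

def split_bf16_2(x):
    """Represent a float32 value as the sum of two BF16 numbers (a 'BF16-2' pair)."""
    hi = to_bf16(x)                        # leading BF16 term
    lo = to_bf16(np.float32(x) - hi)       # correction: the part hi couldn't capture
    return hi, lo

x = np.float32(3.14159265)
hi, lo = split_bf16_2(x)
print(hi, lo, hi + lo)   # hi + lo recovers x far more closely than hi alone
```

Arithmetic on such pairs can then be built from several narrow BF16 multiplications instead of one wide FP32 one, which is where the FMA unit's area savings come from.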