The New Math of Machine Learning

New number formats and fundamental computations emerge to accelerate AI training


Recent advances in AI have been astounding, but so have the costs of training neural networks to perform such feats. The most complicated ones, like the language model GPT-3 and the art generator DALL-E 2, take months to train on a cluster of high-performance GPUs, cost millions of dollars, and use millions of billions of basic computations.

The training capabilities of processing units have been expanding rapidly, nearly doubling in the last year. To keep pace with this trend, researchers are rethinking one of the most basic building blocks of computation: the way computers represent numbers.

In his keynote address at the recent IEEE Symposium on Computer Arithmetic, Bill Dally, Nvidia's chief scientist and senior vice president of research, said "We got a thousand times improvement [in training performance per chip] over the last ten years, and a lot of it has been due to number representation."

The 32-bit floating-point number representation, known as single precision, was among the first casualties in the march toward more efficient AI training. Machine-learning researchers have been trying to achieve the same quality of training with numbers represented by fewer bits, in order to improve speed and energy efficiency and to make better use of chip area and memory. For contenders hoping to replace the 32-bit format, the field is still wide open, both in terms of how numbers are represented and how basic arithmetic is performed with them. Here are some recent advances and top contenders revealed at the conference last month:


The per-vector scaling scheme from Nvidia
Posits, a new type of number
Taking the risk out of RISC-V math
Brain Float 16 is all you need

The Per-Vector Scaling Scheme from Nvidia
DALL-E, the image creator, was trained on clusters of Nvidia's A100 GPUs using a combination of standard 32-bit and lower-precision 16-bit numbers. The recently announced Nvidia Hopper GPU supports even smaller, 8-bit floating-point numbers. Nvidia has now created a chip prototype that takes this trend a step further by combining 8-bit and 4-bit numbers.

Despite using less precise numbers, the chip was able to maintain computational accuracy, at least for one part of the training process known as inference. Inference is done on fully trained models to produce an output, but it is also performed many times during training.

"We end up with 8-bit results from 4-bit precision," Dally explained to engineers and computer scientists at Nvidia.

Nvidia was able to shrink the size of its numbers without sacrificing significant accuracy by using a technique known as per-vector scaled quantization (VSQ). The fundamental concept is as follows: A 4-bit number can represent only 16 distinct values, so every number is rounded to one of those 16 values. The loss of accuracy caused by this rounding is known as quantization error. But a scale factor can squeeze the 16 values closer together or stretch them farther apart on the number line, making the quantization error smaller or larger.

Nvidia's per-vector scaled quantization (VSQ) scheme represents machine-learning numbers better than standard formats such as INT4. NVIDIA
The trick is to squeeze or expand those 16 values so that they best match the range of numbers a neural network needs to represent. This scaling varies depending on the data. Nvidia researchers minimized quantization error by fine-tuning the scaling parameter for each set of 64 numbers within the neural-network model. They found that the overhead of calculating the scaling factors was negligible, and that going from the 8-bit representation down to 4 bits doubled energy efficiency.
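
To make the idea concrete, here is a rough Python sketch of a per-vector scheme that picks one scale factor for every group of 64 values, using a simple largest-value calibration. The function names are purely illustrative, and Nvidia's actual calibration and hardware are more sophisticated.

import numpy as np

def vsq_quantize(weights, group_size=64):
    # One scale factor per group of 64 values, chosen so the largest value
    # in each group maps to the edge of the 4-bit range (-8 to 7).
    w = weights.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def vsq_dequantize(q, scales):
    # Map the 4-bit integers back to real values for comparison.
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=4096).astype(np.float32)

q, scales = vsq_quantize(weights)
restored = vsq_dequantize(q, scales)
print("mean quantization error, per-vector scales:", np.abs(weights - restored).mean())

# For contrast: a single scale factor shared by all 4,096 values.
one_scale = np.abs(weights).max() / 7.0
q_single = np.clip(np.round(weights / one_scale), -8, 7) * one_scale
print("mean quantization error, one global scale :", np.abs(weights - q_single).mean())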

The experimental chip is still in development, and Nvidia engineers are working out how to apply these principles throughout the entire training process, not just inference. If it works, Dally says, a chip that uses 4-bit computation, VSQ, and other efficiency measures could perform 10 times as many operations per watt as the Hopper GPU.

Posits, a Novel Type of Number 
In 2017, two researchers, John Gustafson and Isaac Yonemoto, created an entirely new way of representing numbers: the posit.

Now, researchers at the Complutense University of Madrid have created the first processor core that implements the posit standard in hardware, demonstrating that the accuracy of a basic computational task can be increased by up to four orders of magnitude when compared to computing with standard floating-point numbers.

Posits have an advantage because of the way the numbers they represent are distributed along the number line. In the middle of the number line, between -1 and 1, there are more posit representations than floating-point ones. And for the large negative and positive numbers in the wings, posit accuracy falls off more gracefully than floating point's.


"It matches the natural distribution of numbers in a calculation better," Gustafson says. "It has the right dynamic range and the right accuracy where you need it." There are many bit patterns in floating-point arithmetic that are never used. That is a waste. "

Posits achieve this improved accuracy between -1 and 1 by including an extra component in their representation. Floats are made up of three parts: a sign bit (0 for positive, 1 for negative), several "mantissa" (fraction) bits that specify what comes after the binary point, and the remaining bits, which define the exponent (2^exp).

Posits retain all of the components of a float but add a "regime" section, an exponent of an exponent. The beauty of the regime is that its bit length can vary: for numbers close to 1 in magnitude, it can take as few as two bits, leaving more bits for the mantissa. This gives posits their higher accuracy in the sweet spot between -1 and 1.
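
For the curious, here is how a posit bit pattern can be unpacked into an ordinary number in Python, following the sign, regime, exponent, and fraction layout described above. The code illustrates the published posit format in general; it is not a model of the Madrid group's hardware, and the function name is illustrative.

def decode_posit(bit_string, es=2):
    # Decode a posit, given as a string of 0s and 1s, with `es` exponent bits.
    n = len(bit_string)
    bits = int(bit_string, 2)
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):                  # 100...0 encodes NaR ("not a real")
        return float("nan")

    sign = -1.0 if bit_string[0] == "1" else 1.0
    if sign < 0:                              # negative posits are decoded from the
        bits = (-bits) & ((1 << n) - 1)       # two's complement of the bit pattern
        bit_string = format(bits, f"0{n}b")

    body = bit_string[1:]                     # everything after the sign bit

    # Regime: a run of identical bits, ended by the first opposite bit.
    run_bit = body[0]
    run = len(body) - len(body.lstrip(run_bit))
    k = run - 1 if run_bit == "1" else -run

    rest = body[run + 1:]                     # skip the run and its terminator
    exp_bits = rest[:es].ljust(es, "0")       # exponent bits cut off by the regime read as 0
    frac_bits = rest[es:]

    exponent = k * (1 << es) + int(exp_bits, 2)
    fraction = 1.0 + (int(frac_bits, 2) / (1 << len(frac_bits)) if frac_bits else 0.0)
    return sign * fraction * 2.0 ** exponent

# A few 16-bit posits with es = 2:
print(decode_posit("0100000000000000"))   #  1.0
print(decode_posit("1100000000000000"))   # -1.0
print(decode_posit("0011100000000000"))   #  0.5
print(decode_posit("0000000000000001"))   #  smallest positive value, 2**-56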

Posits differ from floating-point numbers in that they have an additional variable-length regime section. This improves accuracy around zero, where the vast majority of numbers used in neural networks reside. IEEE/COMPLUTENSE UNIVERSITY OF MADRID
Using their new hardware implementation, which was synthesized on a field-programmable gate array (FPGA), the Complutense team compared computations done with 32-bit floats and 32-bit posits side by side, judging both against results obtained with the much more accurate but computationally expensive 64-bit floating-point format. Posits showed a four-order-of-magnitude improvement in the accuracy of matrix multiplication, a critical calculation in neural networks. The team also found that the gain in accuracy did not slow computation, though it did cost a bit more chip area and power.

Taking the Risk Out of RISC-V Math
Computer scientists in Switzerland and Italy have come up with a way to cut down on the number of bits their processors need for basic AI calculations. Their approach builds on the open-source RISC-V instruction-set architecture, which is increasingly popular with designers of new processors.

The team's RISC-V instruction-set extension provides an efficient version of computations that combine lower- and higher-precision number representations. With this improved mixed-precision math, they can speed up the basic calculations used in training neural networks by a factor of two.

Lowering precision has consequences during basic operations that go beyond the loss of accuracy from the shorter numbers themselves. When two low-precision numbers are multiplied, the result can be too small or too large to represent in the given bit length, phenomena known as underflow and overflow. And when a large low-precision number is added to a small one, a separate phenomenon called "swamping" occurs: the smaller number can be lost entirely.
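
All three effects are easy to reproduce with the 16-bit floating-point type in the NumPy library, used here simply as a convenient low-precision stand-in (NumPy may also print warnings for the first two operations):

import numpy as np

a = np.float16(300.0)
print(a * a)            # overflow: 90,000 exceeds float16's maximum (~65,504), giving inf

b = np.float16(1e-4)
print(b * b)            # underflow: 1e-8 is below float16's smallest value, giving 0.0

big, small = np.float16(2048.0), np.float16(1.0)
print(big + small)      # swamping: 2049 needs more significand bits than float16 has,
                        # so the result rounds back to 2048.0 and the 1.0 is lost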

In the cases of overflow, underflow, and swamping, mixed precision comes to the rescue. Computations start with low-precision inputs and accumulate into higher-precision outputs, letting a batch of math be done before the result is rounded back down.

The team from Europe concentrated on a fundamental component of AI computations: the dot product. It is typically implemented in hardware as a series of fused multiply-add units (FMAs). These do the operation d = a*b + c all at once, only rounding at the end. To take advantage of mixed precision, inputs a and b should be low precision (say, 8 bits), while c and the output d should be high precision (say, 16 bits).
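
In software, the benefit of that arrangement can be imitated along the following lines. The Python loop is only an illustration of the arithmetic; a real FMA unit performs the multiply and add as a single hardware operation.

import numpy as np

def naive_float16_dot(a, b):
    # Everything, including the running sum, kept in 16-bit precision.
    acc = np.float16(0.0)
    for x, y in zip(a.astype(np.float16), b.astype(np.float16)):
        acc = np.float16(acc + x * y)
    return acc

def mixed_precision_dot(a, b):
    # Low-precision (16-bit) inputs, but products are accumulated in 32 bits,
    # mimicking an FMA that rounds only at the end.
    acc = np.float32(0.0)
    for x, y in zip(a.astype(np.float16), b.astype(np.float16)):
        acc += np.float32(x) * np.float32(y)
    return acc

rng = np.random.default_rng(1)
a = rng.normal(size=10_000).astype(np.float32)
b = rng.normal(size=10_000).astype(np.float32)

print("all float16 :", naive_float16_dot(a, b))
print("mixed 16/32 :", mixed_precision_dot(a, b))
print("full float32:", np.dot(a, b))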

Luca Benini, chair of digital circuits and systems at ETH Zurich and a professor at the University of Bologna, and his team had a simple idea: why not perform two FMA operations in parallel and add their results at the end? This avoids losing data to rounding between the two FMAs and makes better use of memory, because no registers sit empty while the last FMA finishes.
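
A rough software analogue of that idea splits the vector between two higher-precision accumulators that run side by side and are combined only at the end. The even/odd split below is an assumption made for illustration; the team's actual datapath is more involved.

import numpy as np

def two_lane_dot(a, b):
    # Two float32 accumulators working "in parallel" on alternating elements,
    # with low-precision (float16) inputs and a single high-precision add at the end.
    a16, b16 = a.astype(np.float16), b.astype(np.float16)
    acc = [np.float32(0.0), np.float32(0.0)]
    for i, (x, y) in enumerate(zip(a16, b16)):
        acc[i % 2] += np.float32(x) * np.float32(y)
    return acc[0] + acc[1]

rng = np.random.default_rng(2)
a = rng.normal(size=10_000).astype(np.float32)
b = rng.normal(size=10_000).astype(np.float32)
print(two_lane_dot(a, b), "vs full float32:", np.dot(a, b))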

The group designed and simulated the parallel mixed-precision dot-product unit and found that computation time was nearly cut in half while output accuracy improved, especially for large vector dot products. They are now building the architecture in silicon to see whether the simulation's predictions hold.

Brain Float 16 Is All You Need
Engineers at the Barcelona Supercomputing Center and Intel claim that for deep neural-network training, "Brain Float 16 is all you need." And, no, they are not fans of the Beatles. They like the number format Brain Float, which was created by Google Brain.

The Brain Float (BF16) format is a simplified version of the standard 32-bit floating-point representation (FP32). FP32 has one sign bit, eight exponent bits, and 23 mantissa bits to represent numbers. A BF16 number is simply an FP32 number with 16 mantissa bits removed. In this way, BF16 gives up precision but uses only half as much memory as FP32 and keeps the same range of values.
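
Because a BF16 value is just the top half of an FP32 value, the conversion can be sketched in a few lines of Python. Real converters usually round to the nearest BF16 value; the version here simply truncates, as the description above suggests.

import numpy as np

def fp32_to_bf16_truncate(x):
    # Drop the low 16 bits of each float32 (the bottom 16 mantissa bits),
    # keeping the sign, the 8 exponent bits, and the top 7 mantissa bits.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.array([3.14159265, 0.1, 1.0e20], dtype=np.float32)
print(fp32_to_bf16_truncate(x))   # same huge range as FP32, but only ~3 decimal digits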

Brain floats are already widely used in AI training, including the DALL-E image creator. They have, however, always been used in conjunction with high-precision floating-point numbers, converting between the two for various parts of the computation. This is because, while some aspects of neural-network training are unaffected by BF16's reduced precision, others are. Small-magnitude values, in particular, can be lost during fused multiply-add computations.

The Barcelona and Intel teams worked around these issues and demonstrated that BF16 can be used exclusively to train cutting-edge AI models. In simulations, training this way took one-third of the chip area and one-third of the energy that FP32 required.

To perform AI training entirely in BF16 without fear of losing small values, the Barcelona team created an extended number format along with a custom FMA unit. The BF16-N format combines several BF16 numbers to represent one real number. Rather than defeating the purpose of using fewer-bit formats, combining BF16 numbers enables much more efficient FMA operations without sacrificing significant precision. The key to that counterintuitive result is the silicon area (and thus power) an FMA unit requires, which grows roughly as the square of the number of mantissa bits. With 23 mantissa bits (24 counting the implicit leading bit), an FMA for FP32 numbers would require 576 units of area. However, a clever FMA that produces comparable results using BF16-N, with N equal to 2, requires only 192 units.
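
One plausible way to picture BF16-N in software is to split each 32-bit value into a running sum of BF16 pieces, as in the sketch below. The decomposition and the function names are illustrative assumptions; the team's FMA hardware works on the components directly rather than reassembling them like this.

import numpy as np

def fp32_to_bf16(x):
    # Round each float32 to a nearby bfloat16 (round-half-up on the discarded
    # bits), stored as a float32 with the low 16 bits cleared so that ordinary
    # NumPy arithmetic still applies.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return ((bits + np.uint32(0x8000)) & np.uint32(0xFFFF0000)).view(np.float32)

def to_bf16_n(x, n=2):
    # Split each value into n BF16 components whose sum approximates the original.
    parts, residual = [], np.asarray(x, dtype=np.float32).copy()
    for _ in range(n):
        p = fp32_to_bf16(residual)
        parts.append(p)
        residual = residual - p      # whatever the components so far still miss
    return parts

x = np.array([3.14159265, -0.001234, 123.456], dtype=np.float32)
parts = to_bf16_n(x, n=2)
print("max error with one BF16:", np.abs(fp32_to_bf16(x) - x).max())
print("max error with BF16-2  :", np.abs(np.sum(parts, axis=0) - x).max())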

Three 16-bit Brain Float numbers can have the same precision as 32-bit floats while taking up less chip space during computations. BARCELONA/JOHN OSORIO
Eliminating the need for mixed precision and focusing solely on BF16 operations allows for much cheaper hardware, according to Marc Casas, a senior researcher at the Barcelona Supercomputing Center and the project's leader. For now, the team has only emulated its solution in software. It is working on implementing the Brain-Float-only method on an FPGA to see whether the performance gains hold up.
