I’m working on optimizing my machine learning models and I’ve come across two different 16-bit formats: BF16 and FP16. Can someone explain the key differences between them? I’m keen on:

Performance differences in training and inference

Precision and accuracy implications

Hardware compatibility and support

Use cases where one might be preferred over the other

Any insights or experiences with these formats would be really helpful.

During training, BFloat16 is more stable than FP16. Most of Google’s models use BFloat16 because BF16 is native on their TPUs. More LLMs are now being trained in BFloat16 because of this stability (see HuggingFace’s BigScience experiment, which highlighted improved stability). One advantage of BF16 over FP16 is that no gradient scaling is required.

Because FP16 and BF16 both use the same number of bits, memory use on the A100 GPU is the same and theoretical performance should match. In practice, performance still seems to depend on the underlying operators used, as BF16 support was only recently added to PyTorch (PyTorch Lightning debugging in progress here).

BFloat16 provides better stability during training compared to FP16. Many of Google’s models use BFloat16 because their TPUs natively support it. We’re now seeing more large language models (LLMs) being trained with BFloat16 because of its superior stability, as noted by HuggingFace in their BigScience project. A key advantage of BFloat16 is that it doesn’t require gradient scaling, which is usually needed with FP16.

For the A100 GPU, the theoretical performance of FP16 and BFloat16 is the same, as both use the same number of bits and should have similar memory usage. However, since BFloat16 is relatively new to PyTorch, its performance may still depend on the underlying operators being used (as ongoing debugging in PyTorch Lightning shows).

This blog post provides a good explanation of BFloat16 and why it’s favored in situations where stability is crucial.

Both consume exactly the same memory, as they encode each number in 16 bits.

On recent Nvidia GPUs (Ampere generation, like the A100 and RTX 3090), tensor cores accelerate both of them. On older ones (like a V100 or a T4), bfloat16 is not supported, so life is easier because you have no choice. Google TPUs have supported BF16 for quite some time. The difference between the two formats is in the number of bits used for the exponent and for the mantissa (see Wikipedia: bfloat16 floating-point format).

FP16 has 5 exponent bits, meaning it can encode numbers between roughly -65K and +65K. BF16 has 8 exponent bits, like FP32, meaning it can encode numbers approximately as large as FP32 can.
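The ranges above follow directly from the bit layouts. As a minimal sketch (the helper name `max_finite` and the `FORMATS` table are made up for illustration, and it assumes IEEE-style encoding where the all-ones exponent is reserved for inf/NaN), the largest finite value can be computed from the exponent and mantissa widths:

```python
# Bit layouts: (exponent bits, mantissa bits). Each format also has 1 sign bit.
FORMATS = {
    "FP16": (5, 10),
    "BF16": (8, 7),
    "FP32": (8, 23),
}

def max_finite(exp_bits: int, mant_bits: int) -> float:
    """Largest finite value for an IEEE-style format with the given widths."""
    bias = 2 ** (exp_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the largest usable
    # exponent field is 2**exp_bits - 2.
    max_exp = (2 ** exp_bits - 2) - bias
    # Largest mantissa is 1.111...1 in binary, i.e. 2 - 2**-mant_bits.
    return (2 - 2.0 ** -mant_bits) * 2.0 ** max_exp

for name, (e, m) in FORMATS.items():
    print(f"{name}: max finite value = {max_finite(e, m):.4g}")
```

The FP16 result is exactly 65504, which is the "65K" limit, while BF16 lands just under the FP32 maximum because it shares FP32's exponent width.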

During training in mixed precision, when values are too big to be encoded in FP16 (>65K or <-65K), there is a trick applied to rescale the gradient. However, it seems that on very large models (GPT-3 and the like), it makes the network unstable.
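The rescaling trick is usually called loss scaling. Here is a rough, framework-free sketch of the dynamic variant, assuming a single scalar gradient and using Python's `struct` module to emulate FP16 storage. The function names `to_fp16` and `scaled_step` are hypothetical, not taken from any library:

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value; out-of-range values become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range; hardware
        # would produce infinity instead, so emulate that here.
        return math.copysign(math.inf, x)

def scaled_step(true_grad: float, scale: float):
    """One step of dynamic loss scaling for a scalar gradient.

    Returns (unscaled gradient or None if the step is skipped, new scale).
    """
    g16 = to_fp16(true_grad * scale)       # gradient as it would be stored in FP16
    if math.isinf(g16) or math.isnan(g16):
        return None, scale / 2             # overflow: skip the update, shrink the scale
    return g16 / scale, scale              # unscale in higher precision

# A tiny gradient vanishes when stored directly in FP16...
print(to_fp16(1e-8))
# ...but survives when scaled before storage and unscaled afterwards.
print(scaled_step(1e-8, 65536.0))
# A large gradient overflows at that scale, so the step is skipped.
print(scaled_step(2.0, 65536.0))
```

Real implementations typically also grow the scale again after a run of successful steps. With BF16 the wide exponent range usually makes this bookkeeping unnecessary.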

BF16 is not perfect either, as it’s really less precise than FP32. One bad thing which may happen is that a value very close to 0 can’t be encoded and is rounded to 0 (same with FP16 but worse in BF16). It’s an issue when, for instance, you plan to divide something by this 0.

Another bad thing IRL is that a model trained in BF16 may contain large values, which may require work if you plan to run inference on hardware that doesn’t support BF16. It’s still doable. For instance, the T5 model from Google is known for requiring work to make it run in FP16.

More exponent bits mean you can also represent numbers closer to zero. BF16 can represent much smaller numbers than FP16 before rounding to zero. The smallest positive number BF16 can represent is about 9.18e-41, while the smallest for FP16 is about 5.96e-8 (both figures count subnormal values).
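These limits can be checked by decoding the smallest non-zero bit pattern in each format. The sketch below uses only the standard library and relies on BF16 being the top 16 bits of an FP32 value. The helper names are illustrative, and some hardware flushes subnormals to zero, in which case the smallest normal value is the practical limit:

```python
import struct

def fp16_from_bits(bits: int) -> float:
    """Decode a 16-bit pattern as IEEE half precision (FP16)."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def bf16_from_bits(bits: int) -> float:
    """Decode a 16-bit pattern as BF16 by padding it to an FP32 pattern."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

# Smallest positive (subnormal) values: only the lowest mantissa bit is set.
smallest_fp16 = fp16_from_bits(0x0001)
smallest_bf16 = bf16_from_bits(0x0001)

# Smallest positive normal values: exponent field = 1, mantissa = 0.
smallest_normal_fp16 = fp16_from_bits(0x0400)   # 5 exponent bits above 10 mantissa bits
smallest_normal_bf16 = bf16_from_bits(0x0080)   # 8 exponent bits above 7 mantissa bits

print(f"FP16 subnormal min: {smallest_fp16:.3g}, normal min: {smallest_normal_fp16:.3g}")
print(f"BF16 subnormal min: {smallest_bf16:.3g}, normal min: {smallest_normal_bf16:.3g}")
```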

You can encode small numbers, but with less precision, your values might overshoot if the gradient is too large or settle at zero instead of a small number. Hitting exactly zero can be problematic, as can missing the target with very small numbers, especially in sensitive networks. This issue is more noticeable when weights produce features that are summed up, leading to significant changes, or in deep networks like T5, where small errors can destabilize the entire system.

Transformers and recurrent networks are highly sensitive to such issues. BFloat16’s main weakness is its precision of only 2-3 decimal digits, which is often insufficient for training anything beyond fully connected and convolutional layers.
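To make the precision limit concrete, here is a small emulation using only the standard library. `to_bf16` rounds an FP32 bit pattern to its top 16 bits with round-to-nearest-even. It is a hypothetical helper for finite inputs, not a library function, and the accumulation loop is a toy stand-in for summing many small contributions:

```python
import struct

def to_bf16(x: float) -> float:
    """Round a finite x to the nearest BF16 value (round half to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # FP32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)                   # rounding bias
    top = (bits >> 16) & 0xFFFF                           # keep the top 16 bits
    return struct.unpack("<f", struct.pack("<I", top << 16))[0]

def to_fp16(x: float) -> float:
    """Round an in-range x to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# BF16 keeps 7 mantissa bits, so values near 1.0 are spaced 2**-7 apart.
x = 1.0 + 2.0 ** -8
print(to_bf16(x), to_fp16(x))   # BF16 collapses to 1.0, FP16 keeps the value

# Summing many small updates: each addition is rounded back to the format.
acc_bf16 = 1.0
acc_fp16 = 1.0
for _ in range(1000):
    acc_bf16 = to_bf16(acc_bf16 + to_bf16(0.001))
    acc_fp16 = to_fp16(acc_fp16 + to_fp16(0.001))

print(acc_bf16, acc_fp16)       # the BF16 accumulator never moves off 1.0
```

In this toy case every BF16 update is rounded away, which is one reason mixed-precision setups commonly keep accumulations and a master copy of the weights in FP32.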