What differences in model performance, speed, memory usage, etc., can I expect between choosing BF16 or FP16 for mixed precision training? Is BF16 faster and does it consume less memory, given it’s often touted as “more suitable for Deep Learning”? Why is that the case?
TL;DR: If you have the right hardware, use BF16.
Both BF16 and FP16 use 16 bits per number, consuming the same amount of memory.
Hardware Support

- Ampere GPUs and newer (e.g., NVIDIA A100): support tensor cores that boost both BF16 and FP16 performance.
- Older GPUs (e.g., V100, T4): do not support BF16, so FP16 is your only option.
- Google TPUs have supported BF16 for quite some time.
Differences Between FP16 and BF16

FP16:
- 5 bits for the exponent
- Can encode numbers between -65K and +65K (the exact maximum is 65,504)

BF16:
- 8 bits for the exponent (same as FP32)
- Can encode approximately the same range of numbers as FP32
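To make the layouts concrete, here is a small sketch (plain Python, no ML framework assumed) that derives each format's limits directly from its exponent and mantissa widths:

```python
# Sketch: deriving range and precision of FP16 vs BF16 from their bit layouts.
# Both are 16-bit: 1 sign bit; only the exponent/mantissa split differs.

def fp_limits(exp_bits: int, man_bits: int):
    """Return (max_value, smallest_normal, machine_epsilon) for an
    IEEE-style binary format with the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** bias   # largest finite value
    smallest_normal = 2.0 ** (1 - bias)              # smallest normal value
    eps = 2.0 ** -man_bits                           # gap between 1.0 and the next value
    return max_val, smallest_normal, eps

# FP16: 5 exponent bits, 10 mantissa bits
print("FP16:", fp_limits(5, 10))   # max is exactly 65504.0
# BF16: 8 exponent bits (same as FP32), 7 mantissa bits
print("BF16:", fp_limits(8, 7))    # max near 3.39e38, but eps is 8x coarser
```

So BF16 trades 3 mantissa bits for 3 extra exponent bits: vastly more range, noticeably less precision.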
Training in Mixed Precision

FP16:
- Uses rescaling tricks (loss scaling) when values exceed its range (> +65K or < -65K)
- Can cause instability in very large models (like GPT-3)
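The rescaling trick can be sketched without any framework: scale the loss (and therefore the gradients) up so that small gradients survive the cast to FP16, then unscale in full precision before the optimizer step. The helper below is a toy illustration using the `struct` module's half-precision codec, not any real AMP implementation:

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value
    by round-tripping through struct's 'e' (binary16) format."""
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 1e-8                      # a gradient too small for FP16
print(to_fp16(grad))             # 0.0: underflowed, the update would be lost

scale = 2.0 ** 16                # loss scale; gradients are scaled by the same factor
scaled = to_fp16(grad * scale)   # about 6.55e-4, comfortably representable in FP16
print(scaled / scale)            # unscale in FP32 before the optimizer step
```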

BF16:
- Less precise than FP32
- May round very small values to 0, which can cause issues, e.g., dividing by zero
- Can be problematic if your model contains large values and you need to perform inference on hardware that doesn't support BF16
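A minimal sketch of BF16's precision loss, emulating bfloat16 by truncating an FP32 bit pattern to its top 16 bits (a simplification of real hardware's round-to-nearest, but close enough to show the effect):

```python
import struct

def to_bf16(x: float) -> float:
    """Approximate bfloat16 rounding: encode as FP32, keep the top 16 bits."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    return struct.unpack('>f', struct.pack('>I', bits & 0xFFFF0000))[0]

print(to_bf16(1e-3))        # survives: well inside BF16's FP32-like range
print(to_bf16(1.0 + 1e-3))  # 1.0: the small offset is below BF16's precision

# Division hazard: a tiny difference that BF16 cannot resolve becomes exactly 0
x = to_bf16(1.0 + 1e-3) - 1.0
# x == 0.0 here, so 1 / x would raise ZeroDivisionError
```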
Practical Considerations
- Inference on Unsupported Hardware:
  - Models with large values in BF16 may require additional work to run inference on hardware that only supports FP16.
  - Example: Google's T5 model requires adjustments to work in FP16.
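As a hedged illustration of that kind of adjustment (the helper name and the toy weights below are hypothetical, not T5's actual fix), a common first step is scanning a checkpoint for values that would overflow FP16's finite range before casting:

```python
# FP16's largest finite value; anything beyond it becomes inf after the cast.
FP16_MAX = 65504.0

def fp16_unsafe(tensors):
    """tensors: mapping of parameter name -> iterable of float values.
    Returns the names of parameters that would overflow in FP16."""
    return [name for name, vals in tensors.items()
            if any(abs(v) > FP16_MAX for v in vals)]

weights = {"ffn.w": [1.2, -300000.0], "attn.q": [0.5, 3.0]}  # toy values
print(fp16_unsafe(weights))   # only 'ffn.w' would overflow in FP16
```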
Use BF16 if your hardware supports it, but be aware of the potential precision issues and compatibility challenges.
One undesirable outcome is that a number extremely close to 0 is rounded to 0 instead of being encoded (the same happens with FP16, but it is worse in BF16).
What? It's the other way around: more exponent bits also let a format represent numbers closer to zero. Before rounding to zero, BF16 can represent much smaller numbers than FP16. The smallest positive FP16 value is about 5.96e-8, while the smallest positive BF16 value is about 9.18e-41.
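These figures can be checked from the bit layouts alone: the smallest positive subnormal of an IEEE-style format is 2^(1 - bias - mantissa_bits).

```python
def smallest_subnormal(exp_bits: int, man_bits: int) -> float:
    """Smallest positive (subnormal) value of an IEEE-style binary format."""
    bias = 2 ** (exp_bits - 1) - 1
    return 2.0 ** (1 - bias - man_bits)

print(smallest_subnormal(5, 10))  # FP16: 2**-24, about 5.96e-08
print(smallest_subnormal(8, 7))   # BF16: 2**-133, about 9.18e-41
```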
Both brain float (BF16) and 16-bit floating point (FP16) require two bytes of memory, but BF16 can represent a far greater numerical range than FP16, which reduces the likelihood of underflow or overflow. The price paid for this is less precision.