After batch normalization, we are aiming for unit Gaussian outputs. While initializing data with a unit Gaussian is a common practice, how does applying this normalization throughout the network make sense?
Batch normalization helps keep the network training stable and speeds up convergence. By normalizing the inputs to each layer, it ensures that they have a consistent mean and variance, which helps prevent issues like exploding or vanishing gradients. This makes training easier and more efficient, as each layer deals with data that’s on a similar scale, leading to faster and more stable learning.