Summary of BatchNorm paper

Paper

Achievements

Key Contributions

Internal Covariate Shift

When the input distribution to a learning system changes, this is known as covariate shift. The concept can be extended beyond the learning system as a whole to its parts, such as a sub-network or a layer. The change in the distribution of sub-network activations (i.e. the inputs to a layer) caused by updates to the parameters of the preceding layers is called internal covariate shift. It slows down deep neural network training by requiring:

- lower learning rates, and
- careful parameter initialization.

It also makes it notoriously hard to train models with saturating nonlinearities, since activations drift into the saturated regimes.
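To make the idea concrete, here is a small NumPy sketch (mine, not from the paper) showing how the distribution of a hidden layer's inputs drifts as the preceding layer's weights change, even though the network's own input distribution stays fixed. The random weight perturbation stands in for a gradient update:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 10))              # fixed input distribution to the network

W1 = rng.normal(scale=0.5, size=(10, 10))    # weights of the first layer
for step in range(3):
    h = np.maximum(x @ W1, 0.0)              # activations feeding the next layer
    print(f"step {step}: mean={h.mean():.3f}, std={h.std():.3f}")
    W1 += rng.normal(scale=0.5, size=W1.shape)   # stand-in for a parameter update
```

Even though `x` never changes, the statistics of `h` (the inputs seen by the next layer) shift after every update, so that layer keeps adapting to a moving target.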

BatchNorm

To improve the training process, we seek to reduce internal covariate shift. The authors propose a mechanism called Batch Normalization (BatchNorm), which reduces internal covariate shift and dramatically accelerates the training of deep neural networks. It does so via a normalization step that fixes the means and variances of each layer's inputs, computed over each mini-batch, and then applies a learned per-activation scale $\gamma$ and shift $\beta$ so that the layer's representational power is preserved. These parameters are learned by gradient descent along with the rest of the model. BatchNorm also has a beneficial effect on gradient flow through the network, by reducing the dependence of gradients on the scale of the parameters or of their initial values.
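As a rough illustration (a sketch, not the authors' reference implementation), the training-time BatchNorm transform for a mini-batch of fully-connected activations can be written in NumPy as follows; `eps` is the usual small constant added for numerical stability:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-time Batch Normalization over a mini-batch.

    x:     (N, D) mini-batch of N examples with D features
    gamma: (D,) learned per-feature scale
    beta:  (D,) learned per-feature shift
    """
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta             # scale and shift restore expressiveness

# Example: a random mini-batch of 32 examples with 4 features
x = np.random.randn(32, 4) * 5.0 + 3.0
gamma, beta = np.ones(4), np.zeros(4)
y = batchnorm_forward(x, gamma, beta)
print(y.mean(axis=0), y.std(axis=0))        # approximately 0 and 1 per feature
```

At inference time the paper replaces the mini-batch statistics with population (moving-average) estimates, so the output becomes a deterministic function of the input.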

Advantages of BatchNorm include:

- it permits much higher learning rates without the risk of divergence,
- it makes training far less sensitive to parameter initialization,
- it allows the use of saturating nonlinearities by keeping activations out of the saturated regimes, and
- it acts as a regularizer, in some cases reducing or eliminating the need for Dropout.

Experiments

Various experiments were conducted to understand the accelerating effect of BatchNorm on deep neural networks. The experiments used single-network classification models (Inception variants) trained on the ImageNet 2012 training set and evaluated on the validation set.


An ensemble of six BN-x30-based classification networks improved over the previous best published results on ImageNet classification, reaching 4.9% top-5 validation error.

The goal of BatchNorm is to achieve a stable distribution of activation values throughout training. The authors applied it before the nonlinearity, since that is where fixing the first and second moments is more likely to result in a stable distribution.
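For concreteness, a minimal PyTorch-style sketch of this pre-activation placement (layer sizes are arbitrary, chosen for illustration):

```python
import torch.nn as nn

# Placement z = g(BN(Wu)): normalize the linear output, then apply the nonlinearity.
# The layer bias is omitted because BatchNorm's learned shift (beta) subsumes it.
block = nn.Sequential(
    nn.Linear(256, 128, bias=False),  # Wu
    nn.BatchNorm1d(128),              # BN, applied before the nonlinearity
    nn.ReLU(),                        # g(.)
)
```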