by Sergey Ioffe and Christian Szegedy
SGD (Stochastic Gradient Descent), in its mini-batch form, estimates the gradient of the loss from a mini-batch of examples rather than from one example at a time
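A minimal sketch of mini-batch SGD on a toy least-squares problem (the data, learning rate, and batch size here are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy problem: linear regression with known true weights.
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=1000)

def mse(w):
    # Full-dataset loss, used only to monitor progress.
    return np.mean((X @ w - y) ** 2)

w = np.zeros(3)
lr, batch_size = 0.1, 32
for step in range(200):
    idx = rng.choice(len(y), size=batch_size, replace=False)  # draw a mini-batch
    Xb, yb = X[idx], y[idx]
    g = 2 * Xb.T @ (Xb @ w - yb) / batch_size  # gradient estimated on the mini-batch
    w -= lr * g                                # SGD update
print(mse(np.zeros(3)), mse(w))  # loss drops substantially after training
```

Each mini-batch gradient is a noisy but cheap estimate of the full-dataset gradient, which is the trade-off the paper builds on.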
Problem with SGD: the distribution of each layer's inputs changes with every parameter update ⇒ the model must continually adapt to a new input distribution
Internal Covariate Shift: change in the distribution of network activations due to changes in network parameters during training
Problem with adding normalization directly to the optimization step:
Example (from the paper): a layer adds a learned bias, x = u + b, and is normalized by subtracting the batch mean, x̂ = x − E[x]. If the gradient descent step ignores the dependence of E[x] on b, it updates b ← b + ∆b, where ∆b ∝ −∂ℓ/∂x̂. Then u + (b + ∆b) − E[u + (b + ∆b)] = u + b − E[u + b]: the normalized output (and hence the loss) is unchanged, while b grows without bound.
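A quick numeric check of the identity above (the input u, bias b, and update ∆b are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=8)   # layer input, held fixed for the demo
b = 0.5                  # learned bias
delta_b = 3.0            # any update computed while ignoring E[x]'s dependence on b

def normalized(u, b):
    x = u + b
    return x - x.mean()  # x̂ = x − E[x], mean taken over the batch

before = normalized(u, b)
after = normalized(u, b + delta_b)
print(np.allclose(before, after))  # → True: the mean subtraction cancels ∆b exactly
```

Since any constant added to b is removed again by the mean subtraction, the loss never feels the update and b can blow up.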
We must ensure that the network always produces activations with the desired distribution, and that gradient descent optimization takes the normalization into account
The Batch Normalization transform (BN) normalizes each activation and then applies learned scale and shift parameters γ, β: y = γx̂ + β. This lets the network recover the original activations when that is optimal (setting γ = √Var[x], β = E[x] gives the identity), so normalization does not restrict what a layer can represent (important e.g. for the input to an activation function)
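A sketch of the BN transform over a mini-batch, including the check that γ = √Var[x], β = E[x] approximately recovers the identity (the input data here is made up for illustration):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """BN transform: normalize per feature over the batch, then scale and shift.

    x: (batch, features); gamma, beta: (features,) learned parameters.
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta            # y = γ·x̂ + β

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 4))
# With γ = √Var[x] and β = E[x], BN approximately recovers x (identity transform):
y = batch_norm(x, gamma=np.sqrt(x.var(axis=0)), beta=x.mean(axis=0))
print(np.allclose(y, x, atol=1e-3))  # → True (exact up to the ε in the denominator)
```

The ε term keeps the division numerically stable when a feature's batch variance is near zero.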