by Sergey Ioffe and Christian Szegedy


Introduction

SGD (Stochastic Gradient Descent) here estimates the gradient of the loss from a mini-batch of m examples, (1/m) Σᵢ ∂ℓ(xᵢ, Θ)/∂Θ, rather than from one example at a time
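To make the estimate concrete, a minimal sketch with a toy least-squares loss (the function name, data, and learning rate are illustrative assumptions, not from the paper):

```python
import numpy as np

# Toy least-squares loss ℓ(x, y; θ) = ½(θ·x − y)²; all names here are
# illustrative, not from the paper.
def minibatch_gradient(theta, X, y):
    residuals = X @ theta - y        # θ·xᵢ − yᵢ for each example in the batch
    return X.T @ residuals / len(y)  # (1/m) Σᵢ ∂ℓ(xᵢ; θ)/∂θ

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))         # one mini-batch of m = 32 examples
y = rng.normal(size=32)
theta = np.zeros(3)
theta -= 0.1 * minibatch_gradient(theta, X, y)  # one SGD step
```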

Problem with SGD: the distribution of each layer's inputs changes after every update ⇒ the model must continually adapt to a new input distribution


Towards Reducing Internal Covariate Shift

Internal Covariate Shift: change in the distribution of network activations due to changes in network parameters during training

Problem with adding the normalization directly to the optimization step:

Consider a layer that adds a learned bias b to its input u and normalizes the result by subtracting its mean computed over the training data: x̂ = x − E[x], where x = u + b. If a gradient descent step ignores the dependence of E[x] on b, it updates b ← b + ∆b, where ∆b ∝ −∂ℓ/∂x̂. Then u + (b + ∆b) − E[u + (b + ∆b)] = u + b − E[u + b], i.e. the output of the layer is unchanged:

  1. This means that b grows indefinitely while the output and the loss stay fixed
  2. the model eventually blows up, since gradient descent keeps updating b without any effect on the loss
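A tiny numeric check of the example above (the toy layer and names are illustrative assumptions):

```python
import numpy as np

# Toy version of the example: a layer computes x = u + b and then
# normalizes by subtracting the mean.
rng = np.random.default_rng(0)
u = rng.normal(size=5)               # fixed layer input
b = 0.3                              # learned bias

def normalized_output(u, b):
    x = u + b
    return x - x.mean()              # x̂ = x − E[x]

# Even a huge update to b leaves the normalized output (and hence the
# loss) unchanged, so nothing stops b from drifting forever.
print(np.allclose(normalized_output(u, b),
                  normalized_output(u, b + 100.0)))  # True
```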

We must ensure that the network always produces activations with the desired distribution, and that the gradient descent optimization takes into account that the normalization takes place (i.e. we backpropagate through the normalization itself)
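A sketch of why differentiating through the normalization fixes the bias problem (same illustrative toy layer as above, assumed for the example):

```python
import numpy as np

# With the normalization treated as part of the model, differentiating
# *through* the mean gives ∂x̂ᵢ/∂b = 1 − 1 = 0, so gradient descent
# correctly stops touching b. Numerical check via finite differences:
rng = np.random.default_rng(0)
u = rng.normal(size=5)

def normalized_output(u, b):
    return (u + b) - (u + b).mean()  # x̂ = x − E[x], with x = u + b

h = 1e-6
grad = (normalized_output(u, 0.3 + h) - normalized_output(u, 0.3 - h)) / (2 * h)
print(np.allclose(grad, 0.0))        # True — b no longer grows without bound
```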


Normalization via Mini-Batch Statistics

The Batch Normalization Transform (BN) normalizes each activation using mini-batch statistics and then applies a learned scale γ and shift β: y = γ·x̂ + β. Because γ and β are learned, BN can rescale the normalized input back to the original activations when that is what the network needs, so normalization does not restrict what a layer can represent (e.g. it would otherwise pin a sigmoid's inputs to its linear regime)

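A minimal sketch of the BN transform in training mode, following the paper's formulas; the function name batch_norm and ε = 1e-5 are assumptions for illustration:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """BN transform over a mini-batch x of shape (m, d), training mode."""
    mu = x.mean(axis=0)                    # mini-batch mean      μ_B
    var = x.var(axis=0)                    # mini-batch variance  σ²_B
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift: y = γ·x̂ + β

# Setting γ = √(σ²_B + ε) and β = μ_B recovers the original activations,
# which is exactly the "rescale back when needed" escape hatch:
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(64, 4))
gamma = np.sqrt(x.var(axis=0) + 1e-5)
beta = x.mean(axis=0)
print(np.allclose(batch_norm(x, gamma, beta), x))  # True
```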