by Sergey Ioffe and Christian Szegedy
SGD (Stochastic Gradient Descent), in its mini-batch form, estimates the gradient of the loss from a mini-batch of examples rather than from one example at a time
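A minimal sketch of mini-batch SGD on a toy least-squares problem (the data, learning rate, and batch size here are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy problem: linear regression with known true weights.
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=1000)

def mse(w):
    # Full-dataset loss, used only to monitor progress.
    return np.mean((X @ w - y) ** 2)

w = np.zeros(3)
lr, batch_size = 0.1, 32
for step in range(200):
    idx = rng.choice(len(y), size=batch_size, replace=False)  # draw a mini-batch
    Xb, yb = X[idx], y[idx]
    g = 2 * Xb.T @ (Xb @ w - yb) / batch_size  # gradient estimated on the mini-batch
    w -= lr * g                                # SGD update
print(mse(np.zeros(3)), mse(w))  # loss drops substantially after training
```

Each mini-batch gradient is a noisy but cheap estimate of the full-dataset gradient, which is the trade-off the paper builds on.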
Problem with SGD: the distribution of each layer's inputs changes with every parameter update ⇒ the model must continually adapt to a new input distribution
Internal Covariate Shift: change in the distribution of network activations due to changes in network parameters during training
Problem with adding normalization directly to the optimization step:
Example (from the paper): a layer adds a learned bias, x = u + b, and is normalized by subtracting the batch mean, x̂ = x − E[x]. If the gradient descent step ignores the dependence of E[x] on b, it updates b ← b + ∆b, where ∆b ∝ −∂ℓ/∂x̂. Then u + (b + ∆b) − E[u + (b + ∆b)] = u + b − E[u + b]: the normalized output (and hence the loss) is unchanged, while b grows without bound.
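A quick numeric check of the identity above (the input u, bias b, and update ∆b are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=8)   # layer input, held fixed for the demo
b = 0.5                  # learned bias
delta_b = 3.0            # any update computed while ignoring E[x]'s dependence on b

def normalized(u, b):
    x = u + b
    return x - x.mean()  # x̂ = x − E[x], mean taken over the batch

before = normalized(u, b)
after = normalized(u, b + delta_b)
print(np.allclose(before, after))  # → True: the mean subtraction cancels ∆b exactly
```

Since any constant added to b is removed again by the mean subtraction, the loss never feels the update and b can blow up.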
We must ensure that the network always produces activations with the desired distribution, and that gradient descent optimization takes the normalization into account
The Batch Normalization transform (BN) normalizes each activation and then applies learned scale and shift parameters γ, β: y = γx̂ + β. This lets the network recover the original activations when that is optimal (setting γ = √Var[x], β = E[x] gives the identity), so normalization does not restrict what a layer can represent (important e.g. for the input to an activation function)
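A sketch of the BN transform over a mini-batch, including the check that γ = √Var[x], β = E[x] approximately recovers the identity (the input data here is made up for illustration):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """BN transform: normalize per feature over the batch, then scale and shift.

    x: (batch, features); gamma, beta: (features,) learned parameters.
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta            # y = γ·x̂ + β

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 4))
# With γ = √Var[x] and β = E[x], BN approximately recovers x (identity transform):
y = batch_norm(x, gamma=np.sqrt(x.var(axis=0)), beta=x.mean(axis=0))
print(np.allclose(y, x, atol=1e-3))  # → True (exact up to the ε in the denominator)
```

The ε term keeps the division numerically stable when a feature's batch variance is near zero.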