ImageNet Classification with Deep Convolutional Neural Networks, by Alex Krizhevsky et al.

https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf


The Architecture

Section 3 from the paper

ReLU Nonlinearity

ReLUs reach a 0.25 training error rate in far fewer epochs than saturating non-linear activation functions such as tanh(x) or the sigmoid function.
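A quick NumPy comparison (my own illustration, not code from the paper) of why the non-saturating ReLU keeps learning fast: tanh flattens out for large inputs, so its derivative collapses toward zero, while the ReLU's derivative stays at 1 for any positive input.

```python
import numpy as np

def relu(x):
    # Non-saturating: grows linearly for every positive input.
    return np.maximum(0.0, x)

x = np.array([-4.0, -1.0, 0.5, 2.0, 4.0])
print(relu(x))               # [0.  0.  0.5 2.  4. ]
print(np.tanh(x))            # flattens out near -1/+1 once |x| > ~2
# tanh'(x) = 1 - tanh(x)^2 shrinks toward 0 as tanh saturates,
# while relu'(x) = 1 wherever the unit is active.
print(1.0 - np.tanh(x) ** 2)
print((x > 0).astype(float))
```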

ReLUs also help avoid problems such as the vanishing gradient problem.

A problem with training networks with many layers (i.e. deep neural networks) is that the gradient shrinks dramatically as it is propagated backward through the network. By the time the error signal reaches the layers close to the input, it may be so small that those layers barely update.
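To make that concrete, here is a small NumPy sketch (my own, under the simplifying assumption that the backward pass just multiplies one activation derivative per layer) showing how a product of tanh derivatives collapses over 30 layers, while active ReLU units pass the gradient through at full strength.

```python
import numpy as np

rng = np.random.default_rng(0)
depth = 30
# One pre-activation per layer; the chain rule multiplies the local
# derivatives together on the way back to the early layers.
z = rng.normal(0.0, 2.0, size=depth)

tanh_grads = 1.0 - np.tanh(z) ** 2   # always <= 1, tiny once tanh saturates
print(np.prod(tanh_grads))           # many orders of magnitude below 1: the signal all but vanishes

relu_grads = np.where(z > 0, 1.0, 0.0)
print(np.prod(relu_grads[z > 0]))    # 1.0: active ReLU units pass the gradient through unchanged
```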


Training on Multiple GPUs

Spreading the network across two GPUs lets the communication between kernels be precisely tuned: half of the kernels live on each GPU, and the GPUs exchange data only at certain layers.

The kernels of Layer #3 take input from all of Layer #2's kernel maps (on both GPUs), while the kernels of Layer #4 take input only from the Layer #3 kernel maps residing on the same GPU.
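A rough way to see this connectivity pattern on a single device is with grouped convolutions, where groups=2 mimics the two-GPU split and groups=1 mimics the one layer where the GPUs exchange activations (a PyTorch sketch of my own, not the paper's actual two-GPU code):

```python
import torch
import torch.nn as nn

# groups=2: each half of the output kernels sees only the half of the input
# maps that lived on the same GPU in the paper (no cross-GPU communication).
conv2 = nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2)

# groups=1: every layer-3 kernel sees all of layer 2's kernel maps, i.e. the
# point where the two GPUs communicate.
conv3 = nn.Conv2d(256, 384, kernel_size=3, padding=1, groups=1)

# groups=2 again: layer 4 only reads the layer-3 maps on its own half.
conv4 = nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=2)

x = torch.randn(1, 96, 27, 27)          # dummy layer-2 input
print(conv4(conv3(conv2(x))).shape)     # torch.Size([1, 384, 27, 27])
```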

  1. eight layers with weights
    1. the first five are convolutional and the remaining three are fully-connected
    2. the output of the last fully-connected layer is fed to a 1000-way softmax that produces a distribution over the 1000 class labels
  2. the network maximizes the multinomial logistic regression objective, i.e. the average log-probability of the correct label under the predicted distribution (see the sketch after this list)
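Below is a compact single-GPU sketch of that layout (my own PyTorch approximation, not the paper's two-GPU implementation; it omits local response normalization and dropout and assumes commonly used padding values, but keeps the paper's kernel counts of 96/256/384/384/256 and the 4096/4096/1000 fully-connected sizes). Minimizing cross-entropy here is equivalent to maximizing the multinomial log-likelihood of the correct labels.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Five convolutional layers followed by three fully-connected layers."""

    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),   # conv1
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2),            # conv2
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),           # conv3
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),           # conv4
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),           # conv5
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096),                            # fc6
            nn.ReLU(inplace=True),
            nn.Linear(4096, 4096),                                   # fc7
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),                            # fc8 -> 1000-way logits
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

model = AlexNetSketch()
images = torch.randn(2, 3, 224, 224)     # dummy batch of 224x224 RGB images
logits = model(images)                   # shape: (2, 1000)
labels = torch.randint(0, 1000, (2,))
# CrossEntropyLoss = softmax + negative log-likelihood, so minimizing it
# maximizes the average log-probability of the correct label.
loss = nn.CrossEntropyLoss()(logits, labels)
print(logits.shape, loss.item())
```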