https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
Section 3 from the paper
ReLUs reach a 0.25 training error rate in far fewer epochs than saturating non-linear activation functions such as tanh(x) or the sigmoid function.
ReLUs also help to avoid the vanishing gradient problem.
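As an illustration (not from the paper), the NumPy sketch below compares the gradients of ReLU, tanh, and the sigmoid: ReLU's gradient stays at 1 for any positive input, while the saturating functions' gradients collapse toward 0 once |x| grows, which is what slows their training.

```python
# Minimal NumPy sketch (not from the paper): gradients of ReLU vs. the
# saturating non-linearities.
import numpy as np

def relu_grad(x):
    return (x > 0).astype(float)      # 1 for positive inputs, 0 otherwise

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2      # -> 0 for large |x|

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)              # at most 0.25, -> 0 for large |x|

x = np.array([-5.0, -1.0, 0.5, 5.0])
print("relu'   :", relu_grad(x))      # [0. 0. 1. 1.]
print("tanh'   :", tanh_grad(x))      # ~[0.00018 0.42 0.79 0.00018]
print("sigmoid':", sigmoid_grad(x))   # ~[0.0066  0.20 0.24 0.0066]
```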
A problem with training networks with many layers (e.g. deep neural networks) is that the gradient shrinks dramatically as it is propagated backward through the network. By the time it reaches the layers close to the input, the error signal can be so small that those layers barely update.
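A minimal NumPy sketch of this compounding effect, not from the paper: backpropagating a unit gradient through 20 random fully connected layers with sigmoid activations leaves a far smaller gradient at the input side than the same stack with ReLU activations. The depth, width, and weight scaling are arbitrary choices made only for illustration.

```python
# Sketch: how the gradient magnitude decays through a deep stack of
# saturating layers compared with ReLU layers.
import numpy as np

rng = np.random.default_rng(0)
depth, width = 20, 64

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(pre):
    s = sigmoid(pre)
    return s * (1.0 - s)                        # at most 0.25 per unit

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(pre):
    return (pre > 0).astype(float)              # 1 where the unit is active, else 0

def backprop_norm(act, act_grad):
    """Forward a random input through `depth` random layers, then
    backpropagate a unit gradient and return its norm at the input side."""
    x = rng.standard_normal(width)
    weights, pre_acts = [], []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        pre = W @ x
        x = act(pre)
        weights.append(W)
        pre_acts.append(pre)
    grad = np.ones(width)                       # gradient arriving at the output
    for W, pre in zip(reversed(weights), reversed(pre_acts)):
        grad = W.T @ (grad * act_grad(pre))     # chain rule through one layer
    return np.linalg.norm(grad)

print("gradient norm, 20 sigmoid layers:", backprop_norm(sigmoid, sigmoid_grad))
print("gradient norm, 20 ReLU layers:   ", backprop_norm(relu, relu_grad))
```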
Spreading the network across two GPUs lets the communication pattern between kernels be tuned precisely: some layers take input from the kernel maps on both GPUs, others only from kernel maps on the same GPU.
For example, the kernels of Layer #3 take input from all kernel maps in Layer #2 (on both GPUs), while the kernels of Layer #4 take input only from the Layer #3 kernel maps that reside on the same GPU.
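A minimal single-device sketch of this connectivity pattern (assuming PyTorch, which the paper does not use): grouped convolutions with groups=2 restrict each half of a layer's kernels to the half of the input maps produced on the "same GPU", while groups=1 lets a layer see the maps from both halves. The channel counts follow the paper's layers 2-5; this mimics the connectivity only, not the actual two-GPU model parallelism.

```python
# Sketch: reproducing the restricted-connectivity pattern with grouped convolutions.
import torch
import torch.nn as nn

# Channel counts follow the paper's layers 2-5 (256, 384, 384, 256 kernels).
conv3 = nn.Conv2d(256, 384, kernel_size=3, padding=1, groups=1)  # layer 3: input from both GPUs
conv4 = nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=2)  # layer 4: same-GPU input only
conv5 = nn.Conv2d(384, 256, kernel_size=3, padding=1, groups=2)  # layer 5: same-GPU input only

x = torch.randn(1, 256, 13, 13)    # a layer-2 output at the paper's 13x13 spatial size
y = torch.relu(conv5(torch.relu(conv4(torch.relu(conv3(x))))))
print(y.shape)                     # torch.Size([1, 256, 13, 13])
```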