Hidden Layer Activation Functions
This blog introduces the three activation functions most commonly used in hidden layers: the Rectified Linear Unit (ReLU), the logistic (sigmoid) function, and the hyperbolic tangent (tanh).
Sigmoid
The sigmoid activation function, σ(z) = 1/(1+exp(-z)), is also called the logistic function. It has a well-defined nonzero derivative everywhere, allowing Gradient Descent to make some progress at every step. The function takes any real value as input and outputs values in the range 0 to 1: the larger (more positive) the input, the closer the output is to 1.0, and the smaller (more negative) the input, the closer the output is to 0.0.
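Here is a minimal NumPy sketch of the sigmoid and its derivative (the function names are my own, chosen only for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative sigma(z) * (1 - sigma(z)): nonzero everywhere, peaking at 0.25 when z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Large positive inputs approach 1.0, large negative inputs approach 0.0.
print(sigmoid(np.array([-6.0, 0.0, 6.0])))       # ~[0.0025, 0.5, 0.9975]
print(sigmoid_grad(np.array([-6.0, 0.0, 6.0])))  # ~[0.0025, 0.25, 0.0025]
```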
Tanh
The hyperbolic tangent function, tanh(z) = 2σ(2z) - 1, is S-shaped, continuous, and differentiable, but its output ranges from -1 to 1 rather than from 0 to 1 as with the logistic function. That range tends to make each layer's output more or less centered around 0 at the beginning of training, which often helps speed up convergence.
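NumPy already provides tanh; the small sketch below also checks the 2σ(2z) - 1 identity numerically (purely illustrative):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-3.0, 3.0, 7)

# tanh squashes inputs into (-1, 1) and is centered on 0.
print(np.tanh(z))

# Numerically confirm the identity tanh(z) = 2*sigmoid(2*z) - 1.
print(np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0))  # True
```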
ReLU
The ReLU (Rectified Linear Unit) function, ReLU(z) = max(0, z), is perhaps the most common activation function for hidden layers. It is popular because it is both simple to implement and effective at overcoming the limitations of other popular activation functions such as sigmoid and tanh. In particular, it is less susceptible to the vanishing gradients that prevent deep models from being trained, although it can suffer from other problems such as saturated or "dead" units.
The ReLU function is continuous but, unfortunately, not differentiable at z = 0. In practice, however, it works very well and has the advantage of being fast to compute, so it has become the default choice for hidden layers. Most importantly, the fact that it has no maximum output value means it does not saturate for large positive inputs, which helps reduce some issues during Gradient Descent.
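A minimal sketch of ReLU and its gradient; returning 0 for the gradient at z = 0 is a common implementation convention (an assumption here, not something the definition dictates):

```python
import numpy as np

def relu(z):
    """ReLU(z) = max(0, z): cheap to compute and unbounded above."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """Gradient is 1 for z > 0 and 0 for z < 0; at z = 0 it is undefined,
    so implementations typically just use 0 there."""
    return (z > 0).astype(float)

print(relu(np.array([-2.0, 0.0, 3.0])))       # [0. 0. 3.]
print(relu_grad(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 1.]
```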
How to Choose
Both the sigmoid and tanh functions saturate for large positive or negative inputs, which can make models more susceptible to the so-called vanishing gradients problem during training, as the sketch below illustrates.
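As a rough illustration (it ignores the weight terms that also enter the chain rule, and the depth of 10 is an arbitrary choice), note that the sigmoid derivative never exceeds 0.25, so backpropagating through many sigmoid layers multiplies many small factors together:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_grad = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

# Even at z = 0, where the sigmoid derivative peaks at 0.25,
# chaining 10 such layers scales the gradient signal by 0.25**10.
grad = 1.0
for _ in range(10):
    grad *= sigmoid_grad(0.0)
print(grad)  # ~9.5e-07
```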
The activation function used in hidden layers is typically chosen based on the type of neural network architecture. Modern models built on common architectures, i.e., the MLP (Multilayer Perceptron) and the CNN (Convolutional Neural Network), will generally use the ReLU activation function or one of its extensions. RNNs (Recurrent Neural Networks) commonly use the tanh or sigmoid activation functions, or even both. The LSTM (Long Short-Term Memory) network commonly uses the sigmoid activation for its recurrent connections and the tanh activation for its output.
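As a concrete sketch of these conventions (assuming TensorFlow/Keras is available; the layer sizes and input shapes are arbitrary choices for illustration):

```python
import tensorflow as tf
from tensorflow.keras import layers

# MLP: ReLU in the hidden layers.
mlp = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1),
])

# Simple RNN: tanh is the usual (and default) hidden activation.
rnn = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 8)),   # (timesteps, features)
    layers.SimpleRNN(32, activation="tanh"),
    layers.Dense(1),
])

# LSTM: tanh for the output activation, sigmoid for the recurrent (gate) activation.
lstm = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 8)),
    layers.LSTM(32, activation="tanh", recurrent_activation="sigmoid"),
    layers.Dense(1),
])
```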