The Vanishing/Exploding Gradients Problems

Kinder Chen
2 min read · Nov 4, 2021

Gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers’ connection weights virtually unchanged, and training never converges to a good solution; this is called the vanishing gradients problem. Conversely, the gradients can grow bigger and bigger until the lower layers receive extremely large weight updates and the algorithm diverges; this is called the exploding gradients problem, and it surfaces most often in recurrent neural networks.
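To see why the gradient can vanish or explode, note that backpropagation multiplies together one factor per layer (or per time step in a recurrent network). The short Python loop below is a minimal sketch of that effect; the factors 0.5 and 1.5 and the depth of 30 are illustrative assumptions, not values from any particular network.

```python
# Illustrative sketch: the backpropagated gradient is roughly a product of one
# factor per layer (or per time step in a recurrent network). Factors below 1
# shrink it toward zero; factors above 1 blow it up.
factor_small, factor_large = 0.5, 1.5   # assumed per-layer factors
grad_vanish, grad_explode = 1.0, 1.0
for layer in range(30):
    grad_vanish *= factor_small
    grad_explode *= factor_large
print(f"after 30 layers: {grad_vanish:.2e} (vanished), {grad_explode:.2e} (exploded)")
```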

Neural networks are trained using stochastic gradient descent. This involves first calculating the prediction error made by the model and then using that error to estimate a gradient, which is used to update each weight in the network so that less error is made next time. This error gradient is propagated backward through the network from the output layer to the input layer. It is desirable to train neural networks with many layers, as adding more layers increases the capacity of the network, making it capable of learning from a large training dataset and of efficiently representing more complex mapping functions from inputs to outputs.
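As a concrete illustration of this training step, here is a minimal sketch (not the author’s code) of a single stochastic gradient descent update for a tiny two-layer network in NumPy; the architecture, learning rate, and squared-error loss are assumptions chosen for the example.

```python
# Minimal sketch: one SGD step for a tiny two-layer network, showing how the
# prediction error is propagated backward and used to update each weight.
import numpy as np

rng = np.random.default_rng(0)

# Assumed tiny network: 3 inputs -> 4 hidden units (sigmoid) -> 1 linear output
W1 = rng.normal(scale=0.5, size=(3, 4))
W2 = rng.normal(scale=0.5, size=(4, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=(1, 3))   # one training example
y = np.array([[1.0]])         # its target
lr = 0.1                      # learning rate

# Forward pass: compute the prediction.
h = sigmoid(x @ W1)
y_hat = h @ W2

# Error and its gradient at the output (squared-error loss).
error = y_hat - y
grad_W2 = h.T @ error

# Propagate the error backward to the first layer's weights (chain rule).
grad_h = error @ W2.T
grad_W1 = x.T @ (grad_h * h * (1 - h))

# Gradient descent update: nudge each weight to reduce the error next time.
W2 -= lr * grad_W2
W1 -= lr * grad_W1
```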

A problem with training networks with many layers is that the gradient diminishes dramatically as it is propagated backward through the network. The error may be so small by the time it reaches the layers close to the input of the model that it has very little effect. As such, this problem is referred to as the vanishing gradients problem. In fact, the error gradient can be unstable in deep neural networks and not only vanish but also explode, where the gradient increases exponentially as it is propagated backward through the network. This is referred to as the exploding gradients problem.
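The sketch below (an assumed setup, not taken from the article) makes this concrete: it backpropagates a unit error signal through a 20-layer sigmoid network and prints the gradient norm at each layer, which shrinks sharply toward the layers nearest the input.

```python
# Sketch: measure the gradient norm at each layer of a deep sigmoid network
# and watch it shrink toward the input layers (assumed depth, width, init).
import numpy as np

rng = np.random.default_rng(1)
n_layers, width = 20, 64
weights = [rng.normal(scale=0.5, size=(width, width)) for _ in range(n_layers)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass, keeping each layer's activations for backpropagation.
a = rng.normal(size=(1, width))
activations = [a]
for W in weights:
    a = sigmoid(a @ W)
    activations.append(a)

# Backward pass with a unit error signal at the output.
grad = np.ones_like(a)
norms = []
for W, a in zip(reversed(weights), reversed(activations[1:])):
    grad = (grad * a * (1 - a)) @ W.T   # chain rule through sigmoid and weights
    norms.append(np.linalg.norm(grad))

# Gradient norms per layer, from the layer nearest the input (layer 1)
# to the layer nearest the output.
for i, n in enumerate(reversed(norms), start=1):
    print(f"layer {i:2d}: gradient norm = {n:.3e}")
```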
