Ridge Regression
One way to reduce overfitting is to regularize the model, i.e., to constrain it. The fewer degrees of freedom it has, the harder it will be for it to overfit the data. For a linear model, regularization is typically achieved by constraining the weights of the model.
Ridge Regression is a regularized version of Linear Regression: a regularization term equal to $\alpha \sum_{i=1}^{n} \theta_i^2$ is added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible. Note that the regularization term should only be added to the cost function during training. Once the model is trained, we want to evaluate its performance using the unregularized performance measure.
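To make the training/evaluation distinction concrete, here is a minimal NumPy sketch (the names `ridge_training_cost`, `evaluation_mse`, and `X_b`, a feature matrix with a bias column prepended, are illustrative rather than taken from the text):

```python
import numpy as np

def ridge_training_cost(theta, X_b, y, alpha):
    # Regularized objective minimized during training: MSE plus the penalty
    # alpha * sum of squared weights (the bias term theta[0] is excluded).
    errors = X_b @ theta - y
    return (errors ** 2).mean() + alpha * np.sum(theta[1:] ** 2)

def evaluation_mse(theta, X_b, y):
    # Unregularized performance measure used once the model is trained.
    return ((X_b @ theta - y) ** 2).mean()
```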
The hyperparameter α controls how much we want to regularize the model. If α = 0, Ridge Regression is just Linear Regression. If α is very large, all weights end up very close to zero and the result is a flat line going through the data’s mean. Increasing α therefore flattens the predictions, reducing the model’s variance but increasing its bias. The Ridge Regression cost function is:

$$J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + \alpha \sum_{i=1}^{n} \theta_i^2$$

Note that the sum starts at i = 1, not 0: the bias term $\theta_0$ is not regularized.
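As a quick illustration of α’s effect, the following sketch (assuming scikit-learn and synthetic data) fits Ridge models with increasing α and prints the learned weights, which shrink toward zero as α grows:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)  # noisy linear data

for alpha in (0.01, 1, 100):
    ridge = Ridge(alpha=alpha).fit(X, y)
    # Larger alpha shrinks the slope toward zero (flatter predictions).
    print(f"alpha={alpha}: intercept={ridge.intercept_:.2f}, coef={ridge.coef_[0]:.2f}")
```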
It is important to scale the data before performing Ridge Regression, as it is sensitive to the scale of the input features. This is true of most regularized models. As with Linear Regression, we can perform Ridge Regression either by solving a closed-form equation or by performing Gradient Descent. The pros and cons are the same.
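Here is a sketch of both routes using scikit-learn (the pipeline also handles the feature scaling mentioned above; data and parameter values are illustrative): Ridge with solver="cholesky" computes a closed-form solution, while SGDRegressor with an L2 penalty performs Ridge Regression via Stochastic Gradient Descent.

```python
import numpy as np
from sklearn.linear_model import Ridge, SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

# Closed-form route: solves the regularized normal equation directly.
ridge_closed_form = make_pipeline(StandardScaler(), Ridge(alpha=1.0, solver="cholesky"))
ridge_closed_form.fit(X, y)

# Gradient Descent route: SGD with an L2 penalty is Ridge Regression.
ridge_sgd = make_pipeline(StandardScaler(),
                          SGDRegressor(penalty="l2", alpha=0.1, max_iter=1000, random_state=42))
ridge_sgd.fit(X, y)

print(ridge_closed_form.predict([[1.5]]), ridge_sgd.predict([[1.5]]))
```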