The Bias/Variance Trade-off
A model’s generalization error can be expressed as the sum of three very different errors: bias, variance, and irreducible error. This blog introduces the bias/variance trade-off and the related notions of overfitting and underfitting.
Bias and Variance
The bias part of the generalization error is due to wrong assumptions, such as assuming the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data. Variance is due to the model’s excessive sensitivity to small variations in the training data; a model with many degrees of freedom (such as a high-degree polynomial) is likely to have high variance and thus overfit the training data. Irreducible error is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data.
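Under squared error loss, this three-way split has a standard written form, the bias/variance decomposition. The symbols below (the true function f, the learned model f̂, the noise variance σ²) are introduced here for reference and are not part of the original wording; the expectation is taken over training sets and noise, assuming y = f(x) + ε with zero-mean noise:

```latex
\mathbb{E}\!\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  \;+\; \underbrace{\mathbb{E}\!\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
  \;+\; \underbrace{\sigma^2}_{\text{irreducible error}}
```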
Increasing a model’s complexity will typically increase its variance and reduce its bias. Conversely, reducing a model’s complexity increases its bias and reduces its variance. This is why it is called a trade-off.
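As a concrete illustration, here is a minimal sketch using scikit-learn. The noisy quadratic data-generating function, the polynomial degrees, and all other settings are illustrative assumptions of mine, not something prescribed by the post: low degrees give high bias (both errors high), very high degrees give high variance (training error low, validation error high).

```python
# Sketch: how model complexity (polynomial degree) trades bias for variance.
# The quadratic data-generating function and all settings are illustrative choices.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(60, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0, 1, size=60)  # noisy quadratic

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=42)

for degree in (1, 2, 10, 20):
    model = make_pipeline(PolynomialFeatures(degree), StandardScaler(), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    # degree 1: both errors high (high bias, underfitting)
    # degree 20: training error near zero, validation error large (high variance, overfitting)
    print(f"degree={degree:2d}  train MSE={train_mse:8.2f}  val MSE={val_mse:10.2f}")
```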
Overfitting and Underfitting
In general we won’t know what function generated the data, so it can be hard to tell whether a model is overfitting or underfitting.
If a model performs well on the training data but generalizes poorly according to the cross-validation metrics, then the model is overfitting. If it performs poorly on both, then it is underfitting. Another way to tell is to look at the learning curves: plots of the model’s performance on the training set and the validation set as a function of the training set size (or the training iteration). To generate these plots, train the model several times on differently sized subsets of the training set.
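A minimal sketch of such learning curves, assuming scikit-learn and the same kind of synthetic quadratic data as above (both my choices, not from the post): the model is refit on progressively larger slices of the training set, and the training and validation RMSE are recorded at each size, once for an underfitting linear model and once for an overfitting degree-10 polynomial.

```python
# Sketch: learning curves for an underfitting model and an overfitting model.
# Synthetic quadratic data and all settings are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

def plot_learning_curves(model, X_train, X_val, y_train, y_val, label):
    """Refit `model` on growing slices of the training set and plot train/val RMSE."""
    train_errors, val_errors = [], []
    sizes = range(5, len(X_train) + 1)
    for m in sizes:
        model.fit(X_train[:m], y_train[:m])
        train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
        val_errors.append(mean_squared_error(y_val, model.predict(X_val)))
    plt.plot(sizes, np.sqrt(train_errors), label=f"{label} (train)")
    plt.plot(sizes, np.sqrt(val_errors), label=f"{label} (val)")

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0, 1, size=200)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# Plain linear model: both curves plateau at a high error -> underfitting.
plot_learning_curves(LinearRegression(), X_train, X_val, y_train, y_val, "degree 1")
# Degree-10 polynomial: a gap between the curves that narrows with more data -> overfitting.
poly = make_pipeline(PolynomialFeatures(degree=10), StandardScaler(), LinearRegression())
plot_learning_curves(poly, X_train, X_val, y_train, y_val, "degree 10")
plt.xlabel("training set size"); plt.ylabel("RMSE"); plt.legend(); plt.show()
```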
If the model is underfitting the training data, adding more training examples will not help; we need to use a more complex model or come up with better features. To improve an overfitting model, one option is to feed it more training data until the validation error reaches the training error.
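To make the first point concrete, the short sketch below (again scikit-learn on assumed synthetic quadratic data; the sample sizes are arbitrary) shows that giving an underfitting linear model ten times as much data barely changes its validation error, while adding a quadratic feature reduces it substantially.

```python
# Sketch: more data does not fix underfitting, but a better feature does.
# Synthetic quadratic data; all numbers are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(1)

def make_data(n):
    X = rng.uniform(-3, 3, size=(n, 1))
    y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(0, 1, size=n)
    return X, y

X_val, y_val = make_data(1000)

def val_mse(X_train, y_train, add_square=False):
    if add_square:  # "better feature": append x^2 as an extra column
        Xtr, Xva = np.hstack([X_train, X_train ** 2]), np.hstack([X_val, X_val ** 2])
    else:
        Xtr, Xva = X_train, X_val
    model = LinearRegression().fit(Xtr, y_train)
    return mean_squared_error(y_val, model.predict(Xva))

X_small, y_small = make_data(100)
X_big, y_big = make_data(1000)
print("linear,    100 samples:", round(val_mse(X_small, y_small), 2))                 # high error
print("linear,   1000 samples:", round(val_mse(X_big, y_big), 2))                     # still high: more data didn't help
print("x^2 feat.,  100 samples:", round(val_mse(X_small, y_small, add_square=True), 2))  # much lower
```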