Lasso Regression and Elastic Net
Lasso Regression
Lasso Regression (Least Absolute Shrinkage and Selection Operator Regression) is a regularized version of Linear Regression: it adds a regularization term to the cost function using the ℓ1 norm of the weight vector. The Lasso Regression cost function is
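$$J(\theta) = \operatorname{MSE}(\theta) + \alpha \sum_{i=1}^{n} \left|\theta_i\right|$$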
An important characteristic of Lasso Regression is that it tends to eliminate the weights of the least important features, i.e., set them to zero. In other words, Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., one with few nonzero feature weights). The Lasso cost function is not differentiable at θi = 0 (for i = 1, 2, …, n), but Gradient Descent still works fine if we use a subgradient vector instead whenever any θi = 0. To keep Gradient Descent from bouncing around the optimum at the end of training when using Lasso, we need to gradually reduce the learning rate. It will still bounce around the optimum, but the steps will get smaller and smaller, so it will converge.
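As a quick illustration, here is a minimal sketch using scikit-learn's Lasso and SGDRegressor classes (the toy data, the alpha value, and the variable names are arbitrary choices for this example):

```python
import numpy as np
from sklearn.linear_model import Lasso, SGDRegressor

# Toy data: only the first of five features actually matters.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

lasso_reg = Lasso(alpha=0.1)  # alpha controls the regularization strength
lasso_reg.fit(X, y)
print(lasso_reg.coef_)  # weights of the useless features end up at (or near) zero

# The same idea with Stochastic Gradient Descent: an l1 penalty plus a
# gradually decreasing learning rate (SGDRegressor's default "invscaling"
# schedule shrinks the step size over time, so training converges).
sgd_reg = SGDRegressor(penalty="l1", alpha=0.1, learning_rate="invscaling")
sgd_reg.fit(X, y)
```

Printing lasso_reg.coef_ on data like this typically shows a large weight on the first feature and the remaining weights driven to zero, which is the sparsity the paragraph above describes.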
Elastic Net
Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of both Ridge's and Lasso's regularization terms, and we can control the mix ratio r. When r = 0, Elastic Net is equivalent to Ridge Regression, and when r = 1, it is equivalent to Lasso Regression. The Elastic Net cost function is
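$$J(\theta) = \operatorname{MSE}(\theta) + r\,\alpha \sum_{i=1}^{n} \left|\theta_i\right| + \frac{1 - r}{2}\,\alpha \sum_{i=1}^{n} \theta_i^2$$

Here is a minimal sketch using scikit-learn's ElasticNet class, whose l1_ratio hyperparameter corresponds to the mix ratio r (the alpha value and toy data are arbitrary choices for this example):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

# l1_ratio is the mix ratio r: r = 0 is pure Ridge, r = 1 is pure Lasso.
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
print(elastic_net.predict(X[:1]))
```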
Deciding when to use plain Linear Regression, Ridge, Lasso, or Elastic Net can be tricky. It is almost always preferable to have at least a little bit of regularization, so generally we should avoid plain Linear Regression. Ridge is a good default, but if we suspect that only a few features are useful, we should prefer Lasso or Elastic Net because they tend to reduce the useless features' weights down to zero. In general, Elastic Net is preferred over Lasso because Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.
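One practical way to act on this advice is to compare the regularized models by cross-validation. The sketch below does this with scikit-learn (the alpha and l1_ratio values here are arbitrary starting points that we would normally tune, e.g., with a grid search):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, model in models.items():
    # Average the 5-fold cross-validated mean squared error for each model.
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name}: cross-validated MSE = {mse:.4f}")
```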