Let’s go over some widely used regularization techniques and the key differences between them. If you need a refresher on regularization in supervised learning models, start here.
When your data set has a large number of features, you may want a less complex, more parsimonious model. L1 and L2 regularization are two widely used techniques for addressing overfitting and, in the case of L1, for feature selection.
L1 vs. L2 Regularization Methods
- L1 regularization adds the “absolute value of magnitude” of each coefficient as a penalty term to the loss function.
- L2 regularization adds the “squared magnitude” of each coefficient as a penalty term to the loss function.
A regression model that uses L1 regularization is called lasso regression, and a model that uses L2 regularization is called ridge regression.
The key difference between these two is the penalty term.
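To make the two penalty terms concrete, here is a minimal Python sketch that computes each one for a given set of coefficients. The function names, the example coefficients and the value of lambda are hypothetical, chosen only for illustration.

```python
def l1_penalty(coefficients, lam):
    """Lasso-style penalty: lambda times the sum of absolute coefficient values."""
    return lam * sum(abs(b) for b in coefficients)


def l2_penalty(coefficients, lam):
    """Ridge-style penalty: lambda times the sum of squared coefficient values."""
    return lam * sum(b * b for b in coefficients)


coefficients = [3.0, -4.0]  # hypothetical fitted coefficients
lam = 0.5                   # hypothetical regularization strength

print(l1_penalty(coefficients, lam))  # 0.5 * (3 + 4) = 3.5
print(l2_penalty(coefficients, lam))  # 0.5 * (9 + 16) = 12.5
```

Either penalty is added to the ordinary loss, so larger coefficients cost more. The squared version punishes large coefficients much more heavily than small ones, while the absolute version grows at the same rate everywhere.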
L1 Regularization: Lasso Regression
Lasso is an acronym for least absolute shrinkage and selection operator, and lasso regression adds the “absolute value of magnitude” of the coefficient as a penalty term to the loss function.
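One common way to write the lasso cost function is sketched below. The notation is an assumption made for illustration: n observations, p features, coefficients β_j, regularization strength λ, and no intercept term.

```latex
\sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
\;+\; \underbrace{\lambda \sum_{j=1}^{p} \lvert \beta_j \rvert}_{\text{L1 penalty}}
```

The first term is the ordinary least squares loss, and the second term is the L1 regularization element.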
If lambda is zero, the penalty vanishes and we get back OLS (ordinary least squares), whereas a very large lambda drives all the coefficients to zero, which leaves the model underfit.
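The effect of lambda can be checked by hand in the simplest possible case. The sketch below assumes a single feature, no intercept, and a loss of the form sum of squared errors plus lambda times the absolute value of the weight. Under those assumptions the minimizer has a closed form known as soft-thresholding. The function name and the toy data are hypothetical.

```python
def lasso_1d(x, y, lam):
    """Minimize sum((y_i - w * x_i)**2) + lam * abs(w) for a single weight w."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # Soft-thresholding: shrink |sxy| by lam / 2 and clip at zero.
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    if shrunk == 0.0:
        return 0.0
    sign = 1.0 if sxy > 0 else -1.0
    return sign * shrunk / sxx


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]  # toy data generated by y = 2x

for lam in [0.0, 28.0, 100.0]:
    print(lam, lasso_1d(x, y, lam))
```

With lambda at 0 the weight is the OLS value of 2.0, at 28 it is shrunk to 1.0, and at 100 it is exactly 0.0, at which point the model predicts zero for every input and is underfit.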
L2 Regularization: Ridge Regression
Ridge regression adds the “squared magnitude” of the coefficient as the penalty term to the loss function. The highlighted part below represents the L2 regularization element.
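One common way to write the ridge cost function is sketched here, with the L2 penalty marked by the underbrace. The notation is an assumption made for illustration: n observations, p features, coefficients β_j, regularization strength λ, and no intercept term.

```latex
\sum_{i=1}^{n} \Bigl( y_i - \sum_{j=1}^{p} x_{ij}\,\beta_j \Bigr)^{2}
\;+\; \underbrace{\lambda \sum_{j=1}^{p} \beta_j^{2}}_{\text{L2 penalty}}
```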
Here, if lambda is zero, we get back OLS. If lambda is very large, however, the penalty carries too much weight and the model underfits. The choice of lambda therefore matters: with a well-chosen value, this technique works very well to avoid overfitting.
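A small sketch can show how lambda controls the amount of shrinkage in ridge regression. It assumes a single feature, no intercept, and a loss of the form sum of squared errors plus lambda times the squared weight, for which the minimizer is sum(x*y) / (sum(x*x) + lambda). The function name and the toy data are hypothetical.

```python
def ridge_1d(x, y, lam):
    """Minimize sum((y_i - w * x_i)**2) + lam * w**2 for a single weight w."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    # Setting the derivative to zero gives w = sxy / (sxx + lam).
    return sxy / (sxx + lam)


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]  # toy data generated by y = 2x

for lam in [0.0, 14.0, 1_000_000.0]:
    print(lam, ridge_1d(x, y, lam))
```

With lambda at 0 the weight is the OLS value of 2.0, and at 14 it is halved to 1.0. With a very large lambda the weight becomes tiny, so the model underfits, but it never reaches exactly zero.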
The key difference between these techniques is that lasso shrinks the coefficients of the less important features to zero, removing those features altogether. Ridge, by contrast, shrinks coefficients toward zero but generally does not set them exactly to zero. In other words, L1 regularization works well for feature selection when we have a huge number of features.
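The contrast can be illustrated with a simplified sketch. It assumes the features are orthonormal, so each coefficient can be treated independently, and that the loss is the sum of squared errors plus the lambda-weighted penalty. Under those assumptions lasso soft-thresholds each OLS coefficient by lambda / 2, while ridge divides each one by 1 + lambda. The function names and the OLS coefficients below are hypothetical.

```python
def lasso_shrink(ols_coefs, lam):
    """Soft-threshold each OLS coefficient by lam / 2 (orthonormal features assumed)."""
    threshold = lam / 2.0
    shrunk = []
    for b in ols_coefs:
        magnitude = abs(b) - threshold
        if magnitude <= 0.0:
            shrunk.append(0.0)  # small coefficients are removed entirely
        else:
            shrunk.append(magnitude if b > 0 else -magnitude)
    return shrunk


def ridge_shrink(ols_coefs, lam):
    """Scale each OLS coefficient by 1 / (1 + lam) (orthonormal features assumed)."""
    return [b / (1.0 + lam) for b in ols_coefs]


ols_coefs = [3.0, -0.2, 0.05, 1.5]  # hypothetical unregularized coefficients
lam = 1.0

lasso_coefs = lasso_shrink(ols_coefs, lam)
ridge_coefs = ridge_shrink(ols_coefs, lam)

print("lasso:", lasso_coefs)
print("ridge:", ridge_coefs)
print("features dropped by lasso:", sum(1 for b in lasso_coefs if b == 0.0))
print("features dropped by ridge:", sum(1 for b in ridge_coefs if b == 0.0))
```

In this example lasso zeroes out the two small coefficients and keeps the two large ones, while ridge keeps all four at reduced size. Real data rarely has orthonormal features, so a library solver would be needed in practice, but the qualitative behavior is the same.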
Traditional methods for feature selection and handling overfitting, like cross-validation and stepwise regression, work well with a small set of features, but L1 and L2 regularization are a great alternative when you’re dealing with a large set of features.