Let’s go over some widely used regularization techniques and the key differences between them.

When you have a large number of features in your data set, you may wish to create a less complex, more parsimonious model. Two widely used regularization techniques for addressing overfitting and performing feature selection are L1 and L2 regularization.

L1 vs. L2 Regularization Methods

  • L1 regularization, also called lasso regression, adds the absolute value of the magnitude of each coefficient as a penalty term to the loss function.
  • L2 regularization, also called ridge regression, adds the squared magnitude of each coefficient as the penalty term to the loss function.

The key difference between these two is the penalty term.

L1 Regularization: Lasso Regression

Lasso is an acronym for least absolute shrinkage and selection operator. Lasso regression adds the absolute value of the magnitude of each coefficient as a penalty term to the loss function, as shown in the cost function below.

$$\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{p}\left|\beta_j\right|$$

Cost function for lasso regression; the second term is the L1 penalty.

If lambda is zero, we get back ordinary least squares (OLS), whereas a very large value drives the coefficients all the way to zero and the model underfits.
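
To make the effect of lambda concrete, here is a minimal sketch using scikit-learn’s Lasso, which exposes lambda as alpha. The synthetic data set and the alpha values are illustrative assumptions, not recommendations.

```python
# A minimal sketch of the effect of lambda (exposed as "alpha" in
# scikit-learn) on lasso regression; data and alphas are illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 100 samples, 20 features, only 5 of them informative.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# alpha = 0 corresponds to plain OLS: essentially no coefficient is zero.
ols = LinearRegression().fit(X, y)
print(f"OLS nonzero coefficients: {np.sum(ols.coef_ != 0)}")

# As alpha grows, more and more coefficients are shrunk exactly to zero.
for alpha in [0.1, 1.0, 100.0]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    print(f"alpha={alpha}: nonzero coefficients = {np.sum(lasso.coef_ != 0)}")
```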

L2 Regularization: Ridge Regression

Ridge regression adds the squared magnitude of each coefficient as the penalty term to the loss function. In the cost function below, the second term is the L2 regularization element.

$$\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$

Cost function for ridge regression; the second term is the L2 penalty.

Here, too, if lambda is zero we get back OLS, while a very large lambda adds too much weight to the penalty and leads to underfitting. That makes the choice of lambda important; chosen well, this technique works very well to avoid overfitting.
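
A common way to choose lambda is cross-validation. Below is a minimal sketch using scikit-learn’s RidgeCV; the alpha grid and the synthetic data are illustrative assumptions.

```python
# A minimal sketch of picking lambda for ridge regression by
# cross-validation with scikit-learn's RidgeCV; the alpha grid and
# synthetic data are illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=100, n_features=20, noise=10.0,
                       random_state=0)

# RidgeCV scores each candidate alpha by cross-validation and keeps
# the one that generalizes best.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
print(f"chosen alpha: {ridge.alpha_}")

# Ridge shrinks coefficients toward zero but, unlike lasso, rarely
# makes them exactly zero.
print(f"exactly-zero coefficients: {np.sum(ridge.coef_ == 0)}")
```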

The key difference between these techniques is that lasso shrinks the less important features’ coefficients to zero, thus removing some features altogether. In other words, L1 regularization works well for feature selection when we have a huge number of features, as the sketch below shows.
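
As a sketch of L1-based feature selection, scikit-learn’s SelectFromModel can keep only the features whose lasso coefficients end up nonzero. The data set and the alpha value here are illustrative assumptions.

```python
# A minimal sketch of L1-based feature selection with scikit-learn's
# SelectFromModel; the data set and alpha value are illustrative.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# 50 features, but only 5 of them carry signal.
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

# Lasso zeroes out the unimportant coefficients, so SelectFromModel
# keeps only the features with (effectively) nonzero coefficients.
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
X_selected = selector.transform(X)
print("features kept:", X_selected.shape[1], "of", X.shape[1])
```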

Traditional methods for feature selection and overfitting control, such as cross-validation and stepwise regression, work well with a small set of features, but L1 and L2 regularization are a great alternative when you’re dealing with a large set of features.
