L1 and L2 Regularization Methods, Explained

L1 and L2 regularization are two of the most widely used ways to manage overfitting and perform feature selection when you’ve got a large set of features in a machine learning model.

Written by Anuja Nagpal
Updated by Brennan Whitfield | Oct 03, 2024

L1 and L2 regularization are methods used to mitigate overfitting in machine learning models. These techniques are often applied when a model’s data set has a large number of features, and a less complex model is needed.

A regression model that uses the L1 regularization technique is called lasso regression, and a model that uses the L2 is called ridge regression.

L1 vs. L2 Regularization Methods

  • L1 Regularization: Also called lasso regression, adds the sum of the absolute values of the coefficients (the “absolute value of magnitude”) as a penalty term to the loss function.
  • L2 Regularization: Also called ridge regression, adds the sum of the squared coefficients (the “squared magnitude”) as the penalty term to the loss function.

Both L1 and L2 regularization add a penalty term to the loss function in a machine learning model. This penalty discourages the model from fitting the training data too closely and reduces overall model complexity. However, the two regularization methods have different penalty terms and specific use cases. Here are the key differences between them.
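As a rough illustration (not from the original article), the sketch below shows how each penalty term attaches to a plain SSE loss; the function and variable names are placeholder assumptions.

```python
# Minimal sketch of how the L1 and L2 penalty terms modify a plain
# sum-of-squared-errors (SSE) loss. Names and values are illustrative only.
import numpy as np

def sse(y_true, y_pred):
    return np.sum((y_true - y_pred) ** 2)

def l1_penalized_loss(y_true, y_pred, coefs, lam):
    # Lasso: SSE plus lambda times the sum of absolute coefficient values
    return sse(y_true, y_pred) + lam * np.sum(np.abs(coefs))

def l2_penalized_loss(y_true, y_pred, coefs, lam):
    # Ridge: SSE plus lambda times the sum of squared coefficient values
    return sse(y_true, y_pred) + lam * np.sum(coefs ** 2)
```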


L1 Regularization: Lasso Regression

L1 regularization, or lasso (“least absolute shrinkage and selection operator”) regression, is a regularization method that penalizes high-value coefficients in a machine learning model. L1’s penalty term is the sum of the absolute values of the coefficients (the “absolute value of magnitude” of each coefficient), and is added to the model’s sum of squared errors (SSE) loss function.

Both the L1 and L2 penalty terms are scaled by the model hyperparameter lambda (λ), which controls the tradeoff between bias and variance in the coefficient estimates.

Lasso regression cost function: SSE + λ × (sum of the absolute values of the coefficients)

Lasso regression can reduce coefficient values to exactly zero, enabling feature selection and removal. This also allows lasso regression to handle some multicollinearity (high correlations among features) in a data set without affecting interpretability. However, lasso regression isn’t suitable for severe multicollinearity: when covariates are highly correlated, L1 regularization will arbitrarily keep one of them and shrink the others to zero.
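To make this concrete, here is a small, hedged sketch using scikit-learn’s Lasso on synthetic data (the data set and alpha value are assumptions for illustration): with two nearly identical features, lasso typically keeps one and shrinks the other to exactly zero.

```python
# Synthetic example: x1 and x2 are almost perfectly correlated.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # nearly identical to x1
x3 = rng.normal(size=200)                    # an independent feature
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 0.5 * x3 + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)  # typically one of the correlated pair is exactly 0.0
```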

Although it’s used to resolve overfitting, L1 regularization can actually cause model underfitting, depending on the lambda value. If lambda is zero, regularization is disabled and we get back ordinary least squares (OLS); if lambda is very large, all coefficients are driven to zero and the model becomes underfit.
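The two extremes can be checked directly. In the hedged sketch below (the data and alpha values are assumptions; scikit-learn calls lambda “alpha”), a near-zero alpha reproduces the OLS coefficients and a very large alpha zeroes everything out.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)
tiny = Lasso(alpha=1e-6).fit(X, y)    # near-zero lambda: essentially OLS
huge = Lasso(alpha=100.0).fit(X, y)   # very large lambda: all coefficients 0

print(ols.coef_)
print(tiny.coef_)   # close to the OLS values
print(huge.coef_)   # all zeros, i.e. an underfit model
```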


L2 Regularization: Ridge Regression

L2 regularization, or ridge regression, is a regularization method that penalizes high-value coefficients in a machine learning model similarly to L1 regularization, though with a different penalty term. L2 regularization adds the sum of the squared coefficients (the “squared magnitude” of each coefficient) as the penalty term to the model’s SSE loss function.

Unlike L1 regularization, L2 regularization can reduce coefficient values toward zero but never exactly to zero. This also means that L2 regularization cannot perform feature selection. 
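A side-by-side comparison makes the difference visible. This is a hedged sketch on assumed synthetic data: lasso produces exact zeros for the irrelevant features, while ridge only shrinks them toward zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print(lasso.coef_)  # several entries are exactly 0.0
print(ridge.coef_)  # small but nonzero values for the irrelevant features
```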

The L2 regularization element is the λ-scaled sum of squared coefficients added to the loss:

Ridge regression cost function: SSE + λ × (sum of the squared coefficients)
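Because this cost function is quadratic, ridge regression has a closed-form solution. The numpy sketch below is a minimal illustration of that formula (intercept handling is omitted, and the function name is an assumption).

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    # Minimizes SSE + lam * sum(beta**2): beta = (X^T X + lam * I)^-1 X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)
```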

If model interpretability isn’t a concern, L2 regularization can effectively handle multicollinearity in a data set. Because ridge regression shrinks coefficients toward (but not to) zero, it distributes the effect of correlated variables more evenly across features.

Again, if lambda is zero we get back OLS, and if lambda is very large the penalty dominates and the model underfits. How we choose lambda is therefore important; with a well-chosen value, this technique works very well to avoid overfitting.
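In practice, lambda is usually chosen by cross-validation. The sketch below uses scikit-learn’s RidgeCV for that purpose; the alpha grid and synthetic data are assumptions.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X, y)
print(model.alpha_)  # the lambda value selected by cross-validation
```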

 

L2 Regularization explained. | Video: StatQuest With Josh Starmer

When to Use L1 vs. L2 Regularization

The key difference between the L1 and L2 regularization techniques is that lasso regression shrinks the coefficients of less important features all the way to zero, removing some features from the model altogether. In other words, L1 regularization works well for feature selection when you have a large number of features.

Traditional methods for feature selection and overfitting control, such as cross-validation and stepwise regression, work well with a small set of features, but L1 and L2 regularization methods are a great alternative when you’re dealing with a large one.
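As one hedged, end-to-end illustration of that scenario (the wide synthetic data set below is an assumption), cross-validated lasso can prune a 100-feature problem down to a handful of nonzero coefficients.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 100))   # many features, only the first five matter
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -1.0]) + rng.normal(scale=0.5, size=300)

model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(len(selected), selected[:10])   # most coefficients are zeroed out
```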

Frequently Asked Questions

What is L2 regularization?

L2 regularization, or ridge regression, is a machine learning regularization technique used to reduce overfitting in a model. L2 regularization’s penalty term is the sum of the squared coefficients, which is added to the model’s sum of squared errors (SSE) loss function. L2 regularization can reduce coefficient values and feature weights toward zero (but never exactly to zero), so it cannot perform feature selection like L1 regularization.

When should you use L1 vs. L2 regularization?

L1 regularization is most effective for enabling feature selection and maintaining model interpretability, while L2 regularization is effective for handling multicollinearity and prioritizing model accuracy and stability.

How do L1 and L2 regularization prevent overfitting?

Both L2 regularization and L1 regularization prevent overfitting in machine learning models by penalizing large coefficient values. This prevents models from putting too much importance on specific features, which encourages a more balanced weight distribution across all features and a generalized, less complex model.

How are the L1 and L2 norms calculated?

The L1 norm is calculated as the sum of the absolute values of the vector’s elements. The L2 norm is calculated as the square root of the sum of the squared vector elements.
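For a quick worked example (the coefficient values are made up), the two norms can be computed directly:

```python
import numpy as np

beta = np.array([1.0, -2.0, 3.0])
l1_norm = np.sum(np.abs(beta))         # |1| + |-2| + |3| = 6.0
l2_norm = np.sqrt(np.sum(beta ** 2))   # sqrt(1 + 4 + 9) ≈ 3.742
print(l1_norm, l2_norm)
```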
