I previously discussed model underfitting and overfitting. Essentially, these two concepts describe different ways that the model can fail to match your data set. Underfitting refers to making a model that’s too simple and performs poorly on both its training data and new data sets, missing trends in the data. Overfitting refers to a situation where a model performs well on its training data but performs poorly when exposed to new data.
Another way we can think about these topics is through the terms bias and variance. These two terms are fundamental concepts in data science and represent another way to think about the challenges of model fit. Understanding these two concepts will help you create useful simulation models.
What Is the Bias-Variance Tradeoff?
What Are Model Bias and Variance?
Both terms describe how a model changes as you retrain it using different portions of a given data set. By changing the portion of the data set used to train the model, you can change the functions describing the resulting model. However, models of different structures will respond to new data sets in different ways. Bias and variance describe the two different ways models can respond.
Bias vs. Variance
- Bias describes the error that occurs when a model simplifies the learning process by making assumptions about the data and, as a result, fails to capture enough detail about the data. A model with high bias won’t match the data set closely, while a model with low bias will match it very closely.
- Variance describes the error that occurs when a model is overly sensitive to its data, identifying noise in addition to patterns. A model with high variance will produce distinct results when trained on various data sets, resulting in dramatically different models each time.
Typically, models with high bias have low variance, and models with high variance have low bias, because the two errors come from opposite types of models. A model that’s not flexible enough to match a data set closely (high bias) is also not flexible enough to change dramatically when given a different data set (low variance). Conversely, a model flexible enough to track every data set closely (low bias) will necessarily change a lot when the data set changes (high variance).
Those who’ve read my previous article on underfitting and overfitting will probably note a lot of similarities between these concepts. Underfit models usually have high bias and low variance. Overfit models usually have high variance and low bias.
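These relationships can be checked empirically. The sketch below is an illustrative example rather than something from a specific library or data set: the sine curve, noise level, and polynomial degrees are all my assumptions. It fits a rigid (degree-1) and a flexible (degree-9) polynomial to many resampled training sets with NumPy, then measures how far the average fit sits from the true curve (a bias estimate) and how much the individual fits disagree with each other (a variance estimate):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical "true" underlying relationship for this demo.
    return np.sin(2 * np.pi * x)

def bias_and_variance(degree, n_datasets=200, n_points=30):
    """Fit a polynomial of the given degree on many resampled noisy
    training sets, then estimate squared bias and variance of the fits."""
    x_eval = np.linspace(0.05, 0.95, 50)
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0, 1, n_points)
        y = true_fn(x) + rng.normal(0, 0.3, n_points)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x_eval))
    preds = np.array(preds)
    # Bias: how far the *average* fitted curve is from the true curve.
    bias_sq = np.mean((preds.mean(axis=0) - true_fn(x_eval)) ** 2)
    # Variance: how much the individual fits disagree with each other.
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

b1, v1 = bias_and_variance(degree=1)  # rigid straight line
b9, v9 = bias_and_variance(degree=9)  # flexible polynomial
print(f"degree 1: bias^2={b1:.3f}, variance={v1:.3f}")
print(f"degree 9: bias^2={b9:.3f}, variance={v9:.3f}")
```

The rigid model shows high bias and low variance; the flexible one shows the opposite, matching the underfit/overfit pairing described above.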
What’s the Tradeoff Between Bias and Variance?
The bias-variance tradeoff is a commonly discussed concept in data science. Actions that you take to decrease bias (leading to a better fit to the training data) will simultaneously increase the variance in the model (leading to a higher risk of poor predictions). The inverse is also true: actions you take to reduce variance will inherently increase bias.
What Can I Do About the Bias-Variance Tradeoff?
Keep in mind that increasing variance is not always a bad thing. An underfit model is underfit because it’s too rigid: it doesn’t have enough flexibility (variance), which leads to consistently high bias errors. This means when you’re developing a model you need to find the right amount of variance, or the right amount of model complexity. The key is to increase model complexity, thus decreasing bias and increasing variance, until bias has been minimized but significant variance errors have not yet appeared.
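One simple way to search for that sweet spot is to sweep over complexity and watch error on a held-out validation set. The sketch below uses NumPy polynomial fitting; the sinusoidal data, noise level, and degree range are illustrative assumptions, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples from an underlying curve, split into train and validation.
x = rng.uniform(0, 1, 80)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 80)
x_train, y_train = x[:60], y[:60]
x_val, y_val = x[60:], y[60:]

def val_error(degree):
    """Train a polynomial of the given degree, return validation MSE."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)

# Increase complexity step by step and keep the degree whose
# validation error is lowest.
errors = {d: val_error(d) for d in range(1, 12)}
best = min(errors, key=errors.get)
print(f"best degree: {best}")
```

Validation error typically falls as complexity rises (bias shrinking), then climbs again once variance errors take over; the minimum marks the balance point.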
Another solution is to increase the size of the data set used to train your model. High variance errors, which lead to overfitting, come from creating a model that’s too complex for the available data set. If you’re able to use more data to train the model, then you can create a model that’s more complex without accidentally adding variance error.
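You can see this effect directly by holding model complexity fixed and growing the training set. In this sketch (same illustrative sine-plus-noise setup as above, with a deliberately flexible degree-9 polynomial), the disagreement between fits trained on different resampled data sets shrinks as each set gets larger:

```python
import numpy as np

rng = np.random.default_rng(2)

def prediction_variance(n_points, degree=9, n_datasets=200):
    """Fit the same flexible polynomial on many resampled training sets
    of a given size and measure how much the fitted curves disagree."""
    x_eval = np.linspace(0.05, 0.95, 50)
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0, 1, n_points)
        y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n_points)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_eval))
    return np.mean(np.array(preds).var(axis=0))

small = prediction_variance(20)   # complex model, little data
large = prediction_variance(200)  # same model, 10x the data
print(f"variance with 20 points: {small:.3f}")
print(f"variance with 200 points: {large:.3f}")
```

The same degree-9 model that swings wildly on 20 points becomes much more stable on 200, which is exactly why more data lets you afford more complexity.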
This trick doesn’t help with reducing bias error, unfortunately. A model with high bias, or an underfit model, is not sensitive to the training data. Therefore increasing the size of the data set won’t improve the model significantly, because the model isn’t able to respond to the change. The solution to high bias is to make the model more complex through techniques like adding more parameters or equipping it with more features via feature engineering.
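Here’s a minimal sketch of the feature-engineering route, using made-up quadratic data: a plain linear model misses the curvature no matter how much data it sees, while adding an engineered x² feature lets the same least-squares fit capture it (the coefficients and noise level are my assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Data with a quadratic trend that a straight line can't capture.
x = rng.uniform(-1, 1, 100)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.1, 100)

def fit_error(features):
    """Least-squares fit on the given feature columns (plus an
    intercept), returning the mean squared training error."""
    X = np.column_stack(features + [np.ones_like(x)])
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((X @ coeffs - y) ** 2)

linear_err = fit_error([x])            # high bias: misses the curve
engineered_err = fit_error([x, x**2])  # added feature captures it
print(f"linear MSE: {linear_err:.3f}, with x^2 feature: {engineered_err:.3f}")
```

The engineered model’s error drops sharply because the extra feature gives it the flexibility the linear model lacked.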
Where Can I Learn More?
I’ve written a few articles on similar topics. This article discusses the details of underfit and overfit models, and gives related examples. Another article discusses the process of model development, validation and testing that you can use to determine when a model has been well fit to the data set. This process will help you decide if you have the right amount of variance in the model and determine whether or not you need to add more data.
Finally, I learned a lot of my starting data science concepts by reading Joel Grus’s book Data Science from Scratch: First Principles with Python. His book introduces many fundamental data science concepts and provides example code to develop some initial functions in Python.
Frequently Asked Questions
What is the bias-variance tradeoff?
The bias-variance tradeoff describes the inverse relationship between bias and variance, where increasing one variable decreases the other. Striking a balance between the two allows a model to learn enough details about a data set without picking up noise and unnecessary information.
How do bias and variance affect model performance?
High bias occurs when a model is too simple and fails to gather enough details about the data, resulting in underfitting. On the other hand, high variance occurs when a model is extremely sensitive to a data set and picks up noise alongside patterns in the data, resulting in overfitting.
How can I reduce bias in my model?
The best way to reduce bias in a model is to increase the model’s complexity. This can be done using different methods like adding more parameters, using a more complex algorithm or introducing more features through feature engineering.