I recently discussed model underfitting and overfitting. Essentially, these two concepts describe different ways that the model can fail to match your data set. Underfitting refers to making a model that’s not complex enough to accurately represent your data and misses trends in the data set. Overfitting refers to a situation where the model is too complex for the data set, and indicates trends in the data set that aren’t actually there.
Another way we can think about these topics is through the terms bias and variance. These two terms are fundamental concepts in data science and represent another way to think about the challenges of model fit. Understanding these two concepts will help you create useful simulation models.
What Is the Bias-Variance Tradeoff?
What Are Model Bias and Variance?
Both terms describe how a model changes as you retrain it using different portions of a given data set. By changing the portion of the data set used to train the model, you can change the functions describing the resulting model. However, models of different structures will respond to new data sets in different ways. Bias and variance describe the two different ways models can respond.
Bias vs. Variance
- Bias describes how well a model matches the training set. A model with high bias won’t match the data set closely, while a model with low bias will match the data set very closely. Bias comes from models that are overly simple and fail to capture the trends present in the data set.
- Variance describes how much a model changes when you train it using different portions of your data set. A model with high variance has the flexibility to match any data set you provide, which may result in dramatically different models each time. Variance comes from models that are highly complex and employ a significant number of features.
Typically models with high bias have low variance, and models with high variance have low bias. This is because the two come from opposite types of models. A model that’s not flexible enough to match a data set correctly (high bias) is also not flexible enough to change dramatically when given a different data set (low variance).
Those who’ve read my previous article on underfitting and overfitting will probably note a lot of similarity between these concepts. Underfit models usually have high bias and low variance. Overfit models usually have high variance and low bias.
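To make these definitions concrete, here’s a minimal sketch in Python using NumPy. Nothing in it comes from a specific project: the sine-shaped `true_function`, the noise level, the sample sizes and the polynomial degrees are all assumptions chosen for illustration. Drawing a fresh sample each round stands in for training on a different portion of a larger data set. The sketch retrains a simple model (a straight line) and a flexible one (a degree-9 polynomial) many times, then measures how far the average prediction sits from the truth (squared bias) and how much the predictions move from one training set to the next (variance).

```python
import numpy as np


def true_function(x):
    # Hypothetical ground truth, chosen only for illustration.
    return np.sin(2 * np.pi * x)


def bias_variance(degree, n_train=30, n_rounds=200, noise=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0.05, 0.95, 50)
    preds = np.empty((n_rounds, x_test.size))
    for i in range(n_rounds):
        # Each round simulates retraining on a different portion of the data.
        x = rng.uniform(0, 1, n_train)
        y = true_function(x) + rng.normal(0, noise, n_train)
        coeffs = np.polyfit(x, y, degree)
        preds[i] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)
    # Squared bias: gap between the average model and the truth.
    bias_sq = float(np.mean((mean_pred - true_function(x_test)) ** 2))
    # Variance: spread of individual models around the average model.
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance


simple = bias_variance(degree=1)     # rigid model
flexible = bias_variance(degree=9)   # flexible model
print(f"degree 1: bias^2={simple[0]:.4f}, variance={simple[1]:.4f}")
print(f"degree 9: bias^2={flexible[0]:.4f}, variance={flexible[1]:.4f}")
```

Under these assumed settings the straight line should show the larger bias and the degree-9 polynomial the larger variance, though the exact numbers depend on the seed and the noise level.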
What’s the Tradeoff Between Bias and Variance?
The bias-variance trade-off is a commonly discussed term in data science. Actions that you take to decrease bias (leading to a better fit to the training data) will simultaneously increase the variance in the model (leading to higher risk of poor predictions). The inverse is also true; actions you take to reduce variance will inherently increase bias.
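One common way to write the trade-off down is the standard decomposition of expected squared error. The notation here is an assumption, not something taken from this article: the data follow $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, $\hat{f}$ is the trained model, and the expectation runs over training sets and noise.

```latex
\mathbb{E}\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Since the noise term is fixed, lowering total error means balancing the first two terms against each other.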
What Can I Do About the Bias-Variance Trade-Off?
Keep in mind increasing variance is not always a bad thing. An underfit model is underfit because it doesn’t have enough variance, which leads to consistently high bias errors. This means when you’re developing a model you need to find the right amount of variance, or the right amount of model complexity. The key is to increase model complexity, thus decreasing bias and increasing variance, until bias has been minimized but before significant variance errors become evident.
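One way to look for that point is to sweep model complexity and watch the error on held-out data. The sketch below is a rough illustration under stated assumptions: the synthetic sine data, the 50/30 train/validation split and the use of polynomial degree as the complexity knob are all hypothetical choices, not a prescribed method.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data set: a noisy sine curve, used only for illustration.
x = rng.uniform(0, 1, 80)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 80)

# Hold out part of the data to detect variance error.
x_train, y_train = x[:50], y[:50]
x_val, y_val = x[50:], y[50:]


def mse(coeffs, xs, ys):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


degrees = list(range(1, 11))
train_err, val_err = [], []
for d in degrees:
    coeffs = np.polyfit(x_train, y_train, d)
    train_err.append(mse(coeffs, x_train, y_train))
    val_err.append(mse(coeffs, x_val, y_val))

# Training error keeps falling as complexity grows; validation error
# stops improving once extra complexity mostly adds variance.
best_degree = degrees[int(np.argmin(val_err))]
print("degree with lowest validation error:", best_degree)
```

Training error alone would always favor the most complex model here, which is why the choice is made on the validation split.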
Another solution is to increase the size of the data set used to train your model. High variance error, the signature of an overfit model, comes from creating a model that’s too complex for the available data set. If you’re able to use more data to train the model, then you can create a model that’s more complex without accidentally adding variance error.
Unfortunately, this trick doesn’t help with reducing bias error. A model with high bias, or an underfit model, is not sensitive to the training data. Therefore, increasing the size of the data set won’t improve the model significantly because the model isn’t able to respond to the change. The solution to high bias is higher variance, which usually means making the model more complex, for example by adding more features.
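The effect of data set size on both kinds of error can be checked with a small experiment. This is a sketch under assumptions: the sine-shaped target, the noise level, the two training sizes (30 and 300) and the helper name `estimate_errors` are all hypothetical, picked only to illustrate the idea.

```python
import numpy as np


def estimate_errors(degree, n_train, n_rounds=200, noise=0.3, seed=1):
    """Return (squared bias, variance) for a polynomial model of the given degree."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0.1, 0.9, 40)
    truth = np.sin(2 * np.pi * x_test)  # hypothetical true relationship
    preds = np.empty((n_rounds, x_test.size))
    for i in range(n_rounds):
        # Retrain on a freshly drawn training set of the requested size.
        x = rng.uniform(0, 1, n_train)
        y = np.sin(2 * np.pi * x) + rng.normal(0, noise, n_train)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x_test)
    bias_sq = float(np.mean((preds.mean(axis=0) - truth) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance


results = {}
for degree in (1, 9):
    for n_train in (30, 300):
        results[(degree, n_train)] = estimate_errors(degree, n_train)
        b, v = results[(degree, n_train)]
        print(f"degree={degree}, n_train={n_train}: bias^2={b:.4f}, variance={v:.4f}")
```

In this setup, the complex model’s variance should shrink as the training set grows, while the straight line’s bias should stay roughly where it was, because no amount of data lets a line bend into a sine curve.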
Where Can I Learn More?
I’ve written a few articles on similar topics. This article discusses the details of underfit and overfit models, and gives related examples. Another article discusses the process of model development, validation and testing that you can use to determine when a model has been well fit to the data set. This process will help you decide if you have the right amount of variance in the model and determine whether or not you need to add more data.
Finally, I learned a lot of my starting data science concepts by reading Joel Grus’s book Data Science from Scratch: First Principles with Python. His book introduces many fundamental data science concepts and provides example code to develop some initial functions in Python.