In a previous article we discussed how to identify underfitting and overfitting, how these phenomena can lead to models that don’t match the available data and how to identify models that do fit the data well. These concepts can help you avoid major blunders and generate models that fit the data reasonably accurately. Now it’s time to think beyond accuracy and focus on precision. In this article, we’ll work to identify which of the possible models is the best fit for your data.
Model Validation and Testing
You cannot trust a model you’ve developed simply because it fits the training data well. The reason for this is simple: You forced the model to fit the training data!
The solution: model validation. Validation uses your model to predict the output in situations outside your training data, then calculates the same statistical measures of fit on those predictions. This means you need to divide your data into two separate data sets. The first is a training data set, which you use to generate your model, while the second is a validation data set, which you use to check your model’s accuracy against data you didn’t use to train the model.
7 Steps to Model Development, Validation and Testing
- Create the training, validation and testing data sets.
- Use the training data set to develop your model.
- Compute statistical values identifying the model development performance.
- Calculate the model results for the data points in the validation data set.
- Compute statistical values comparing the model results to the validation data.
- Calculate the model results for the data points in the testing data set.
- Compute statistical values comparing the model results to the test data.
Let’s say you’re creating multiple models for a project. The natural choice is to select the model that most accurately fits your validation data and move on. However, now we have another potential pitfall. Simply because a model closely matches the validation data doesn’t mean the model matches reality. Because you picked the winner using the validation data, its validation score is likely to be somewhat optimistic: with enough candidate models, one can match the validation set partly by chance. While the model in question performs best in this particular test, it could still be wrong.
The final step, and ultimate solution to the problem, is to compare the model that performed best in the validation stage against a third data set: the test data. This test data is, again, a subset of the data from the original data source. It consists only of points that were used in neither the model’s development nor its validation. We consider a model ready for use only after we compare it against the test data and the statistical calculations show a satisfactory match.
Model Development, Validation and Testing: Step-by-Step
This process breaks down into seven steps.
1. Create the Training, Validation and Testing Data Sets
To start off, you have a single, large data set. Remember: You need to break it up into three separate data sets, each of which you’ll use for only one phase of the project. When you’re creating the data sets, make sure each one contains a mixture of data points at the high and low extremes, as well as in the middle of each variable range. This helps the model stay accurate across the full range of each variable and lets the validation and test sets check that whole range. Also, make sure most of the data is in the training data set. The model can only be as accurate as the data set used to create it, and more training data generally improves the chance of an accurate model.
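As a rough illustration of the split described above, here is a minimal sketch using NumPy. The function name split_data, the 60/20/20 proportions, the fixed seed and the synthetic data are all assumptions made for this example, not values prescribed by the process. A random shuffle tends to spread high, middle and low values across the three sets, but it does not guarantee it, so it is worth checking the range of each set afterward.

```python
import numpy as np

def split_data(x, y, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the rows, then cut them into training, validation and test sets.

    The fractions are illustrative assumptions; whatever is left after the
    training and validation cuts becomes the test set.
    """
    x = np.asarray(x)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))          # random ordering of row indices
    n_train = int(train_frac * len(x))
    n_val = int(val_frac * len(x))
    train_idx = order[:n_train]
    val_idx = order[n_train:n_train + n_val]
    test_idx = order[n_train + n_val:]
    return (
        (x[train_idx], y[train_idx]),
        (x[val_idx], y[val_idx]),
        (x[test_idx], y[test_idx]),
    )

# Synthetic data, used only to demonstrate the split.
x = np.linspace(0.0, 10.0, 100)
y = 2.0 * x + 1.0
train, val, test = split_data(x, y)

# Quick check that each set covers a broad part of the variable range.
for name, (xs, _) in (("train", train), ("validation", val), ("test", test)):
    print(name, len(xs), float(xs.min()), float(xs.max()))
```

Keeping the largest share in the training set follows the advice above; the exact proportions are a judgment call that depends on how much data you have.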
2. Use the Training Data Set to Develop Your Model
Input the data set into your model development script to develop the model of your choice. There are several different models you could develop depending on the data sources available and questions you need to answer. (You can find more information on the types of models in Data Science from Scratch.) In this phase, you’ll want to create several different models of different structures, or several regression models of different orders. In other words, generate any model that you think may perform well.
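To make the idea of "several regression models of different orders" concrete, the sketch below fits polynomial regressions of a few orders to one training set with NumPy’s polyfit. The synthetic quadratic data and the particular list of orders are assumptions chosen for illustration; your own candidates depend on your data and the question you need to answer.

```python
import numpy as np

# Synthetic training data (an assumption for this example):
# a quadratic trend plus random noise.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 10.0, 60)
y_train = 0.5 * x_train**2 - x_train + 3.0 + rng.normal(0.0, 1.0, x_train.size)

# Fit one polynomial regression model per candidate order.
candidate_orders = [1, 2, 3, 5]
models = {
    order: np.polyfit(x_train, y_train, deg=order)
    for order in candidate_orders
}

# Each model is an array of coefficients, highest power first.
for order, coeffs in models.items():
    print(order, np.round(coeffs, 3))
```

At this stage nothing is ruled out; the later steps decide which of these candidates is worth keeping.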
3. Compute Statistical Values Identifying the Model Development Performance
Once you’ve developed your models, you need to compare them to the training data you used. Higher-performing models will fit the data better than lower-performing models. To do this, you need to calculate statistical values designed for this purpose. For instance, a common way to check the performance of a regression model is to calculate the r² value, which measures the fraction of the variation in the data that the model explains.
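For reference, r² can be computed directly from the real and predicted values as one minus the ratio of the residual sum of squares to the total sum of squares. The helper below is a minimal sketch of that standard definition; the function name r_squared and the toy numbers are assumptions for this example.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

# A perfect fit scores 1; always predicting the mean scores 0.
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```

The same function works unchanged on training, validation or test data, which is what makes the scores from the different phases comparable.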
4. Calculate the Model Results for the Data Points in the Validation Data Set
In this step, you’ll use the inputs from the validation data set to generate predictions from each model. Once complete, you have both the real values (from the data set) and the predicted values (from the model) for every point. This allows you to compare the performance of the different models on the validation data set.
5. Compute Statistical Values Comparing the Model Results to the Validation Data
Now that you have the data value and the model prediction for every instance in the validation data set, you can calculate the same statistical values as before and compare the model predictions to the validation data set. This is a key part of the process.
The first statistical calculations identified how well the model fit the data set you forced it to fit. In this case, you’re ensuring the model is capable of matching a separate data set, one that had no impact on the model development. Complete your statistical calculations of choice on each model, then choose the model with the highest performance.
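Steps 4 and 5 together might look something like the sketch below: each candidate model predicts the validation points, each set of predictions is scored with r², and the highest score wins. The synthetic data, the 120/40 split, the candidate orders and the r_squared helper defined here are all assumptions made for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data (an assumption): a quadratic trend plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 200)
y = 0.5 * x**2 - x + 3.0 + rng.normal(0.0, 1.0, x.size)

# The points were drawn in random order, so slicing gives random subsets.
x_train, y_train = x[:120], y[:120]
x_val, y_val = x[120:160], y[120:160]

# Candidate models: polynomial regressions fitted on the training set only.
models = {
    order: np.polyfit(x_train, y_train, deg=order)
    for order in (1, 2, 3, 5)
}

# Step 4: predict the validation points. Step 5: score each model on them.
val_scores = {
    order: r_squared(y_val, np.polyval(coeffs, x_val))
    for order, coeffs in models.items()
}

# Keep the model with the highest validation score.
best_order = max(val_scores, key=val_scores.get)
print(val_scores)
print("best order on validation data:", best_order)
```

Note that the validation points never touch polyfit; they are only used for scoring, which is what keeps this comparison separate from the training fit.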
6. Calculate the Model Results for the Data Points in the Testing Data Set
Use the test data set as input for the model to generate predictions. Only perform this task using the highest-performing model from the validation phase. Once you complete this step, you’ll have both the real values and the model’s corresponding predictions for each input data instance in the test data set.
7. Compute Statistical Values Comparing the Model Results to the Test Data
For the final time, perform your chosen statistical calculations comparing the model’s predictions to the data set. In this case you only have one model, so you aren’t searching for the best fit. Instead, you’re checking to ensure your model fits the test data set closely enough to be satisfactory.
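The final check in steps 6 and 7 can be sketched as follows: the single chosen model predicts the held-out test points once, and the resulting r² is compared with a threshold you consider satisfactory. The synthetic data, the order-2 model standing in for the validation winner, and the 0.9 threshold (MIN_ACCEPTABLE_R2) are all assumptions for this example; what counts as "satisfactory" depends on your project.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data (an assumption): a quadratic trend plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 200)
y = 0.5 * x**2 - x + 3.0 + rng.normal(0.0, 1.0, x.size)

x_train, y_train = x[:120], y[:120]   # used to fit the model
x_test, y_test = x[160:], y[160:]     # never used for fitting or selection

# Assume an order-2 polynomial won the validation comparison.
best_model = np.polyfit(x_train, y_train, deg=2)

# Step 6: predict the test points. Step 7: score the predictions.
test_predictions = np.polyval(best_model, x_test)
test_r2 = r_squared(y_test, test_predictions)

MIN_ACCEPTABLE_R2 = 0.9  # hypothetical threshold, chosen for illustration
model_is_ready = bool(test_r2 >= MIN_ACCEPTABLE_R2)
print("test r²:", round(float(test_r2), 4), "ready:", model_is_ready)
```

Because only one model reaches this stage, the test score is a pass-or-fail check rather than another round of model selection.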
Once you’ve developed a model that satisfactorily matches the test data set, you’re ready to start generating predictions. Don’t assume this means you’re done with model development completely, though; there’s a good chance you’ll eventually decide you need to tweak your model as new data becomes available.