We’ve cleaned our data, engineered our features, tuned our parameters via grid search and optimization, and the model tested well on our hold-out set. All done, right? Not so fast: model development is only one small part of the data science process. Model development is to running our model in the wild as practice is to playing in a real basketball game. No matter how much we prepare, when our model goes live and starts making predictions in the real world, it will inevitably run into situations it is ill-prepared for.

That is when data scientists (and all model builders) face a dilemma. Do we keep faith in the model’s predictions, or do we turn it off?
 

It Depends

The answer, as always, is: “It depends.” Here are a few factors to keep in mind when trying to decide whether to turn off an ill-performing model.

 

How Honest Was Your Training/Tuning Process?

In the business world, there is tremendous pressure to deliver results. No one wants to tell their boss, “None of the features have any predictive power, and all the correlations look to be spurious.” But sometimes that is the truth. In data science and especially the niche of it known as data mining, correlations are truly a dime a dozen. If we look hard enough we will find things that seem to predict our target variable, or, in other words, noise masquerading as signal.

Let’s do an experiment. If we sampled a bunch of random variables that had zero correlation to our target, how likely is it that some of them would look predictive in our sample? I ran this analysis and found that (as expected) the more random variables we included in our data set, the more of them appeared predictive. With just 20 features, none had a significant correlation with the target (defined here as an absolute correlation above 0.4). With 500 features, however, eight features cleared that bar. And if we relax our standards a bit, 25 features had absolute correlations of 0.3 or more with the target. Seeing this, we could fool ourselves into thinking we have the building blocks for a deployable model when we really do not.
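Here’s a minimal sketch of that experiment. I’m assuming a sample size of 30 here; the exact counts will vary with the sample size and random seed, but the pattern holds: the more noise features you screen, the more spurious hits you find.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducibility

n_samples = 30                   # assumed sample size; vary it and the counts change
target = rng.normal(size=n_samples)

for n_features in (20, 500):
    # Every feature is pure noise, independent of the target by construction
    features = rng.normal(size=(n_features, n_samples))
    corrs = np.array([np.corrcoef(f, target)[0, 1] for f in features])
    print(f"{n_features:>3} features: "
          f"{np.sum(np.abs(corrs) > 0.4)} with |corr| > 0.4, "
          f"{np.sum(np.abs(corrs) > 0.3)} with |corr| > 0.3")
```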

Models built blindly, with zero thought put into the key relationships that drive the results, run a much higher risk of being overfit. An overfit model is one that has latched on to noise in its training data, mistaking chance patterns between variables for real relationships. These models may even work for a time due to random chance, but at some point our luck will run out; the spurious correlations and noise will start to dominate, rendering the results inaccurate and useless. Thus, applying intuition to the relationships our model relies on is critical, both for deciding whether to take a model into production in the first place and for deciding whether to keep faith in an underperforming one.

 

How Significant Is Our Edge?

In the business world, we are always looking for a competitive edge. It’s like that old poker adage: “Look around the poker table; if you can’t see the sucker, you’re it.”

We should adopt the same approach when building models. In my opinion, there are two broad categories of models:

Type 1: Models that attempt to forecast the future (usually time series). These models are generally characterized by scarce and sometimes dirty data.

Type 2: Models that attempt to predict future behavior via an analysis of a sample that is deemed to be representative of the population.

While both model types try to predict something, Type 2 models tend to be better behaved. For example, Amazon has a huge user base that includes all types of people. So when Amazon builds user behavior models, as long as it’s taken a robust sample of its users, it can be reasonably sure that the data and any predictions derived from it will generalize well when predicting the behavior of new, never-before-seen users. In this case, Amazon’s edge is the breadth and depth of data that it has on its users, and it’s a huge one.

Type 1 models are tougher. The future and its numerous possibilities are inherently hard to forecast. What makes this modeling even harder is that we attempt it with very limited amounts of historical data. Finally, our model predictions assume that the conditions and relationships between variables that held in the past continue to hold today. Because of all these difficulties, Type 1 models usually have little to no edge (besides proprietary data, if we’re lucky enough to have it). Faith in these models should be, at most, weakly held; their outputs should be viewed skeptically.

Importantly, just because companies like Amazon possess amazing and proprietary data sets doesn’t mean they don’t have Type 1 models. Even with all its data, I would wager that Amazon has almost as much trouble predicting next year’s GDP growth or the timing of the next recession as you or I would.

 

How Many Coins Are We Really Flipping?

To borrow a term from finance, we want our model to be a suite of uncorrelated alpha streams. An alpha stream is a source of outperformance relative to a benchmark or, in other words, an unfair coin.

Would you rather toss one unfair coin (say it has a 60 percent chance of coming up heads) once to win $1,000 if it comes up heads, or toss the same unfair coin 1,000 times and receive $1 for each head? Both bets have the same expected value of $600, but tossing many times has a much lower standard deviation around the possible outcomes. If you toss the coin just once, there’s a 40 percent chance you end up with nothing. If you toss it 1,000 times, there’s essentially zero chance you come away with nothing (though this comes at the expense of some upside, because it’s also very unlikely that you would walk away with the entire $1,000).
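A quick simulation makes the gap concrete. This sketch mirrors the bet described above:

```python
import numpy as np

rng = np.random.default_rng(0)
p, trials = 0.6, 100_000  # 60 percent coin; number of simulated runs

# One $1,000 all-or-nothing toss vs. 1,000 independent $1 tosses
single = 1000 * rng.binomial(1, p, size=trials)
many = rng.binomial(1000, p, size=trials)

for name, payoff in (("single toss", single), ("1,000 tosses", many)):
    print(f"{name}: mean ${payoff.mean():.2f}, std ${payoff.std():.2f}")

# Analytically, both means are $600, but the standard deviations differ:
#   single toss:  1000 * sqrt(0.6 * 0.4) ~ $489.90
#   1,000 tosses: sqrt(1000 * 0.6 * 0.4) ~ $15.49
```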

When you can get the same expected return for less volatility, you take it. Our model should ideally be the same: a portfolio of unfair coins whose outcomes are uncorrelated (independent of each other). If we have a robust model, then we should trust its ability to generalize and adapt.
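To see why the “uncorrelated” part matters, here’s a rough sketch that approximates each coin’s $1 payoff as Gaussian and compares an uncorrelated portfolio against one where every pair of streams shares a hypothetical correlation of 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
n_streams, trials = 1000, 10_000
var = 0.6 * 0.4  # variance of a single $1 bet on the 60 percent coin

for rho in (0.0, 0.5):
    # One-factor model: a shared shock plus an idiosyncratic shock gives
    # every pair of streams the same pairwise correlation rho.
    shared = rng.normal(size=(trials, 1))
    idio = rng.normal(size=(trials, n_streams))
    payoffs = 0.6 + np.sqrt(var) * (np.sqrt(rho) * shared
                                    + np.sqrt(1 - rho) * idio)
    total = payoffs.sum(axis=1)
    print(f"rho = {rho}: mean ${total.mean():.0f}, std ${total.std():.2f}")
```

With even moderate correlation, the diversification benefit largely evaporates: the portfolio’s standard deviation jumps from roughly $15 to roughly $350 while the expected value stays at $600.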

However, if our model is instead like the single toss of the coin, then we should be searching for more alpha streams to add. And if we can’t find any, then we need to prepare carefully for the volatility, shrink our bets, or turn it off.

 

Has The Regime Changed?

Last but definitely not least, has the environment changed so much that our model is rendered useless? If our model is trained on data from normal economic times, then we shouldn’t expect its predictions to be accurate during a recession. If the currently prevailing conditions are very different from those that the model was trained on, then it’s time to turn the model off.
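There’s no single right way to detect a regime change, but one simple approach is to compare a feature’s recent live distribution against its training distribution with a two-sample test. Here’s a sketch using scipy’s ks_2samp; the data and alert threshold below are made up for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical data: a feature as it looked during training vs. in recent
# live traffic, where the live distribution has drifted.
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
live_feature = rng.normal(loc=0.8, scale=1.3, size=500)

# Two-sample Kolmogorov-Smirnov test: a tiny p-value says the live data
# no longer looks like the data the model was trained on.
res = ks_2samp(train_feature, live_feature)
print(f"KS statistic: {res.statistic:.3f}, p-value: {res.pvalue:.2e}")

ALPHA = 0.01  # arbitrary alert threshold for this sketch
if res.pvalue < ALPHA:
    print("Distribution shift detected -- review or retire the model.")
```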

 

Conclusion

Being data driven doesn’t mean blindly following every decision made by a quantitative model. Rather, it means supplementing logic and business sense with models and data. A big part of that is understanding the model and knowing under what situations it is likely to become unreliable.
