In 2017, Andrew Ng, a widely recognized expert in machine learning, helped publish a paper in which he and his team used a deep learning model to detect pneumonia from chest X-ray images. In the initial publication, they inadvertently reported overly optimistic results because they didn’t properly account for the fact that some patients appeared more than once in the data set (several had more than one X-ray available). Although the researchers corrected the issue after Nick Roberts pointed it out, it goes to show that even experts and trained practitioners can fall victim to one of the biggest challenges in applied machine learning: leakage.
In essence, data leakage (referred to simply as leakage from this point on) refers to flaws in a machine learning pipeline that lead to overly optimistic results. Although leakage can be hard to detect, even for experts, “too good to be true” performance is often a dead giveaway! Leakage is also a broad term, but Sayash Kapoor and Arvind Narayanan have created a taxonomy of several different types, as well as the model info sheet, to help avoid leakage in practice.
In this article, we’ll briefly discuss several types of leakage. And though Kapoor and Narayanan’s taxonomy is based on a survey of applied machine learning in the social sciences, leakage is widespread across industry and academia alike.
What Is a Model Info Sheet?
Introduced by Sayash Kapoor and Arvind Narayanan, model info sheets provide data scientists with a solid template for making precise arguments needed to justify the absence of leakage in their machine learning pipelines.
Leakage, Leakage Everywhere
Leakage can happen at any point in a machine learning pipeline: during data collection and preparation, data preprocessing, model training, and/or model validation and deployment. Below are some common examples we often see in practice.
Data Preparation
The error at this step entails collecting and using features that won’t be available in deployment. For instance, using information on why a patient was readmitted to the hospital within 30 days of release is not useful for predicting the likelihood that a new patient will be readmitted (i.e., because that information will be unknown for new patients!).
Preprocessing
Errors here result from not maintaining proper separation between training and validation samples. A classic example is applying feature selection (or any other preprocessing technique) to the data before splitting it into training and validation samples.
Grouping Problems
This happens when you ignore the grouping or temporal nature of the data when splitting it into training and validation samples. This is similar to the issue mentioned earlier regarding Andrew Ng and his coauthors.
A Taxonomy of Leakage
Now, we will describe in more detail Kapoor and Narayanan’s general taxonomy of leakage.
Issues Separating Training and Validation Data
If the training data interacts with the validation data (or other evaluation data, such as the out-of-fold samples in k-fold cross-validation) during model training, it will result in leakage. This is because the model has access to information in the evaluation data before the model is actually evaluated.
This is arguably the most common type of leakage we’ve personally seen in practice. Some specific examples include the following.
Preprocessing the Full Data Set
Doing steps like missing value imputation on the full data set causes leakage. For example, if you’re performing a simple mean imputation, the sample means should be computed only from the training set, then used to impute missing values in all other data partitions (e.g., evaluation and holdout samples). Random over- and undersampling for imbalanced data is another preprocessing step where leakage typically occurs.
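Here’s a minimal sketch of the non-leaky workflow, assuming scikit-learn and a tiny made-up data set: the imputation means are learned from the training split only and then reused on the holdout split.

```python
# Leakage-free mean imputation: statistics come from the training split only.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
y = np.array([0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

imputer = SimpleImputer(strategy="mean")
X_train_imp = imputer.fit_transform(X_train)  # means computed from the training data only
X_test_imp = imputer.transform(X_test)        # the same means reused on the holdout data

# Leaky version (avoid): imputer.fit_transform(X) on the full data set before splitting.
```

The same pattern applies to over- and undersampling: resample only the training portion, never the evaluation data.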
Feature Selection
Performing feature selection on the same data used for training and evaluation is a big statistical no-no. In particular, selecting features using all available data before splitting into train and evaluation sets leaks information about what features work on the evaluation data. The simplest solution is to use a separate, independent sample for feature selection, then throw the sample away when done. It gets more complicated with cross-validation, and experts encourage the use of pipelines.
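Here’s a minimal sketch of how a pipeline handles this under cross-validation, assuming scikit-learn and synthetic data: because the selector and the model are wrapped together, the feature scores are recomputed on each training fold only, never on the held-out fold.

```python
# Leakage-free feature selection inside cross-validation via a Pipeline.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=100, n_informative=5, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),       # refit on each training fold
    ("model", LogisticRegression(max_iter=1000)),
])

# Each of the 5 folds refits the selector on its own training portion,
# so nothing about the held-out fold leaks into feature selection.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```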
Duplicates in the Data
Having experimental units (e.g., households or patients) with multiple records that appear in both the training and evaluation samples will cause leakage. The pneumonia X-ray example discussed earlier contained patients, some of whom had multiple chest X-rays taken over time. Including X-rays from the same patients in both training and evaluation is a form of leakage and can lead to overly optimistic results. A simple solution is to use some form of group-based partitioning (e.g., grouped k-fold cross-validation) to make sure no group is split across training and evaluation.
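A minimal sketch of group-aware splitting, assuming scikit-learn and synthetic patient IDs: GroupKFold keeps every record from a given patient on one side of each train/evaluation split.

```python
# Group-based cross-validation so repeated records from the same patient
# never appear in both training and evaluation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)
patient_id = rng.integers(0, 50, size=200)   # roughly 4 records per patient

cv = GroupKFold(n_splits=5)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         groups=patient_id, cv=cv)
print(scores.mean())
```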
Here’s another subtle example. Suppose you want to normalize the numeric predictor variables in your data set prior to training a model. Such a step is common when using algorithms like regularized regression and neural networks. A common normalization technique is to transform each numeric variable by subtracting its mean and dividing by its standard deviation.
If, however, the means and standard deviations are computed from the entire data set (i.e., before splitting into training and evaluation samples), leakage will occur. The proper approach is to compute the means and standard deviations from the training data alone, then use those statistics to normalize all of the evaluation samples. This is a common but subtle example of leakage.
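A minimal sketch of this, assuming scikit-learn and synthetic data: the scaler is fit on the training split and merely applied to the evaluation split.

```python
# Leakage-free normalization: means and standard deviations come from training data only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(1).normal(loc=10.0, scale=3.0, size=(100, 4))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=1)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # statistics estimated from the training split
X_test_std = scaler.transform(X_test)        # the same statistics applied to the evaluation split

# Leaky version (avoid): StandardScaler().fit(X) on the full data set before splitting.
```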
Other situations are much harder to detect without proper context around each of the predictor variables and complete documentation of the data set construction and machine learning pipeline steps.
Model Uses ‘Illegal Features’
If the model has access to features that would not be available at the time of deployment, like proxies of the target variable, it can cause leakage. For example, suppose you’re modeling whether or not a customer redeemed a particular offer they were sent. One of the variables in the constructed data set might be date_redeemed, the date on which the customer redeemed the offer. Though this feature would perfectly predict the binary outcome in question, it would be entirely inappropriate to use in a production model since we wouldn’t know this information for future, unsent offers and new customers.
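As a rough illustration (the column names below are hypothetical, following the offer example above), one simple safeguard is to explicitly list and drop the columns that won’t exist at prediction time before any training happens.

```python
# Screening out "illegal" features that are only known after the outcome occurs.
import pandas as pd

offers = pd.DataFrame({
    "customer_id":   [1, 2, 3, 4],
    "offer_value":   [5.0, 10.0, 5.0, 2.5],
    "date_sent":     pd.to_datetime(["2023-01-02"] * 4),
    "date_redeemed": pd.to_datetime(["2023-01-05", None, "2023-01-09", None]),
})
offers["redeemed"] = offers["date_redeemed"].notna().astype(int)  # the target

# date_redeemed is a perfect proxy for the target and is unknown for future,
# unsent offers, so it must be excluded from the model's feature set.
unavailable_at_prediction_time = ["date_redeemed"]
features = offers.drop(columns=unavailable_at_prediction_time + ["redeemed", "customer_id"])
```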
Test Set Isn’t Drawn From the Distribution of Interest
If the evaluation data set has a distribution that does not match the distribution of scientific interest, it constitutes leakage. This includes issues like temporal leakage in time series and forecasting (e.g., using future data to predict the past), dependencies between training and evaluation data, and sampling bias in the evaluation sample. Some more specific examples are listed below.
Temporal Leakage
Future data leaks into the training set if the evaluation sample contains records that occur earlier in time than some of the training instances. For example, this could happen if you use future stock prices to train a model to predict historical prices. This issue is common in time-series applications, where we need to be sure the model is evaluated in a setting where only historical data is used to predict the future.
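A minimal sketch of time-ordered evaluation, assuming scikit-learn and a synthetic series: TimeSeriesSplit guarantees that every training window precedes its evaluation window.

```python
# Time-ordered cross-validation: the model is always trained on the past
# and evaluated on the future, never the other way around.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

n = 300
t = np.arange(n)
X = np.column_stack([t, np.sin(2 * np.pi * t / 30)])   # simple time-based features
y = 0.1 * t + np.sin(2 * np.pi * t / 30) + np.random.default_rng(2).normal(scale=0.2, size=n)

for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    print(train_idx.max() < test_idx.min(),             # training data always precedes test data
          round(model.score(X[test_idx], y[test_idx]), 3))
```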
Sampling Bias
Using an evaluation sample that is not representative of the real distribution of interest will cause leakage. For example, evaluating a model on only one demographic group or geographic area and using it to make predictions outside of that population.
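One rough way to surface this, sketched below with synthetic data and hypothetical group labels, is to report evaluation metrics separately for each subpopulation the model will actually score in deployment.

```python
# Checking that evaluation performance holds across the subpopulations of interest,
# not just one convenient group.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 4))
region = rng.choice(["north", "south", "west"], size=1000)
y = (X[:, 0] + (region == "south") * 0.8 + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te, reg_tr, reg_te = train_test_split(
    X, y, region, test_size=0.3, random_state=3, stratify=region)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Report accuracy separately for each region represented in deployment.
results = pd.DataFrame({"region": reg_te, "correct": model.predict(X_te) == y_te})
print(results.groupby("region")["correct"].mean())
```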
Model Info Sheets: A Solution to Leakage
Since even experts in machine learning can be surprised by subtle forms of leakage, you can imagine how much riskier it is when less experienced data scientists are tasked with building potentially very complex machine learning pipelines. To that end, Kapoor and Narayanan introduced the idea of model info sheets. These are a fantastic concept our organization has adopted as part of the internal review process required before machine learning models reach production. In short, model info sheets provide data scientists with a solid template for making the precise arguments needed to justify the absence of leakage in their machine learning pipelines. For example, the template asks that all preprocessing steps be listed in detail, along with each feature used in the model and a justification for its legitimacy.
Key Takeaways
Avoiding leakage is one of the biggest challenges data scientists face when building machine learning pipelines. Leakage tends to result in overly optimistic results. And while such errors can often be caught post-production with appropriate use of MLOps (e.g., model monitoring), it is risky and expensive for a leaky pipeline to make it into production. Though some forms of leakage are commonly known, others are more subtle, and even experts fall victim to them. To that end, we’ve found Kapoor and Narayanan’s model info sheets to provide the best defense against leakage in applied practice.