What Happens When Researchers ‘Clean’ Data?

Although researchers often spend little time discussing data preparation, it has the potential to massively alter a given study’s results. To ensure research remains useful, we need universal standards and better documentation.
Edward Hearn
Expert Columnist
June 4, 2021
Updated: July 13, 2021

Most economists have likely heard an old joke many times from researchers outside the field. It goes: “There are three scientists stranded on a desert island: a physicist, a chemist and an economist. They find a can of food but don’t have any way of opening it. The physicist suggests that they should use leverage to pop the top off of the can. The chemist suggests they build a fire and heat the can, thus blowing the lid off. They both look at the economist, who says, ‘Let’s assume a can opener.’”

When working with non-experimentally collected information, researchers must frequently make assumptions about how best to process, clean and model their data. The statistician Andrew Gelman refers to these decision points, which can occur at any point during the research process, as a “garden of forking paths.” 


Degrees of Discrepancy

A more common term for the decisions, whether explicit or implicit, that researchers make when distilling information into a usable format is “researcher degrees of freedom.” These decisions occur at many points during the research process and mostly go unstated. What’s more, different researchers could reasonably choose different decision paths when faced with the same data. This flexibility can lead multiple researchers to produce radically different results from identical data.

The problem is that most observational data presents too many different decision forks. As a result, researchers have to make too many of their own decisions, which are often isolated from one another. Differences in one or two assumptions during the data processing or analysis phases of research usually won’t result in huge discrepancies in outcome. At scale, though, the sheer number of decisions researchers typically don’t think twice about and hardly ever document has precipitated the current replication crisis across the social sciences.

The rapid expansion of access to common data sources (e.g., the Census, the Bureau of Labor Statistics, the Federal Reserve and others) over the last 15 years has exacerbated this problem. The lack of universal best practices for data reporting, standardization and aggregation has led empirical research to lose credibility. Without solid guidance, researchers must make so many independent assumptions across all levels of the research process that the number of different outcomes quickly overwhelms generalizable empirical insights.


Framing the Problem

Nick Huntington-Klein and other researchers recently found that researcher degrees of freedom led to radically different conclusions in empirical economic analyses. Further, they determined that most of the data preparation and analysis decisions made by independent teams of researchers that drove these different outcomes would never have been reported in the final results. 

Huntington-Klein and his team gave data from two previously published economic studies to seven different replicators. The team also framed the research questions to ensure the replicators could answer the same questions the published works addressed, but in such a way that they wouldn’t recognize the published studies from the data. 

The study found that differences in replicators’ processing and cleaning of the externally generated data led to huge discrepancies in outcomes. No two replicators reported the same sample size; estimate sizes and signs differed across replicators; and the standard deviation of the estimates across the seven replicators was three to four times as large as the standard error that each replicator should have reported individually. This last result indicated that variation in researcher decisions, which likely would have escaped documentation and therefore never been visible to peer reviewers, was the culprit for such huge variation in outcomes.
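The comparison in the last result can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the seven (estimate, standard error) pairs are invented, and `dispersion_ratio` is a hypothetical helper, not the study's own code or data.

```python
from statistics import mean, stdev

# Hypothetical (estimate, standard_error) pairs for seven replicators.
# These numbers are invented for illustration; they are not from the study.
replications = [
    (0.12, 0.03),
    (0.05, 0.03),
    (-0.04, 0.04),
    (0.21, 0.03),
    (0.09, 0.04),
    (-0.10, 0.03),
    (0.17, 0.04),
]

def dispersion_ratio(results):
    """Across-replicator SD of the estimates divided by the mean reported SE."""
    estimates = [est for est, _ in results]
    std_errors = [se for _, se in results]
    return stdev(estimates) / mean(std_errors)

ratio = dispersion_ratio(replications)
print(f"across-replicator SD / mean reported SE = {ratio:.1f}")
```

A ratio well above one suggests that the spread between analysts is larger than any single analyst's reported uncertainty would imply.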

Another team, led by Uri Simonsohn, suggested that researcher degrees of freedom arise from two main sources: ambiguity in data-decision best practices and researchers’ drive to publish “statistically significant” results. As an example, Simonsohn and his co-authors point out 30 works in the same psychology journal that dealt with identical and seemingly simple decisions about what data constitute outliers in reaction times and how researchers should deal with them. Despite facing the same decision, the articles varied widely in how they defined and handled those outliers.

Individual researchers’ decisions weren’t incorrect, but the ambiguity of outlier treatment led to radically divergent results. What’s more, because any decision was seemingly justifiable, researchers had direct incentives to make choices that produced the most eye-catching results.
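To make the outlier ambiguity concrete, here is a minimal sketch that applies three defensible exclusion rules to the same reaction times. The data are synthetic and the thresholds (200 ms, 1,000 ms, two standard deviations) are assumptions chosen for illustration, not rules taken from the articles Simonsohn's team reviewed.

```python
from statistics import mean, stdev

# Synthetic reaction times in milliseconds, invented for illustration only.
reaction_times = [310, 295, 402, 350, 1210, 280, 330, 95, 365, 2400, 340, 315]

def drop_fixed_bounds(times, low=200, high=1000):
    """Drop responses faster than `low` ms or slower than `high` ms."""
    return [t for t in times if low <= t <= high]

def drop_beyond_sd(times, k=2.0):
    """Drop responses more than k standard deviations from the mean."""
    m, s = mean(times), stdev(times)
    return [t for t in times if abs(t - m) <= k * s]

# Three hypothetical but defensible rules applied to identical data.
rules = {
    "keep everything": lambda times: list(times),
    "drop <200 ms or >1000 ms": drop_fixed_bounds,
    "drop beyond 2 SD of mean": drop_beyond_sd,
}

results = {name: rule(reaction_times) for name, rule in rules.items()}
for name, kept in results.items():
    print(f"{name}: n={len(kept)}, mean={mean(kept):.1f} ms")
```

Each rule leaves a different sample size and a different mean, even though no rule is obviously wrong.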

 

Theoretical Variance

The impact of researcher degrees of freedom is not even confined to the realm of empirics. Some years ago, two researchers published a paper that purportedly demonstrated how economists did not fully grasp the concept of opportunity cost (the principle that the cost of a given activity is the forgone benefit of the next best alternative), a fundamental and supposedly straightforward concept in economic decision making. The textbook question the researchers asked of 199 economists was as follows:

Please Circle the Best Answer to the Following Question:

You won a free ticket to see an Eric Clapton concert (which has no resale value). Bob Dylan is performing on the same night and is your next-best alternative activity. Tickets to see Dylan cost $40. On any given day, you would be willing to pay up to $50 to see Dylan. Assume there are no other costs of seeing either performer. Based on this information, what is the opportunity cost of seeing Eric Clapton?

A. $0 B. $10 C. $40 D. $50

The answer, according to the textbook definition, is $10: the $50 benefit of seeing Dylan minus the $40 cost of the Dylan ticket is the net benefit given up by going to see Clapton for free.

As a rebuttal piece showed, however, because no operational standard exists for what constitutes an “opportunity cost,” definitional differences around what is a cost and a benefit in the above question can lead any of the four answer choices to be plausible.

Does the price of the Dylan ticket constitute a benefit because it’s $40 not forgone? What is the monetary value of the Clapton show? Ambiguity in opportunity-cost accounting produced degrees of freedom for a respondent to make different, defensible assumptions about the concert-goer, resulting in divergent answers.
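The arithmetic behind the four answer choices can be laid out side by side. The three figures come from the survey question; the accounting conventions and their labels are hypothetical, chosen here only to show that each listed choice can be reached from the same numbers. They are not taken from the rebuttal piece.

```python
# Figures taken from the survey question in the text.
DYLAN_PRICE = 40    # ticket price for the Dylan concert
DYLAN_VALUE = 50    # the most the respondent would pay to see Dylan
CLAPTON_PRICE = 0   # the Clapton ticket was free

# Hypothetical accounting conventions (assumptions for illustration).
conventions = {
    "out-of-pocket cost of attending Clapton": CLAPTON_PRICE,
    "net benefit of Dylan forgone (textbook answer)": DYLAN_VALUE - DYLAN_PRICE,
    "price of the Dylan ticket": DYLAN_PRICE,
    "gross benefit of Dylan forgone": DYLAN_VALUE,
}

for label, dollars in conventions.items():
    print(f"${dollars:>2}  {label}")
```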

 

Unskewing the Results

Since most researcher degrees-of-freedom divergences result from ambiguities that arise in data, potential solutions center on both relieving researchers of having to make these decisions and requiring them to be specific about the decisions they must make. Both Huntington-Klein and Simonsohn propose that researchers include data appendices that document all variables constructed, whether researchers used them or not.

In these appendices, researchers also need to document every decision about what data to exclude, any modeling decisions that produced non-results and any failed manipulations of data during processing. Greater transparency around the messiness of data processing and estimation, even if it blunts otherwise sharp results, is absolutely necessary to ensure researchers are themselves cognizant of the decisions they’re making during the research process.
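One lightweight way to build such an appendix is to log each decision as the data are processed. The sketch below is an assumption about what that could look like: the `Decision` record, the `render_appendix` helper and the example entries are all hypothetical, and the row counts are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    """One documented data-preparation or modeling choice (hypothetical schema)."""
    step: str
    choice: str
    rows_before: int
    rows_after: int
    used_in_final_model: bool

def render_appendix(decisions):
    """Format the decision log as numbered lines for a data appendix."""
    lines = []
    for i, d in enumerate(decisions, start=1):
        dropped = d.rows_before - d.rows_after
        status = "used" if d.used_in_final_model else "not used"
        lines.append(
            f"{i}. {d.step}: {d.choice} "
            f"(dropped {dropped} rows; {status} in final model)"
        )
    return "\n".join(lines)

# Invented entries, including a choice that was tried and abandoned.
log = [
    Decision("age filter", "keep respondents aged 25 to 54", 10000, 6200, True),
    Decision("wage outliers", "trim top and bottom 1 percent", 6200, 6076, True),
    Decision("alternative wage outliers", "winsorize at 5 percent", 6200, 6200, False),
]

appendix = render_appendix(log)
print(appendix)
```

Recording abandoned choices alongside the ones that survived is the point: the third entry never reaches the final model but still appears in the appendix.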

Another solution is standardization of data-processing procedures and best-practices guidelines. If many researchers use the same data sources, which occurs frequently in non-experimental research, pre-processing of commonly used data or a best-practices guidebook for use of common data minimizes the number of points at which individual researchers must make their own decisions.

The National Bureau of Economic Research’s standardized merging process for the Merged Outgoing Rotation Group Files from the Census Bureau’s Current Population Survey is a good example of this kind of measure. By making the file-matching process and all of the assumptions uniform, the NBER has effectively mitigated researcher variance in how CPS data is combined. This has led to more uniform results by eliminating the need for researchers using CPS data to independently decide how best to combine disparate data files. Pre-processing of common data sources via standard code, guidebooks or both provides an excellent method by which organizations can preclude data-processing ambiguity.

Another solution that could help mitigate noise in estimation arising from the garden of forking paths is to aggregate disparate estimates that address the same question using identical data. Ensemble or model-averaging methods can be useful to ensure that an estimate arising from multiple strands of independent research is more accurate than any individual strand. More importantly, multiple sources of estimates can reveal how much noise arises from researcher flexibility in data decision-making.
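A minimal sketch of this kind of aggregation follows. It assumes each team reports an estimate and a standard error, and it uses inverse-variance weighting as one common pooling choice; that choice treats the estimates as independent, which is a strong assumption when every team starts from the same data. The numbers and function names are hypothetical.

```python
from statistics import mean, stdev

def pooled_estimate(results):
    """Inverse-variance weighted mean of (estimate, standard_error) pairs."""
    weights = [1.0 / se ** 2 for _, se in results]
    weighted_sum = sum(w * est for w, (est, _) in zip(weights, results))
    return weighted_sum / sum(weights)

def summarize(results):
    """Return simple and pooled averages plus the spread across teams."""
    estimates = [est for est, _ in results]
    return {
        "simple_mean": mean(estimates),
        "pooled": pooled_estimate(results),
        "spread_across_teams": stdev(estimates),
    }

# Invented (estimate, standard_error) pairs from three hypothetical teams.
team_results = [(0.10, 0.02), (0.30, 0.04), (0.02, 0.04)]
summary = summarize(team_results)
print(summary)
```

The spread across teams is the part that speaks to researcher flexibility: it measures disagreement that no single team's standard error captures.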

Daniel Kahneman, Olivier Sibony and Cass Sunstein’s book Noise suggests organizations worried about consistency in judgement and decision making carry out a “noise audit.” This process, much like the Huntington-Klein team’s work, involves soliciting decisions on a pre-selected scenario from expert decision makers multiple times. The audit tests how closely each individual’s average decision resembles the overall average of all decision makers. It also establishes how close each decision maker’s individual decisions are to their own overall average. In other words, does the decision maker make the same or similar degrees-of-freedom choices in near-identical circumstances?
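The two quantities described above can be computed with a short script. This is a rough sketch rather than the procedure from the book: the analysts, their repeated decisions and the `noise_audit` function are all hypothetical.

```python
from statistics import mean, stdev

def noise_audit(decisions_by_judge):
    """For each judge, report the gap between their average decision and the
    overall average, and the spread of their own repeated decisions."""
    judge_means = {
        judge: mean(values) for judge, values in decisions_by_judge.items()
    }
    overall_mean = mean(list(judge_means.values()))
    return {
        judge: {
            "gap_from_overall": judge_means[judge] - overall_mean,
            "within_judge_sd": stdev(values),
        }
        for judge, values in decisions_by_judge.items()
    }

# Invented repeated decisions on the same scenario (for example, the number
# of observations each analyst chose to exclude on three occasions).
decisions = {
    "analyst_a": [10, 12, 11],
    "analyst_b": [20, 20, 20],
    "analyst_c": [14, 18, 16],
}

audit = noise_audit(decisions)
for judge, stats in audit.items():
    print(judge, stats)
```

A large gap from the overall average points to disagreement between decision makers, while a large within-judge spread points to one person being inconsistent with themselves.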

As Kahneman, Sibony and Sunstein state, “Judgement is difficult because the world is highly complex and uncertain.” This is doubly true of data collected from uncontrolled, real-world processes. Luckily, the authors also conclude that noise is discoverable and reducible using set rules and guidelines. 

This maxim certainly holds true when it comes to empirical research. Noise arising from different assumptions researchers must make during the research process can be offset with transparency, standardization and, to a lesser extent, aggregation. Researchers must recognize how many assumptions they make and how much those assumptions can affect their final results. Otherwise, as Kahneman, Sibony and Sunstein warn, “Noise is inconsistency, and inconsistency damages the credibility of the system.”
