The British economist Charles Goodhart coined a famous saying in 1975 to describe British monetary policy. Goodhart’s Law, as it came to be known, states in his own words that “any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” Thankfully, the anthropologist Marilyn Strathern translated the original text from Economese into this pithy dictum:
“When a measure becomes a target, it ceases to be a good measure.”
This saying is truer today than ever. Over the past 15 years, the costs of data collection have plummeted while computer storage and processing capacity have exploded in line with Moore’s Law, giving Goodhart’s original insight tremendous relevance in a data-driven world. When collected data is used to allocate punishment and reward, the insights drawn from that data are likely to be overhyped, misleading, or fraudulent. Why? Because, given the incentive to do so, humans in organizations respond to metric collection in ways that fundamentally alter the metrics’ meanings.
High-Stakes Testing Data
A famous recent example of replacing administrative judgment with standardized numerical performance measures, or what Jerry Muller calls “metric fixation,” is school systems’ use of “value-added” testing of teacher and school performance. This testing regimen purports to fairly capture the effectiveness of the inputs into a pupil’s in-class performance while controlling for all of the factors affecting that student’s test performance that lie outside the teacher’s and school’s control. The practice is also referred to as “high-stakes” testing because, in many districts, administrators use test results to evaluate employees for promotions, pay raises, and other financial outlays.
Despite their data-driven nature, these practices can undermine the diagnostic value of testing for teachers and schools, who use test results to design or tailor curricula, lessons, and class-time allocations to better serve their students. When employment decisions, pay incentives, and other rewards and punishments are tied to testing outcomes, teachers and schools face clear incentives to devote a large share, even a majority, of classroom time to teaching test-taking strategies rather than subject material. Even within the subject material that tests do cover, most test questions elide deeper concepts in favor of easily quantifiable, multiple-choice-style question formats. This surface knowledge is easy to measure (and punish or reward) in the short run, but it comes at the expense of long-term learning: most students forget algebra and geometry within five years of graduating high school and almost all the math they learned within 25 years. Worse, privileging high-stakes testing outcomes can also lead to adverse outcomes such as “creaming,” the practice whereby weaker pupils are classified as disabled so that they can be removed from the pool of tested students, or even to outright fraud.
CompStat Metrics
Another area where basing rewards and punishments on data collection has warped institutional incentives and produced perverse outcomes is policing. In 1994, the New York Police Department introduced CompStat (short for Compare Statistics) as a means of tracking crime patterns geographically. The NYPD sought to use CompStat data to proportionally allocate scarce police resources to the most troubled areas of the city. The program was initially successful and helped dramatically lower New York’s violent crime rate in the mid-1990s. Over time, however, CompStat metrics began to be used as performance measures in officer pay, promotion, and dismissal decisions.
What’s worse, the city government began pressuring the NYPD for crime reductions on the basis of CompStat data, making the department’s incentives to fudge the numbers obvious. The department would either downgrade serious crimes to minor offenses to lower major crime rates or overpolice minor, easily arrestable infractions to drive arrest statistics up. The latter tactic worked because every arrest, from a minor drug-possession charge to the capture of a known violent felon, was weighted exactly the same in CompStat’s arrest counts. As in the educational example, measurements of crime themselves became the objects by which people judged the success or failure of police and political leaders, a shift that inevitably undermined CompStat’s original purpose: telling the NYPD which areas most needed police resources.
Military and Political Data
The United States military has also fallen victim to Goodhart’s Law. Information collected by military personnel can be critically important in understanding the large-scale impacts of military interventions, and good data on operational and environmental conditions provides crucial feedback about which strategies and tactics are more or less effective during a conflict. When these metrics become the standards by which combat commands are judged, however, the same perverse incentives to skew data in order to secure favorable promotions or assignments undermine the metrics’ informational content.
During the Vietnam War, body count was one such metric. The number of Vietnamese dead was the most prominent statistic the military collected because of its usefulness in convincing the American public that the U.S. was prevailing. Body count was U.S. Secretary of Defense Robert McNamara’s prized metric, even though most field commanders did not trust its validity as an index of battlefield success. Despite their objections, body count weighed heavily in promotion decisions and military assignments throughout the conflict. Even more perverse, American soldiers were sometimes killed while attempting to count Vietnamese casualties after battles in an effort to bolster the figures.
In The Pentagon and the Art of War, Edward Luttwak argued that statistics like body counts, battle incidents, and other non-territorial measures of military output were useless in determining whether campaigns were ultimately successful. The only variables that mattered to eventual military victory were non-measurable, Luttwak suggested, because it is impossible to quantify an enemy’s willingness to fight. He further argued that the short-term focus on quantifiable military metrics came at the cost of the long-term strategic thinking required for victory.
Another field in which short-term metric fixation undermines long-term (and frequently impossible-to-quantify) success is international development. The programs most likely to receive the lion’s share of foreign aid are the ones most easily analyzed by watchdog agencies like the U.S. Office of Management and Budget or the Government Accountability Office, even though programs with easily quantified short-term results tend to be the least effective over the long term. When dealing with issues like improving governmental skills in transitioning democracies or instilling civic trust and civil-service norms in skeptical populations, it is simply not realistic to expect good statistical measures of progress over a few months or years. For this reason, aid bodies like the U.S. Agency for International Development waste scarce resources every year collecting and broadcasting nearly meaningless metrics, such as the number of trainings staff complete or the number of conferences aid workers attend, to justify their funding. These wastes of time and money come at the expense of longer-term, more effective strategies for bolstering worldwide development.
Goodhart’s Law also affects the demand side of foreign aid. Developing countries understand that financial assistance is frequently tied to macroeconomic outcomes. These countries therefore have clear incentives to skew statistical indicators like per-capita gross national income (GNI), population counts, foreign direct investment, and other metrics that the United Nations and other international bodies use to determine where to distribute aid.
Researchers found exactly this pattern of “aid-seeking data management” in the per-capita GNI figures of countries eligible for World Bank assistance. The researchers compared each country’s GNI figures as published online with the figures printed for the same periods in the World Bank Atlas. Because electronic data is much more easily revised by national statistical offices, discrepancies between the printed figures and the more up-to-date online figures are a signature of after-the-fact adjustment, and those discrepancies appeared only in countries right at the threshold of qualifying for World Bank funds at the time of application. After the World Bank processed those countries’ aid applications, the discrepancies disappeared.
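As a rough sketch of how one might hunt for this pattern (the file name, column names, eligibility threshold, and band below are all hypothetical illustrations, not details from the original study), one could flag country-years where the printed and online GNI figures diverge and then check whether those flags cluster near the cutoff:

```python
import pandas as pd

# Hypothetical input: one row per country-year, with GNI per capita as
# printed in the World Bank Atlas and as later published online.
df = pd.read_csv("gni_print_vs_online.csv")  # hypothetical file

# Hypothetical eligibility cutoff for concessional World Bank lending;
# the real operational threshold changes from year to year.
THRESHOLD = 1_135  # USD, illustrative only

# A country-year is "discrepant" if the online figure was revised away
# from what appeared in print.
df["discrepant"] = df["gni_online"] != df["gni_print"]

# A country-year is "near the cutoff" if the printed figure sits within
# a small band around the threshold at the time of application.
band = 0.05  # +/- 5% of the threshold, illustrative only
near = (df["gni_print"] - THRESHOLD).abs() <= band * THRESHOLD

# Goodhart-style manipulation predicts discrepancies concentrated near
# the cutoff; far from it, print and online figures should agree.
print("Discrepancy rate near cutoff:", df.loc[near, "discrepant"].mean())
print("Discrepancy rate elsewhere:  ", df.loc[~near, "discrepant"].mean())
```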
Data and Science
If misreporting is encouraged by promises of reward, then scientific research is another victim of Goodhart’s Law. Indeed, the replication crisis that spans most fields of empirical academic inquiry is, in no small part, attributable to the Law. The most notorious examples of metric fixation faced by researchers and by the journals publishing their work are, respectively, the “h-index” and the “journal impact factor.” Both indices purport to measure the overall impact of empirical work on its field of knowledge, yet both rely on citation counts and on the volume of output by researchers and publishers. By construction, both the h-index and the journal impact factor give researchers incentives to bury research that shows null results or that merely confirms the prevailing consensus. Likewise, neither offers any incentive to replicate or reproduce other scientists’ results. Good science relies on replication to confirm new results over time as more information becomes available, but replication studies have little or no effect on h-indices and impact factors relative to new, eye-catching findings that, though dubious, cut against the ongoing consensus.
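The h-index itself is simple to state: a researcher has an h-index of h if h of their papers each have at least h citations. A minimal sketch (the citation counts below are invented for illustration) shows why a portfolio of rarely cited replication studies barely registers:

```python
def h_index(citations: list[int]) -> int:
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Invented citation records for illustration.
novel_findings = [120, 85, 60, 44, 30, 22, 15, 11, 9, 8]
replications   = [3, 2, 2, 1, 1, 0, 0, 0, 0, 0]

print(h_index(novel_findings))                 # 9
print(h_index(replications))                   # 2
print(h_index(novel_findings + replications))  # still 9: the ten
# replication studies add nothing to the combined index.
```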
Paul Smaldino and Richard McElreath refer to this phenomenon as the “natural selection of bad science.” They found that rewarding output volume over output quality leads to research of diminishing quality and high false-discovery rates. Stuart Ritchie, in his book Science Fictions, likewise demonstrates how misaligned metrics, tied to institutional incentives that favor poor-quality research, undermine the public trust on which science is built. Ritchie points out that, because both h-indices and journal impact factors rest on citation and publication counts, scientists and journals face pressure to form citation rings that continuously cite one another’s work regardless of its quality.
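A toy simulation, loosely inspired by the Smaldino-McElreath selection dynamic, makes the mechanism concrete; every parameter and functional form below (the base rate, the effort-to-throughput tradeoff, the mutation size) is an invented illustration, not the published model. If labs are imitated in proportion to publication counts, and low rigor buys more publications at the cost of more false positives, rigor is selected away even though no individual intends it:

```python
import random

random.seed(0)

N_LABS, GENERATIONS = 100, 200
BASE_RATE = 0.1  # share of tested hypotheses that are actually true

def run_lab(effort: float) -> tuple[int, int]:
    """One generation of one lab: returns (publications, false positives).

    Lower effort means more studies run but a higher false-positive
    rate; only 'positive' results get published (the file drawer).
    """
    n_studies = int(10 / effort)          # rigor costs throughput
    alpha = 0.05 + 0.5 * (1.0 - effort)   # sloppiness inflates alpha
    power = 0.8
    pubs = false_pos = 0
    for _ in range(n_studies):
        if random.random() < BASE_RATE:   # true hypothesis
            pubs += random.random() < power
        elif random.random() < alpha:     # false hypothesis, "finding"
            pubs += 1
            false_pos += 1
    return pubs, false_pos

# Each lab's heritable trait is its rigor ("effort") in [0.1, 1.0].
labs = [random.uniform(0.1, 1.0) for _ in range(N_LABS)]

for _ in range(GENERATIONS):
    pubs = [max(run_lab(e)[0], 1e-9) for e in labs]
    # Selection: new labs imitate old ones in proportion to publication
    # counts, with slight mutation in effort.
    labs = [min(1.0, max(0.1, random.choices(labs, weights=pubs)[0]
                         + random.gauss(0.0, 0.01)))
            for _ in range(N_LABS)]

final = [run_lab(e) for e in labs]
fdr = sum(f for _, f in final) / max(sum(p for p, _ in final), 1)
print(f"mean effort: {sum(labs) / N_LABS:.2f}")  # drifts to the 0.1 floor
print(f"false-discovery share: {fdr:.2f}")       # high: most "findings" false
```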
The Stakes of Big Data
As all of these cases illustrate, even with the best of intentions, data can be manipulated to the point of uselessness when it becomes an end in itself. When collecting data or constructing metrics, then, it is critical to ask how and why a given measure could mislead as well as how it might illuminate. Jerry Muller suggests a number of germane questions in his book The Tyranny of Metrics. His checklist centers on determining the usefulness of the information collected in the context of the environment in which it is to be collected. In particular, he recommends asking how the people being measured within an organization might respond to that measurement, a question that is all the more important when reward and punishment are directly tied to metrics. Along the same lines, it is important to consider who is designing the metrics and for what purposes they are being collected.
Metrics are most useful in low-stakes situations, where practitioners use them to shine light on the large-scale processes that might be at work or to discover insights about process improvements. It is also crucially important to consider the secondary effects of measurement within organizations: a focus on short-term measurement can come at the expense of long-term objectives. The bottom line is that when metrics become intertwined with institutional incentives to skew, misrepresent, or fudge them in order to attain reward or avoid punishment, organizations must heed Charles Goodhart’s timeless warning.