For 85 years, empirical research has relied on the concept of “statistical significance” as a type of methodological gatekeeper. Studies that produce statistically significant results pass the sniff test and are deemed worthy of publication and discussion. Studies that do not are typically cast into the proverbial desk drawer.

The term “statistically significant finding,” the *sine qua non* of empirical research, is ubiquitous throughout academia. As statistical estimation techniques have moved into government and industry research, the concept has become familiar to a wider array of leaders and stakeholders. But what does the term “statistically significant” actually mean?

## Statistical Significance and NHST

Statistical significance relies on something called a p-value. If a statistical estimate yields a p-value less than or equal to .05, then it is typically considered statistically significant and welcomed into the confraternity of reliable results.

This, of course, raises the question: What is a p-value and why does it have to be less than .05? If you are unsure, you’re not alone: Most researchers cannot answer this question correctly. The actual definition of a p-value requires a bit of explanation of a statistical practice (some would call it a “ritual”) known as null-hypothesis significance testing, or NHST.

NHST is a method by which a researcher starts from a null hypothesis that assumes a true-effect size of zero. In other words, the benchmark assumption before statistical analyses begin is that there is no true statistical effect. From this starting point, the researcher proceeds with estimating the effect from data by employing some statistical model or procedure. This results in an empirical estimate of an effect posited by the model and evidenced by the data.

Uncertainty in the estimated effect is codified in a p-value. The p-value is the cumulative probability that, conditional on the null hypothesis being true (i.e., conditional on there being no true effect), the model produces the estimated effect or a larger one by chance alone. In other words, a p-value of .05 would indicate a 5 percent chance that data generated under a true effect of zero would randomly produce an effect of the estimate’s magnitude or larger.
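To make the definition concrete, here is a minimal sketch using a hypothetical coin-flipping study, which is not an example from the discussion above. The null hypothesis is assumed to be a fair coin, the "estimated effect" is the observed number of heads, and the function name `one_sided_p_value` is invented for illustration. The p-value is simply the probability, computed under the null, of seeing that many heads or more.

```python
from math import comb


def one_sided_p_value(successes: int, trials: int, null_prob: float = 0.5) -> float:
    """Probability of `successes` or more heads in `trials` flips,
    computed under the assumption that the null hypothesis is true."""
    return sum(
        comb(trials, k) * null_prob**k * (1 - null_prob) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical data: 60 heads in 100 flips of a coin assumed fair under the null.
p = one_sided_p_value(60, 100)
print(f"P(60 or more heads | fair coin) = {p:.4f}")
```

Note what the number conditions on: it is a statement about the data assuming the coin is fair, not a statement about how likely the coin is to be fair.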

On its face, the NHST procedure and the p-value that results have a veneer of scientific respectability: A hypothesis is posited, tested with data and disproven if false. A simple example illustrates one of NHST’s major flaws, however. Start with the question, “Is some person seven feet tall?” An empirical solution to this question would involve a straightforward measurement of the person’s height. Easy enough.

But now, adopt the NHST framework and set a null hypothesis of no difference between some person’s height and that of the average person. A measurement of some person’s height yields a difference from the average person’s height that is not equal to zero. Because the difference in the two heights is not zero, according to NHST, there is evidence in favor of some person being seven feet tall or more.

This example sounds ridiculous. But it’s no different than asking the question “Is this effect size x units?” and using NHST to answer “There is a 5 percent chance or lower that a true effect size of zero produced these data, so this provides evidence that the effect size is, in fact, x units.”

## Statistical Strawmen

The reason this sounds so absurd is that it’s a classic logical fallacy: the strawman argument. NHST sets up an answer to a different question than the one the researcher actually asked. NHST then proceeds to “disprove” the former and uses this as evidence for the latter.

What’s worse, the p-value is unable to reject even the strawman! The computation of the p-value takes the null-hypothesis assumption as given. But, in allegedly rejecting a null hypothesis, NHST has conveniently reversed the conditioning so that the p-value goes from being the probability of the data given that the null hypothesis is true to the probability of the null hypothesis being true given the data.

Another way of saying this is that a researcher cannot use the probability of a person being Catholic given that he’s the Pope to answer the question of whether someone is the Pope given that he’s Catholic. Another example of this fallacy is to suggest that because many terrorists have backgrounds in engineering, engineers are more likely to be terrorists. The problem is that the two conditional probabilities assume different reference groups and so they don’t even address the same research question.
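The gap between the two conditional probabilities can be sketched with Bayes' rule. Everything numeric here is an assumption chosen for illustration: the .05 stands in for the probability of the evidence under the null, the 0.5 is a made-up probability of the same evidence under an alternative, and the priors are arbitrary. The function name `prob_null_given_data` is likewise hypothetical.

```python
def prob_null_given_data(
    p_data_given_null: float, p_data_given_alt: float, prior_null: float
) -> float:
    """Bayes' rule: P(null | data) from P(data | null), P(data | alt) and P(null)."""
    joint_null = p_data_given_null * prior_null
    joint_alt = p_data_given_alt * (1.0 - prior_null)
    return joint_null / (joint_null + joint_alt)


# The same P(data | null) = .05 is paired with three assumed priors on the null.
for prior in (0.1, 0.5, 0.9):
    posterior = prob_null_given_data(0.05, 0.5, prior)
    print(f"P(null) = {prior:.1f}  ->  P(null | data) = {posterior:.3f}")
```

Under these assumptions the probability of the null given the data moves from about .01 to about .47 while the probability of the data given the null never changes, which is why one cannot be read off from the other.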

P-values also provide a fallacious scientific gloss to statistical estimation. Researchers confuse p-values with the probability of replicating an estimated effect given new data. Psychologist Gerd Gigerenzer refers to this illusion as the “replication fallacy.” It amounts to a researcher believing that one minus the p-value equals replication probability.

In other words, suppose researchers estimate the probability that an effect size of zero randomly produced the data. They commit the replication fallacy if they believe that collecting more data on the effect in question and re-running the estimation procedure gives them a 95 percent chance or greater of once again finding statistical evidence of a non-zero effect size. The insidious part of the replication fallacy is that misunderstanding what a p-value is leads to the false belief that computing one makes it superfluous to replicate the study in order to confirm or refute the initial results.
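A small simulation can show how far replication rates may sit from the 95 percent that the fallacy implies. This is a sketch under stated assumptions, none of which come from the text: a true standardized effect of 0.3, samples of 30 observations with known unit variance, a one-sided z-test at the .05 level, and hypothetical function names.

```python
import random
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def one_sided_p(sample_mean: float, n: int) -> float:
    """One-sided z-test p-value for a mean, assuming known unit variance."""
    z = sample_mean * sqrt(n)
    return 1.0 - STANDARD_NORMAL.cdf(z)


def replication_rate(true_effect: float, n: int, studies: int,
                     alpha: float = 0.05, seed: int = 0) -> float:
    """Share of significant original studies whose replication is also significant."""
    rng = random.Random(seed)
    standard_error = 1.0 / sqrt(n)
    significant = replicated = 0
    for _ in range(studies):
        original_mean = rng.gauss(true_effect, standard_error)
        if one_sided_p(original_mean, n) <= alpha:
            significant += 1
            # The replication draws fresh data from the same assumed true effect.
            replication_mean = rng.gauss(true_effect, standard_error)
            if one_sided_p(replication_mean, n) <= alpha:
                replicated += 1
    return replicated / significant if significant else float("nan")


rate = replication_rate(true_effect=0.3, n=30, studies=20_000)
print(f"Significant results that replicated: {rate:.2f}")
```

With these assumed settings the replication rate comes out at roughly one half, not 95 percent, even though a real effect exists. The chance of replicating depends on the true effect size and the sample size, neither of which the original p-value reveals.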

To further refute the replication fallacy, Gigerenzer provides the following example. Suppose a researcher wants to know whether a pair of dice is fair or loaded. The null hypothesis is that they are fair. Upon rolling, both dice come up sixes. The probability that two fair dice both land on six, which serves as the p-value here, is (1/6)*(1/6) = 1/36, or about .03.

Gigerenzer then asks if this result means that there is a 1 - .03 = .97 probability that the next roll comes up sixes. Of course it doesn’t. But most researchers believe that if a p-value is .05 or less, then an effect of their estimate’s size or greater would have at least a 95 percent chance of replicating if the experiment were repeated.
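The dice arithmetic can be checked directly. The sketch below computes the exact probability with fractions and then simulates fair dice to see how often a double six is followed by another. The simulation size, seed and function name are assumptions made for illustration.

```python
import random
from fractions import Fraction

# Exact probability that two fair dice both show six.
p_double_six = Fraction(1, 6) * Fraction(1, 6)
# The figure the replication fallacy would assign to the next roll.
fallacious_next_roll = 1 - p_double_six


def double_six_after_double_six(rolls: int, seed: int = 0) -> float:
    """Simulate fair dice; return how often a double six follows a double six."""
    rng = random.Random(seed)
    outcomes = [
        (rng.randint(1, 6), rng.randint(1, 6)) == (6, 6) for _ in range(rolls)
    ]
    followers = [nxt for prev, nxt in zip(outcomes, outcomes[1:]) if prev]
    return sum(followers) / len(followers)


observed = double_six_after_double_six(200_000)
print(f"p-value of a double six under fair dice: {float(p_double_six):.3f}")
print(f"Fallacious 'replication probability':   {float(fallacious_next_roll):.3f}")
print(f"Simulated chance of a repeat double six: {observed:.3f}")
```

For fair dice the chance of a repeat stays near 1/36 no matter what the previous roll showed, so one minus the p-value says nothing about what happens next.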

Because this method for determining statistical significance gained an early foothold in academic publishing, all types of statistical research now rely on it as the best way to separate true effects from random noise in data. It has spilled over into research practices in government, medicine and industry. Although long-established, NHST relies on mischaracterizations and flawed assumptions that fundamentally undermine reliable scientific and statistical practice. As more fields take it up, the ramifications for statistical research could be catastrophic.

## Unstuffing the Strawman

Thankfully, statistical inference and empirical methodology were not always this way, and they certainly do not have to remain so. An easy way to avoid falling prey to the illogic of NHST is by focusing on the research question and the effect under study. This begins by understanding the available data, where it comes from, how it was assembled, its shortcomings and the questions that it can and cannot directly address. As such, the majority of researchers’ efforts should be spent on understanding and being able to describe their data. This runs counter to simply using data as an input for a statistical method or technique so as to produce a statistically significant result.

Emphasizing descriptive inferences about collected information, perhaps via charts, graphics or tables, better serves the researcher’s goal of accumulating evidence in favor of a particular effect. Ideally, statistical studies should use data and context to build a convincing case for or against something. Implicit in this design is not only the use of data but also its interweaving with theory, context, history and other strands of information to build compelling insights.

Quality statistical inference also means a re-emphasis on effect size or impact rather than how well that effect has been measured. Measurement is important, but not in the sense of collecting enough data to ensure a statistically significant estimate. Rather than researchers pumping enough data through NHST to produce statistical significance in the form of a minuscule p-value, good statistical practice should allow the data, as much as possible, to speak for themselves.
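The point about pumping data through NHST can be illustrated with a rough sketch. It assumes a one-sided z-test with known unit variance and an observed mean difference fixed at a practically negligible 0.01 standard deviations. Those numbers, and the function name, are hypothetical choices rather than anything drawn from a real study.

```python
from math import erfc, sqrt


def p_value_for_fixed_effect(effect: float, n: int, sigma: float = 1.0) -> float:
    """One-sided z-test p-value when the observed mean difference equals `effect`."""
    z = effect / (sigma / sqrt(n))
    # erfc keeps precision in the far tail, where 1 - cdf(z) would round to zero.
    return 0.5 * erfc(z / sqrt(2))


tiny_effect = 0.01  # assumed to be too small to matter in practice
for n in (100, 10_000, 100_000, 1_000_000):
    p = p_value_for_fixed_effect(tiny_effect, n)
    print(f"n = {n:>9,}  effect = {tiny_effect}  p-value = {p:.3g}")
```

The effect never changes, yet under these assumptions the p-value falls from well above .05 to a vanishingly small number purely because the sample grows. The p-value tracks how precisely the effect was measured, not how much it matters.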

This method requires researchers to emphasize descriptions of their data, including the uncertainties within them, rather than forcing data through an estimation process predicated on spurious logic. It also requires researchers to think more deeply about their research questions and the ways in which data, along with other convergent lines of evidence, can illuminate understanding of a problem.

Since NHST came out of academia, the introduction of corrective incentives for good statistical practice could begin there. Tenure committees and journal editorial boards could emphasize academic work that splits research into components. One part of rigorous empirical research could be original, exploratory work and another part could focus on confirming or disconfirming the original ideas that exploratory work suggests.

Another good approach is for academic departments to base career mobility opportunities on data collection and the production of quality statistics via the replication of statistical experiments. Academic credit for the use of data could function much like references do now. Researchers could reap career benefits from assembling high-quality data rather than being solely rewarded for producing “statistically significant” results.

This also goes for replication studies. Journals and tenure or promotion review boards should heavily incentivize researchers to replicate each other’s work. Indeed, government and industry could also benefit from these same incentives. Career advancement should be based on publicizing results that are replicable, working to replicate results or collecting and exploring data independently of the NHST paradigm.

Any movement away from solely rewarding researchers for producing statistically significant findings and toward producing high-quality data leading to replicable insights would be a good step toward breaking away from NHST’s methodological stranglehold that cripples quality statistical research.