The Mexican-Canadian novelist Silvia Moreno-Garcia once said, “Beauty attracts beauty and begets beauty.” Satoshi Kanazawa, in an article for the Journal of Theoretical Biology, sought to test this turn of phrase against birth rates. Kanazawa posited that couples rated by surveyors as having higher-than-average physical attractiveness were more likely to have daughters than couples rated as homelier. Using results from a survey of 3,000 American parents rated across five attractiveness levels, he estimated that the ratio of daughters to sons was 56 percent to 44 percent for parents in the uppermost attractiveness category and 48 percent to 52 percent for parents in all other categories. This 8 percentage-point difference had a standard error of 3 percentage points, making the finding statistically significantly different from zero (more than two standard errors away).
The theory was empirically verifiable. Attractiveness bred attractiveness (with the opposite tacitly understood).
When Numbers Lie
But wait. If that were true, wouldn’t the same logic apply to height? To intelligence? No other heritable physical characteristic exhibits anywhere near this large a difference in birth-sex ratios between groups. Even more troubling, a survey sample size of 3,000 is by no means trivial; in most circumstances, a sample of 3,000 is quite appropriate for studying a difference in proportions.
For instance, assuming an effect of zero (i.e., 50 percent girls and 50 percent boys) and a sample of 3,000 split, as in Kanazawa’s survey, into 2,700 parents across the “not-most-attractive” categories and 300 dazzling parents, estimating the birth-sex-ratio difference would yield a standard error of roughly 3 percentage points. This is about what Kanazawa found.
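That back-of-the-envelope standard error can be checked directly. A minimal sketch, using the group sizes from Kanazawa’s survey and the usual formula for the standard error of a difference in proportions:

```python
import math

# Standard error of a difference in proportions under a null effect
# (50/50 births in both groups), using Kanazawa's group sizes.
p = 0.5
n_attractive = 300   # parents in the uppermost attractiveness category
n_other = 2700       # parents in all other categories

se_diff = math.sqrt(p * (1 - p) / n_attractive + p * (1 - p) / n_other)
print(f"standard error of the difference: {se_diff:.3f}")  # ~0.030, i.e. ~3 points
```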
The problem, however, is that sex-ratio differences have been studied extensively for hundreds of years, and difference estimates never come close to 8 percentage points. The unconditional birth-sex ratio in the U.S. is roughly 48.7 percent girls, with variation on the order of 0.5 percentage points. Accurately measuring a true sex-ratio difference at that small a scale requires a sample far larger than 3,000. Kanazawa’s estimation routine, however, confined to the 3,000 observations in his survey, had no systematic way to incorporate this information.
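To see how much larger the sample would need to be, a standard two-proportion power calculation gives a rough guide. The 0.3 percentage-point effect size below is an illustrative assumption, chosen to match the scale of documented birth-sex-ratio variation rather than anything in Kanazawa’s paper:

```python
# Rough sample size needed per group to detect a small difference in
# proportions with 80% power at the 5% level (two-sided, equal groups).
z_alpha, z_beta = 1.96, 0.84   # critical values for 5% alpha, 80% power
p = 0.49                       # approximate proportion of girls
d = 0.003                      # assumed true difference: 0.3 percentage points

n_per_group = 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / d ** 2
print(f"{n_per_group:,.0f} parents per group")  # hundreds of thousands
```

A plausible effect at the true scale of birth-sex-ratio variation would demand hundreds of thousands of observations per group, not a few thousand in total.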
Remember Your Priors
This problem arises because the field of statistics, over the past 100 years, has largely discarded or ignored the practice of encoding what researchers expect estimates to look like before bringing data into their analyses. These prior assumptions (or, more formally, “priors”) form a crucial component of the modeling process: they make explicit how researchers believe their models will produce estimates.
As most statistics is practiced today, however, priors are paid mere lip service if they are acknowledged at all. Data are meant to speak for themselves, to represent an objective assessment of a problem, a “view from nowhere.” This approach lends a veneer of objectivity, but failing to build prior assumptions directly into statistical estimation can lead to totally incoherent conclusions like Kanazawa’s.
Empirics turns on gathering data to answer questions of interest to society. The very act of data generation implicitly assumes that data are useful in addressing problems. This assumption, of course, implies that someone, somewhere has thought about these problems before data collection began. In other words, there is never any such thing as an absence of prior assumptions.
Furthermore, without any preconceived notions of empirical topics of interest, how would researchers determine what questions were most worthy of study? Darts at a question board? Topics out of a hat? It’s ridiculous to assume that an empirical researcher would formulate some question based on subject-matter expertise, think through what data could address said question, discern how best to model the question, and then discard all this information once data become available. Heroic attempts to “allow the data to speak” can lead to faulty conclusions, like Kanazawa’s, or outright absurd ones, like when Daryl Bem published his decade-long study “showing” that ESP was real in the Journal of Personality and Social Psychology.
Failing to specify prior information and incorporate it into statistical analyses ignores relevant context, even in data collection. It’s not an arbitrary decision for researchers to assume that adult human heights fall between about one foot and nine feet, that proportions cannot go below zero or above one, that speeds cannot exceed that of light, and so on. If statistical models produce ranges of estimates beyond these values, however, are they somehow objectively true because that’s what the data suggest?
The major criticism of incorporating priors directly into statistical estimation is that choosing them simply adds another layer of arbitrariness to researcher decision-making (alongside other “researcher degrees of freedom,” e.g., data transformations, eliminating outliers, etc.). As with other researcher degrees of freedom, however, pre-registration of statistical design, including documentation of prior assumptions, mitigates this problem. Besides, many problems of irreproducibility arise during the data-cleaning phase anyway.
The real issue driving statistical flim-flammery is the lack of transparency about researcher decision-making, not how arbitrary those decisions seem. How, where, and why researchers make assumptions about statistical processes, employ specific statistical methods, clean and transform data, and decide what to leave out of this process are all areas rife with potential pitfalls that could lead researchers down the proverbial primrose path. Documenting all of these before formal estimation begins helps both researchers and their audiences stay alert to when data-based outcomes stretch credibility.
There are several ways to use prior information to inform statistical insights. All of them focus on thinking more critically about how data can inform a pre-existing conception of an empirically addressable question. These suggestions are not mutually exclusive; they can all work together to produce better insights from the modeling process.
The most obvious way of systematically including priors is by establishing prior distributions and using a Bayesian approach. This makes prior assumptions into explicit distributions representing various probabilities of potential outcomes. These prior distributions are then combined with the data distribution via Bayes’ formula to systematically weight the prior and the data distribution by the amount of information each contains. The output of Bayes’ formula is the posterior distribution, which is a probability distribution that re-evaluates the prior probabilities of outcomes in light of the probabilistic inferences suggested by the data distribution. Thus, applying Bayes’ theorem to update existing prior assumptions with data allows researchers to make probabilistic inferences about those priors.
As an example, had Kanazawa begun with a prior distribution centered on 48.7 percent with a standard deviation of 0.5 percentage points, as the U.S. birth-sex ratio suggests, he would have produced a more credible posterior estimate of no impact from subjective beauty ratings on birth-sex ratios. Although his sample size was fairly large, it was nowhere near large enough, on its own, to identify a believable effect on the order of a 0.5 percentage-point standard error. Formally incorporating a more believable prior starting point for the U.S. birth-sex ratio and working from there would have produced posterior probabilities pointing strongly to no change in the birth-sex ratio from a change in subjective beauty scores.
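The arithmetic behind this is just a precision-weighted average. A minimal sketch, treating both the prior and the survey estimate as normal distributions (a simplifying assumption) and using the U.S. birth-sex figure cited earlier:

```python
import math

# Conjugate-normal Bayesian update: combine a prior and a data estimate,
# each weighted by its precision (1 / variance). The prior is centered at
# 48.7% girls with sd 0.5 points; the data estimate is Kanazawa's 56% for
# attractive parents, with a ~3-point standard error.
prior_mean, prior_sd = 0.487, 0.005
data_mean, data_se = 0.56, 0.03

w_prior, w_data = 1 / prior_sd**2, 1 / data_se**2
post_mean = (w_prior * prior_mean + w_data * data_mean) / (w_prior + w_data)
post_sd = math.sqrt(1 / (w_prior + w_data))
print(f"posterior: {post_mean:.3f} +/- {post_sd:.3f}")  # ~0.489: barely moved
```

Because the prior carries roughly 35 times the precision of the survey estimate, the posterior sits almost exactly at the prior: the data are simply too weak to budge a well-established birth-sex ratio.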
Prior distributions, or, even better, prior predictive distributions, which produce potential outcomes from a model on an interpretable scale (i.e., model forecasts made before any data are applied), can be incorporated into pre-registration. This allows potential model outcomes to be reviewed before data are brought to bear on the question. It also clarifies whether the underlying assumptions could produce spurious or non-credible results.
These synthetic or forecasted outcome data also make it much easier to incorporate relevant expertise from those without formal statistical training. Prior predictive distributions produce synthetic output data directly. For instance, a prior distribution centered on 48.7 percent with a standard deviation of 0.5 percentage points (as Kanazawa might have employed) can be used to synthesize potential birth-sex ratios many times over. Demographers or pediatricians could then evaluate elements of these prior-predictive draws, such as maximum values, minimum values, or other statistical markers, against what they know theoretically. If a prior predictive distribution regularly produced a birth-sex ratio of 100 percent, there might be some trouble with the assumptions underlying the model.
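Such a prior predictive check can be sketched in a few lines: draw a birth-sex ratio from the prior, simulate a survey’s worth of births at that rate, and repeat. The prior values follow the running example; the sample and simulation sizes are arbitrary choices:

```python
import random

# Prior predictive simulation: each iteration draws a birth rate of girls
# from the prior (48.7% +/- 0.5 points), then simulates 3,000 births.
random.seed(1)
n_births, n_sims = 3000, 1000
simulated_ratios = []
for _ in range(n_sims):
    p = random.gauss(0.487, 0.005)                      # draw from the prior
    girls = sum(random.random() < p for _ in range(n_births))
    simulated_ratios.append(girls / n_births)

# Summary markers a demographer could sanity-check against theory:
print(f"min {min(simulated_ratios):.3f}, max {max(simulated_ratios):.3f}")
```

A domain expert needs no statistical training to spot that simulated ratios of, say, 70 percent girls would be impossible in reality; the check translates modeling assumptions into numbers anyone with subject-matter expertise can vet.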
Fully Bayesian approaches are not the only method by which prior expertise can be brought to bear on statistical estimation. The formulation of new theoretical models can be a boon to statistical inference within fields. Rather than relying on “workhorse” models like linear regression, ANOVA, or principal component analysis, researchers should attempt to formulate original mathematical relationships between predictors and outcomes in domain-specific contexts. Doing so can clarify new or unknown relationships that data can then verify.
Formulating a theoretical model and then using data to evaluate it is much more in line with the scientific method than contorting prior contextual understanding into a “context-free” form that fits a pre-existing model framework. The latter is a recipe for producing spurious, irreplicable outcomes. For instance, rather than predicting a new temperature after a change in atmospheric pressure by fitting a simple linear model of the ratio of pressure changes to temperature changes, it is much more accurate to plug temperature and pressure into a form of the combined gas law, which posits a theoretical physical relationship between the pressure, temperature, and volume of a gas.
If context does not allow or complicates theoretical model formulation, thinking through systems via impact charts or directed acyclic graphs (and pre-registering these) clarifies for a discerning audience the thought processes undergirding researchers’ choices of predictors in standard statistical models. It also adds clarity about which predictors researchers are prioritizing and the relationships among them. Although these tools lack the mathematical rigor of a fully Bayesian approach or of formulating theoretical relationships between predictors and outcomes, they can go a long way toward encoding the prior assumptions researchers are operating under.
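As a sketch of what “encoding” such a graph can look like, here is a tiny DAG written as an adjacency list, with a check that the assumed relationships really are acyclic. The variable names are hypothetical, not drawn from any real study:

```python
# A directed acyclic graph as an adjacency list: each key maps to the
# variables it is assumed to directly influence. Names are illustrative.
dag = {
    "parental_attractiveness": ["rater_score"],   # attractiveness drives ratings
    "rater_score": [],
    "parental_age": ["birth_sex_ratio"],          # hypothesized direct effect
    "birth_sex_ratio": [],
}

def is_acyclic(graph: dict) -> bool:
    """Depth-first search for a back edge (which would indicate a cycle)."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return True
        if node in visiting:
            return False           # back edge found: a cycle
        visiting.add(node)
        ok = all(visit(child) for child in graph.get(node, []))
        visiting.discard(node)
        done.add(node)
        return ok

    return all(visit(node) for node in graph)

print(is_acyclic(dag))  # True
```

Pre-registering even this small a structure commits researchers to a causal story before estimation, making it harder to quietly swap predictors in and out after seeing the results.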
At the very least, running estimated results on an interpretable scale by those without a statistical background but with context-specific expertise can potentially help avoid screwball results. Amazingly, non-wonks have a helpful role to play in checking to make sure model conclusions resemble reality. Helpful clarification from a city planner asking “How exactly does living beside a freeway cause autism?” or a professional diplomat suggesting that there isn’t really any credible evidence, based on their experience, that universal health care breeds terrorism can go a long way in preventing absurd empirical conclusions from entering the public record. Ridiculous as these estimates and other spurious results sound, they occur with alarming frequency.
Reaching out to non-statistical experts or including them on research teams and running initial results by them to at least ascertain whether estimates jibe with previous work or not (and if not, why not) are good ways to get fresh perspectives on whether and where estimates might have gone awry. The very act of describing empirical results in an interpretable way to someone with contextual but not statistical training could lead researchers to catch faulty assumptions or spurious results. Changing the descriptive context allows for a fresh look at the veracity of research results.
Better Priors Make Better Research
The assumption that prior information should be considered only tangentially or qualitatively misses the crucial role that the systematic use of priors can play in preventing spurious results from poorly collected or sparse data. Moreover, seriously considering priors can operate as a form of model verification and can bring qualitative, contextual, or otherwise unmeasurable expertise directly into statistical analyses. Finally, there is no such thing as statistical objectivity. The concept is totally incoherent and has functioned as cover for poor statistical work for too long.