Are Empirical Bayes Models a Data Science Cure-All?

The most common argument leveled against practitioners of Bayesian statistics by their Frequentist cousins is one of subjectivity. The use of prior distributions in estimation is not scientific, so the argument goes, relative to the idea of allowing the data to speak for themselves. Though many excellent arguments exist to counter this common tenet of classical statisticians, Bayesians have not yet pushed them hard enough. This has proven especially true in recent years, as Markov Chain Monte Carlo methods, the ever-expanding efficiency of statistical software, and the open-source movement’s unseating of some of the proprietary dinosaurs have all ushered in a new era of Bayesian methodological opportunity. Sadly, though, Bayesians are squandering this opportunity to unseat “scientific” statistics as the status quo. Despite this lack of a total triumph, however, not all has been lost for Bayesian methods.

A form of watered-down, prior-free Bayesian methods has been gaining in popularity due largely to breakthroughs in computing efficiency that allow for large-scale, structured data processing. Colloquially referred to as empirical Bayes, these methods seek to combine pieces of Bayesian inference with a Frequentist methodology to replace prior assumptions that Frequentists often deride as non-scientific. In doing so, Frequentists believe they can have their Bayesian cake and eat it too. As Bradley Efron and Trevor Hastie state in Computer Age Statistical Inference, “large sets of parallel situations carry within them their own Bayesian information.” These authors also report that “empirical Bayes removes the Bayes scaffolding. In place of a reassuring prior ... the statistician must put his or her faith in the relevance of the ‘other’ cases in a large data set to the case of direct interest.” What does this mean in practice?

Where Frequentist statistics break down — and Bayes shines — is with small samples or groups without a lot of recorded data. Without enough information on a group or groups within structured data, derivable insights will typically be lost in noise or inestimable. What empirical Bayes allows for is a balancing act between the universe of groups within the data and each individual group. This comes by way of a phenomenon called “shrinkage.” Sparing the reader the technical details (although an excellent applied resource for further information is David Robinson’s Introduction to Empirical Bayes), the central idea is that groups within data resemble each other enough to yield insights about each other.

The best way to think about what shrinkage does is with a baseball analogy, as in David Robinson’s book or Bradley Efron and Carl Morris’s famous 1975 work. The typical MLB player has a batting average he accumulates over his career that illustrates how effective he is at hitting. Players in a given season are at different points in their careers, however. Some are just starting out, and others are well established. Attempting to estimate a career batting average for rookie players who haven’t had many at-bats (and discounting their performances in other leagues) is difficult, if not impossible. Empirical Bayes solves this problem by allowing the statistician to combine the overall average of all MLB players at a given time with the newer players’ limited hitting data. Assuming most MLB players’ batting averages are going to look like the league average, the empirical Bayes approach uses this “mean MLB batting average” as a sort of stand-in to supplement what little information is known about rookie players’ hitting abilities. Thus, the combination of player-specific data with the league-wide data serves to “shrink” player averages computed from only a few at-bats toward the league average. As evidence of new players’ hitting abilities accumulates over time, however, this shrinkage effect is eventually outweighed by the player-specific information that better captures their abilities. Thus, shrinkage estimators allow for a kind of borrowing of information from the entire group of major league batters until enough data on any one player exists to estimate his hitting ability accurately.
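The shrinkage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the player records are invented, the prior is taken to be a Beta distribution fitted by a simple method of moments to established players’ averages, and the function names `fit_beta_prior` and `shrunken_average` are hypothetical rather than drawn from any library or from the works cited above.

```python
from statistics import mean, pvariance

def fit_beta_prior(averages):
    """Method-of-moments fit of a Beta(alpha, beta) prior to observed averages."""
    m = mean(averages)
    v = pvariance(averages)
    # For a Beta distribution: mean = a / (a + b), variance = m(1 - m) / (a + b + 1).
    total = m * (1 - m) / v - 1  # this is alpha + beta
    return m * total, (1 - m) * total

def shrunken_average(hits, at_bats, alpha, beta):
    """Posterior mean of a batting average under a Beta(alpha, beta) prior."""
    return (hits + alpha) / (at_bats + alpha + beta)

# Hypothetical (hits, at-bats) records for established players.
established = [(1500, 5000), (1300, 5000), (1400, 5200), (1250, 4800), (1600, 5500)]
alpha, beta = fit_beta_prior([h / ab for h, ab in established])
league_mean = alpha / (alpha + beta)

# A rookie with 4 hits in 10 at-bats is pulled almost all the way to the league mean,
# while a veteran with 1500 hits in 5000 at-bats barely moves.
rookie = shrunken_average(4, 10, alpha, beta)
veteran = shrunken_average(1500, 5000, alpha, beta)
print(f"league mean {league_mean:.3f}, rookie {rookie:.3f}, veteran {veteran:.3f}")
```

The fitted `alpha + beta` acts like a count of “borrowed” at-bats: the more real at-bats a player has relative to that count, the less his estimate is shrunk.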

Another illustrative example is educational test scores. Many school districts use student test scores to inform teacher promotion and pay decisions. But some teachers have far fewer students in their classes than others do. When teachers have only a few students, their statistically estimated effects on students’ scores can be clouded by noise or even inestimable given data sparsity. It’s far more likely that five students in a class of six get perfect scores, for instance, than that the same proportion, 25 students in a class of 30, do so. Empirical Bayes allows all teachers to start at the average and move further away from it, either positively or negatively, based on how much individual evidence accumulates in teachers’ classrooms.
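The small-class point can be checked with a quick binomial calculation. The sketch below assumes, purely for illustration, that each student independently has a 50 percent chance of a perfect score; both that probability and the helper name `prob_at_least` are hypothetical.

```python
from math import comb

def prob_at_least(k, n, p):
    """Probability of at least k successes in n independent trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p_perfect = 0.5  # assumed chance that any one student earns a perfect score

small_class = prob_at_least(5, 6, p_perfect)    # at least 5 of 6 students
large_class = prob_at_least(25, 30, p_perfect)  # at least 25 of 30 students

print(f"small class: {small_class:.4f}, large class: {large_class:.6f}")
```

Under this assumption the same five-in-six share of perfect scores is hundreds of times more likely to arise by chance in the class of six, which is why estimates from small classes warrant more shrinkage.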

Again, Frequentist models simply cannot handle these sorts of low-data situations. Conversely, assuming enough information is collected on all batters/teachers/groups, the results of empirical Bayes analyses will largely coincide with common statistical estimation techniques. Additionally, as David Robinson points out, empirical Bayes is an excellent method to quickly scale shrinkage estimation to huge amounts of data. This indirectly addresses another knock against full Bayes methods, namely that these techniques cannot be used for typical large-scale data tasks due to computational and methodological constraints. Empirical Bayes techniques don’t face this problem. Further, these methods are easily scaled and results can be updated nearly instantaneously when presented with new data.

With all of the promise of empirical Bayes estimation, what are the drawbacks? After all, as economists are wont to point out, there’s no such thing as a free lunch. Empirical Bayes suffers from an intractability problem. Outside of a few conjugate distributions that can be combined to produce an estimating equation, there is simply not much room for empirical Bayes to exist because the Bayes portion of the shrinkage estimator is insoluble. That is, there is not an equation for the statistician to plug data-wide estimates of the mean into. Empirical Bayes also can and often does over-shrink estimates. This tamps down spurious, extreme observations but can also cover up legitimate outliers containing important underlying information. Still, these issues shouldn’t detract from the power of empirical Bayes estimates to outperform their Frequentist analogs. Indeed, in a seminal 1961 work, Willard James and Charles Stein mathematically proved that, when three or more means are estimated simultaneously, a shrinkage estimator outperforms the classical estimate in terms of total squared-error loss, regardless of the parameters’ true values.
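A small simulation can illustrate the James-Stein result. This sketch makes several simplifying assumptions that are not in the text above: the observations are normal with known unit variance, the estimator shrinks toward zero rather than toward a grand mean, and the “true” means in `theta` are invented. The function names are hypothetical.

```python
import random

def james_stein(x, sigma2=1.0):
    """James-Stein estimate shrinking the observed vector x toward zero."""
    p = len(x)
    norm2 = sum(v * v for v in x)
    factor = 1.0 - (p - 2) * sigma2 / norm2
    return [factor * v for v in x]

def squared_error(estimate, truth):
    """Total squared-error loss across all coordinates."""
    return sum((e - t) ** 2 for e, t in zip(estimate, truth))

def compare_risk(theta, trials=5000, seed=0):
    """Average loss of the raw observations versus the James-Stein estimate."""
    rng = random.Random(seed)
    raw_total = js_total = 0.0
    for _ in range(trials):
        x = [rng.gauss(t, 1.0) for t in theta]  # one noisy observation per mean
        raw_total += squared_error(x, theta)
        js_total += squared_error(james_stein(x), theta)
    return raw_total / trials, js_total / trials

theta = [0.2 * i for i in range(10)]  # ten hypothetical true means
raw_risk, js_risk = compare_risk(theta)
print(f"raw estimate risk: {raw_risk:.2f}, James-Stein risk: {js_risk:.2f}")
```

The guarantee concerns the total loss summed over all coordinates; any single coordinate can still be estimated worse after shrinkage, which is the over-shrinking concern raised above.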

Empirical Bayes provides a quality, stop-gap measure between the classic, Frequentist conception of statistical estimation and the full Bayes estimation techniques that have come into their own recently. Empirical Bayes, though, is just that: empirical. The main question is: Can empirical Bayes, or some future iteration of it, provide a way to synthesize classical and Bayesian statistical paradigms? In other words, does empirical Bayes provide, more than its empirical bona fides, a way forward toward a convolution of Frequentist and Bayesian theory? Unfortunately, empirical Bayes cannot provide a coherent synthesis between its two intellectual forebears. Its promise lies in its ability to amalgamate useful bits of major statistical modeling strategies while circumventing the substantial differences between Frequentist and Bayesian statistics. Its piecemeal manner of utilizing existing techniques cannot ultimately form a coherent statistical framework. This renders empirical Bayes a mathematical artifice, a useful kludge. In the end, empirical Bayes is an important tool in the statistician’s toolbox for use with large-scale, hierarchically structured data. And, perhaps, that’s all that it needs to be.
