Data science and big data analytics have become the new must-haves for businesses across many industries. Gone are the days when algorithm development and large-scale data mining were confined to Silicon Valley. In the modern, tech-savvy age, it’s almost taken for granted that banks, insurance brokerages, healthcare entities, and other non-tech-sector companies seek to be “the next Apple/Google/Amazon” or whatever tech behemoth completes the C-suite’s bromide. This is true not just in word, but in deed.
Over the past decade, many companies have invested tens or even hundreds of millions of dollars in digital capital and in the workforce needed to use and maintain it. Large-scale storage clusters capable of holding terabytes or petabytes of information contribute substantially to these costs, and data collection and warehousing carry substantial up-front capital costs of their own. Companies also typically underestimate the cost of converting between legacy and new systems. The largest digital costs, however, occur in the labor realm.
Data scientists, statistical-analytics consultants, predictive analysts: these professionals have commanded hefty compensation relative to other occupations since 2010. According to a survey by Burtch Works Executive Recruiting, non-managerial data scientists’ median 2018 salaries ranged from $95,000 to $165,000 across experience levels, while data science managers’ compensation ranged from $145,000 to $250,000. In the ever-changing digital landscape, the one constant is that payroll dominates data-infrastructure costs.
Both the complexity and the scale of modeling have grown over the past decade as data scientists’ skill sets have become more mathematically rigorous, largely because software can now automate the low-level data tasks historically performed by junior data scientists and data analysts. The result is twofold: the previously mentioned rise in salaries for data scientists, whose jobs now require a STEM-field master’s or doctoral degree, and greater sophistication in machine-learning algorithm development, as highly specialized data scientists have more time for model building. With massive financial resources now being pumped into big data, the companies purchasing it must expect its predictive benefits to outweigh its costs. Unfortunately, recent evidence suggests that model complexity does not imply predictive accuracy. Indeed, complex data-science models are frequently out-predicted by simpler alternatives.
Spyros Makridakis has made a career of pitting forecasting methods against one another in prediction tournaments. His most recent tournament, the fourth he and his team have hosted, ended in May 2018 (a fifth is currently underway). A major result of that M4 tournament was that, contrary to what most data scientists claim, combinations of simpler methods routinely performed better than any single method. Most disheartening for data scientists, though, was Makridakis’ finding that the worst-performing methods, in terms of forecasting accuracy, were machine-learning algorithms.
The likely cause is that complex models requiring huge amounts of data mistake noise in those data for legitimate signal. This phenomenon is called overfitting; Nate Silver described it as “the most important scientific problem you’ve never heard of.” Overfitting occurs when a model fits past data too closely. Those data can contain legitimate trends that are valuable for predicting outcomes, but they can also contain a great deal of random noise. The difficulty lies in constructing models powerful enough to capture the trends without being so powerful that they also capture randomness that merely looks trend-like. When the latter occurs, the model overfits the past data.
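The trade-off is easy to see in miniature. The sketch below is a hypothetical illustration on synthetic data, not drawn from any study discussed here: a genuine linear trend buried in noise is fit with a straight line and with a 15th-degree polynomial, and both fits are then scored on fresh data from the same process.

```python
# Toy illustration of overfitting on synthetic data (hypothetical example only).
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def noisy_trend(x):
    """A genuine linear trend (slope 2) plus random noise."""
    return 2.0 * x + rng.normal(scale=3.0, size=x.size)

x_train = np.linspace(0, 10, 20)
y_train = noisy_trend(x_train)
x_test = np.sort(rng.uniform(0, 10, size=50))   # fresh data from the same process
y_test = noisy_trend(x_test)

for degree in (1, 15):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    test_mse = np.mean((model(x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:7.2f}, test MSE = {test_mse:7.2f}")
```

On a typical run, the 15th-degree fit reports a much lower error on the data it was trained on but a higher error on the fresh data: overfitting in a dozen lines.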
An example is the Google Flu Trends failure. In 2008, Google attempted to predict where influenza-like illnesses would occur using a complex algorithm that company researchers applied to millions of daily search terms. In a Nature article, Google claimed it could predict outbreaks of flu, in terms of doctor visits, two weeks before the Centers for Disease Control and Prevention. Unfortunately, Google Flu Trends massively over-predicted flu caseloads, especially in 2011–13. A contributing reason was that Google’s researchers had made the algorithm more complex following its earlier failures in the wake of the 2009 H1N1 pandemic. After Google scrapped the project in 2015, it made its data and predictions publicly available. Recently, a team of researchers from the Max Planck Institute re-estimated predictions of flu-related doctor visits for the 2008–13 period using an almost comically simple method: “predict that the number of flu-related doctor visits is equal to the number of visits two weeks ago.” These researchers’ predictions subsequently outperformed those of Google Flu Trends. Why? The answer is counterintuitive: their model is too simple to mistake noise for signal.
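The rule itself takes only a few lines to state in code. Below is a minimal sketch using made-up weekly visit counts (the numbers are illustrative; the actual Google Flu Trends data are not reproduced here), scored with mean absolute error.

```python
# The "same as two weeks ago" rule, sketched on hypothetical weekly counts
# of flu-related doctor visits (illustrative numbers, not the real data).
def persistence_forecast(series, lag=2):
    """Predict each period's value as the value observed `lag` periods earlier."""
    return series[:-lag]  # the forecast for period t is the observation at t - lag

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(predicted)

visits = [120, 135, 150, 160, 180, 210, 240, 230, 200, 170]  # hypothetical weekly counts

predictions = persistence_forecast(visits, lag=2)  # forecasts for weeks 3 through 10
actuals = visits[2:]                               # the weeks actually being forecast
print(f"two-week persistence MAE: {mean_absolute_error(actuals, predictions):.1f}")
```

Any more elaborate model would then have to beat this baseline’s error on data it has never seen.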
These examples bring us back to the earlier question: Do the costs of resources allocated to digital and big data infrastructure outweigh the benefits? Evidence is starting to mount that the answer is likely yes. Knowing this, could there be alternative purposes for companies’ investments in big data? Maybe it’s an extreme case of keeping up with the Joneses. Perhaps it’s a cover-your-ass measure that casts models, rather than bad judgment or misaligned incentives, as the villains. There might even be an “astrology effect,” where the seductiveness of prescience outweighs the critical (and hugely undervalued) ability to admit ignorance. Whatever the reasons, what’s obvious is that the more resources companies, economy-wide, pour into digital infrastructure, the less profitable these investments appear to become.
What’s to be done, then? One solution would be to adopt a prediction-tournament-style evaluation of models within a company. Data science managers can set aside hold-out samples of newly collected data that modelers are not allowed to see or use for model estimation. These hold-out samples serve as benchmarks for data-science teams’ predictions, and all models should be pitted against one another to see which best predicts the hold-out data. A Makridakis-style tournament approach thus helps filter out models that overfit and motivates new thinking about why specific models produce better or worse predictions. Most importantly, this development environment can shed light on how best to target resources among an increasingly vast array of digital-infrastructure options.
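As a rough sketch of what such an in-house tournament could look like, the code below scores every candidate model on the same hold-out sample and ranks the results. The toy data, the stand-in models, and the choice of mean absolute error are illustrative assumptions rather than a prescribed setup.

```python
# Tournament-style evaluation: every candidate is scored on the same hold-out
# sample, which none of them touched during estimation. All names and numbers
# here are illustrative assumptions.
import numpy as np

def mean_absolute_error(actual, predicted):
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))

def run_tournament(candidates, X_holdout, y_holdout):
    """Score each candidate's predictions on the hold-out sample; lower error ranks first."""
    scores = {name: mean_absolute_error(y_holdout, predict(X_holdout))
              for name, predict in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1])

# Toy hold-out sample and two stand-in "models" (prediction functions assumed to
# have been fit elsewhere, on training data the tournament never sees).
rng = np.random.default_rng(1)
X_holdout = rng.uniform(0, 10, size=100)
y_holdout = 3.0 * X_holdout + rng.normal(scale=2.0, size=100)

candidates = {
    "constant_guess": lambda X: np.full_like(X, 15.0),  # naive baseline: one fixed number
    "linear_rule": lambda X: 3.0 * X,                    # a rule that captures the real trend
}

for name, mae in run_tournament(candidates, X_holdout, y_holdout):
    print(f"{name:15s} hold-out MAE = {mae:.2f}")
```

Candidates that have overfit their training data will tend to slide down such a leaderboard, which is exactly the filtering effect the tournament is meant to provide.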