According to Max Roser, an economist and founder of the website Our World in Data, the most critical issue in applying data-driven solutions to the pressing problems of human progress is not artificial intelligence, nor is it further advances in algorithms for prediction or insight. The next great step forward lies in clean, consistent CSV files that are well documented with a README file. This trivial-sounding yet crucial component of data analysis is mostly ignored or assumed to have already been solved. That assumption, however, is a chimera.
Despite all the advances in computing power and automation over the past 10 years, we still see a huge amount of user error at the point of data entry. In fact, one estimate holds that 88 percent of existing spreadsheets contain errors, and an entire European non-profit organization, the European Spreadsheet Risks Interest Group, is dedicated to the business risk that flows from poor spreadsheet practices. The answer, as Roser points out and this article documents, lies in the simple, grueling work that no one wants to do: checking and re-checking data entry to minimize the effect of errors on the conclusions we can glean from recorded data.
Spreadsheet Failure Case Studies
Roser’s argument came in response to an Excel-related error committed by the British National Health Service (NHS) in October. Excel, the popular spreadsheet program bundled with Microsoft Office, has a built-in row limit, and the NHS department tasked with collecting and disseminating information on coronavirus testing and contact tracing neglected it. As a result, the number of cases for the week of September 25 to October 2 was initially reported as 7,000 when the true figure was closer to 11,000. More shocking still, for the northwest of England, which was then suffering a terrible outbreak, correcting the spreadsheet error increased the weekly case count by 93 percent.
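For teams loading spreadsheet exports programmatically, one low-effort guard against this failure mode is to compare a table’s row count against the format’s hard ceiling. The Python sketch below is a minimal illustration, not a reconstruction of the NHS pipeline; the file name and helper function are invented for the example.

```python
import pandas as pd

# Hard row limits of common spreadsheet formats: the legacy .xls format
# caps out at 65,536 rows, while .xlsx allows 1,048,576.
FORMAT_ROW_LIMITS = {".xls": 65_536, ".xlsx": 1_048_576}

def warn_if_truncated(path: str, df: pd.DataFrame) -> None:
    """Flag a table whose row count sits at (or just under) its file
    format's hard limit, a strong hint that source rows were cut off."""
    for ext, limit in FORMAT_ROW_LIMITS.items():
        if path.lower().endswith(ext) and len(df) >= limit - 1:
            print(f"WARNING: {path} holds {len(df)} rows, at or near the "
                  f"{ext} ceiling of {limit}; the source may be truncated.")

# Hypothetical usage:
# cases = pd.read_excel("weekly_cases.xls")
# warn_if_truncated("weekly_cases.xls", cases)
```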
Another famous Excel-related error that led to a massive, adverse outcome comes from an academic analysis of the relationship between public debt and average GDP growth. The problem started when a graduate student named Thomas Herndon discovered that he couldn’t replicate the results of a major economics paper by Harvard University economists Carmen Reinhart and Kenneth Rogoff. Herndon and his professors (whose class assignment he had been struggling to complete) wrote to Reinhart and Rogoff to acquire their original spreadsheet, and to their credit, the authors shared it. Herndon and his supervisors immediately spotted several Excel-related errors, most notably that Reinhart and Rogoff had left out five of the 20 countries they were considering. Given how important this 2010 paper had been in informing national and international governments’ financial policies in the wake of the sustained economic downturn, the fact that conclusions on this scale were partly byproducts of data-entry errors was hugely disturbing.
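The mechanism at work here, a summary calculation whose range stops short of the full table, is easy to reproduce outside Excel. The sketch below uses invented figures for 20 hypothetical countries purely to show how a mis-specified range silently shifts an average; it is not Reinhart and Rogoff’s actual data or calculation.

```python
import pandas as pd

# Invented growth figures for 20 hypothetical countries (illustration only).
growth = pd.Series([float(i) for i in range(1, 21)],
                   index=[f"country_{i:02d}" for i in range(1, 21)])

full_mean = growth.mean()              # all 20 countries: 10.5
partial_mean = growth.iloc[5:].mean()  # range starts five rows too late: 13.0

print(f"full average: {full_mean}, truncated average: {partial_mean}")
```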
The Simple, Unsexy Solution
Raymond Panko is a professor at the University of Hawaii who studies data-entry and Excel-related errors in businesses and organizations. In his work, Panko has identified three issues behind the prevalence of spreadsheet errors. The first is that while per-cell errors tend to be rare, large spreadsheets almost always contain at least one bottom-line value error. More disturbing are his second and third conclusions: cell-entry errors, once committed, are almost impossible to detect, and spreadsheet developers and the organizations they work for are highly overconfident in their ability to produce an accurate product, which exacerbates the problem. Echoing Edward Lorenz, Panko pointed out that small per-cell error rates of 1 to 5 percent compound over many rows of data to produce wildly misleading statistics. Furthermore, once an entry or keying error is committed, the ability of any individual to identify and correct it is extremely limited and does not appreciably improve with experience. Panko’s main remedy is to employ teams of spreadsheet and data-entry checkers, an approach that has worked well for software development and, he suggested, could be a boon to minimizing data-entry and spreadsheet mistakes.
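Panko’s compounding point is simple arithmetic. If each cell has an independent chance p of being wrong (a simplifying assumption his papers do not state in exactly this form), then a sheet of n hand-entered cells contains at least one error with probability 1 - (1 - p)^n. The short sketch below shows how quickly that figure approaches certainty.

```python
def prob_at_least_one_error(p: float, n: int) -> float:
    """Probability that at least one of n cells is wrong, given a
    per-cell error rate p and assuming errors occur independently."""
    return 1 - (1 - p) ** n

for p in (0.01, 0.05):
    for n in (100, 1_000, 10_000):
        print(f"per-cell rate {p:.0%}, {n:>6} cells: "
              f"{prob_at_least_one_error(p, n):.1%} chance of an error")
```

Even at the low end of Panko’s 1 to 5 percent range, a thousand hand-entered cells make at least one error a near certainty, which is why he falls back on redundant human checking rather than any single technical safeguard.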
Unfortunately, the costs of building these labor-intensive error-correction teams are, in Panko’s estimation, on the order of 20 to 40 percent of development resources. Mitigation procedures that allay simple mistakes in spreadsheet-recorded data also suffer from a mismatched-incentives problem in many organizations: major system or database changes, typically cloaked in terms of “efficiency,” are what get people promoted. Panacean solutions like system switches or adding more layers of storage capacity via Hadoop clusters, cloud computing or other buzzword-inspired platforms are intuitively more appealing, and better for an individual’s career trajectory, than spending roughly a third of development resources to form teams that check others’ work. Indeed, the latter solution smacks of redundancy, an oft-derided concept in business process management. But the truth is that fix-all solutions to routine data-entry problems do not exist and never have. Just as Microsoft Word’s spell checker cannot ensure the style and communicability of a written document, no platform can ensure clean data. Devoting more resources to the hard, mundane, grueling, crucial work of checking and cleaning spreadsheets, as well as documenting what they contain, remains the best and, as of now, most feasible path to avoiding the kinds of unseen errors that can severely skew conclusions drawn from data analyses.
Improve Your Data Processing
Panko’s suggestion that teams of three to eight employees function as data checkers is an ideal solution to this problem, but for small or even midsize entities, it might not be feasible. So, what other steps can organizations take to ensure the quality of their spreadsheets? A few less resource-intensive rules of thumb can also help cut down on spreadsheet errors before they start: slowing down data collection, automating entry where possible, and weighing the trade-off between carefully collected, well-documented information on a small number of important factors and wide-scale, context-free data assembly.
It helps to minimize the amount of hand-entered information whenever possible. Names can often come from online databases, especially first names, and background information that is standardized over time, such as university titles, majors, and the names of cities and states, never needs to be hand-entered at all. Automating these processes cuts down on key-entry errors and misspellings. It also helps to assign unique identifying numbers to individual transactions, employees and locations; relying on typed-out identifiers, such as addresses or names, to link data across enterprise systems can result in massive errors of identification and missing information.
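To see why unique identifiers matter, consider the toy example below. The tables, column names and figures are invented; the point is only that a single stray character in a typed-out name silently drops a record from a name-based join, while a join on a numeric ID is unaffected.

```python
import pandas as pd

# Hypothetical HR and payroll extracts. A trailing space in one typed-out
# name is enough to break a name-based join; the numeric ID is immune.
hr = pd.DataFrame({"employee_id": [101, 102],
                   "name": ["Dana Smith", "Lee Wong"],
                   "department": ["Sales", "Ops"]})
payroll = pd.DataFrame({"employee_id": [101, 102],
                        "name": ["Dana Smith", "Lee Wong "],  # stray space
                        "salary": [52_000, 61_000]})

by_name = hr.merge(payroll, on="name")        # 1 row: Lee Wong silently vanishes
by_id = hr.merge(payroll, on="employee_id")   # 2 rows: every employee is matched

print(len(by_name), len(by_id))
```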
Organizational management should always keep in mind that there is a trade-off between the quantity of data collected and its quality. As Panko’s research suggests, even with small per-cell error rates, the likelihood that errors will occur grows as more data is collected and processed. Given this fact, it is incumbent on organizations, and especially on managers responsible for setting data-driven objectives, to think critically about what information is important enough to collect, what information can be accurately measured and what organizational questions the data addresses. Organizations must, of course, collect certain data for compliance with state and federal mandates, but that is all the more reason to institute good data-collection practices and avoid potential audit or legal noncompliance issues.
Lastly, “Roser’s Rule” of compiling clean CSV files that are well documented with accompanying text files delineating their contents should be the gold standard for data production at every level. Not only does this ensure near-universal usability, it also functions as a check against frivolous, error-ridden data collection. Appropriate documentation helps focus attention on exactly what the data should contain, and the act of documenting can itself help the employees compiling the data to spot errors. None of these suggestions goes anywhere without managerial buy-in. Thus, a shift of focus (and resources) away from sweeping solutions and back to the straightforward, boring, critical task of checking spreadsheets at the point of entry is an excellent first step toward signaling organizational dedication to the oft-understated hard work required to produce high-quality data that best informs organizational objectives.
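One lightweight way to put “Roser’s Rule” into practice is to treat the documentation itself as a machine-checkable artifact. The sketch below, in which the file names, columns and descriptions are all invented for illustration, writes a plain-text data dictionary alongside a CSV and rejects any file whose header drifts from what was documented.

```python
import csv
from pathlib import Path

# Hypothetical schema: the column names and one-line definitions double as
# the README-style data dictionary and as a validation check for the CSV.
SCHEMA = {
    "country": "ISO country name, no abbreviations",
    "year": "Four-digit calendar year",
    "gdp_growth_pct": "Annual real GDP growth, percent",
}

def write_readme(path: Path) -> None:
    """Write a plain-text data dictionary describing gdp_growth.csv."""
    lines = ["Data dictionary for gdp_growth.csv", ""]
    lines += [f"- {col}: {desc}" for col, desc in SCHEMA.items()]
    path.write_text("\n".join(lines) + "\n")

def validate_csv(path: Path) -> None:
    """Refuse to accept a CSV whose header no longer matches the schema."""
    with path.open(newline="") as f:
        header = next(csv.reader(f))
    missing = set(SCHEMA) - set(header)
    extra = set(header) - set(SCHEMA)
    if missing or extra:
        raise ValueError(f"CSV does not match the documented schema: "
                         f"missing={missing}, undocumented={extra}")

# Hypothetical usage:
# write_readme(Path("README.txt"))
# validate_csv(Path("gdp_growth.csv"))
```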