Software development for statistical, analytical, or empirical purposes was dominated, for its first 30 years, by companies such as SAS, SPSS, Minitab, and Stata. These companies developed products and sold licenses or tiered-price packages for their data-analytics software. But beginning in the mid-1990s, and especially after 2000, the open-source movement began encroaching on what was once the sole purview of paid statistical software. Python jumped from traditional programming into analytics, and the new, statistics-specific programming language R arose as a free implementation of the S language, built atop a foundation of Fortran and C. These products were freely available, constantly updated, and enjoyed near-instant worldwide distribution.
The most dramatic difference between these new products and the proprietary hegemons of analytical programming, though, concerned development. The source code of the open-source languages was freely available for any user to modify. This approach departed markedly from the traditional software-development model: hire the best minds from computational statistics or social science, concentrate their talents at or near corporate headquarters, and jealously guard the professionally developed source code.
In line with Eric Raymond’s essays (most famously “The Cathedral and the Bazaar”), two paradigms of statistical programming have thus arisen. Which is preferable? Both, of course, have costs and benefits. Rather than simply comparing statistical software on monetary price, though, consider some of its largest non-pecuniary costs. I argue that the largest perceived costs of open-source software relative to proprietary software are not drawbacks at all. Namely, the cost of converting from proprietary legacy systems to open-source, the security risks of open-source relative to proprietary software, and the steeper learning curve of open-source are all either overstated as costs or actually turn out to be long-run benefits.
Most companies aren’t building analytics platforms from scratch. Many organizations have relied on proprietary-software infrastructure to provide analytics solutions and continue to do so. Presumably, conversion of these systems to open-source equivalents is too expensive in terms of redesigns and process-flow interruptions. The opportunity cost of disruption is simply too steep.
This argument misses an important distinction, however. Most automated processes and programming tasks do not fall under the purview of statistical software. Rather, analytics teams typically function as in-house consultancies: they take on questions of relevance from major lines of business within an organization and tailor statistical code to fit the needs of these “clients.” This process produces code repositories too specialized to be reusable. Likewise, the end products of most of these analyses are not reproducible software that scales, but individualized reports, tables, visuals, or slide decks that address a line of business’s specific questions. The time required to port the fungible pieces of these programs is often overstated because analytical-software solutions are mistakenly conflated with software-engineering or IT tasks. The claim that converting analytical work to another language is as burdensome as replacing software infrastructure is thus belied by the primary purpose of statistical code: answering ad hoc questions of timely relevance, not permanently automating entire business processes. Most analytical code simply doesn’t have to be rewritten.
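To make the distinction concrete, here is a minimal sketch of the kind of one-off script an in-house analytics team actually produces. It is a hypothetical Python example; the file name sales.csv and its columns are assumptions, not a real dataset.

```python
# A typical in-house "consultancy" deliverable: a one-off answer to a
# business question, not reusable infrastructure. The file and column
# names (sales.csv, region, revenue) are hypothetical.
import pandas as pd

sales = pd.read_csv("sales.csv")

# One ad hoc question: which regions drove revenue last quarter?
summary = (
    sales.groupby("region")["revenue"]
         .agg(["sum", "mean", "count"])
         .sort_values("sum", ascending=False)
)

# The end product is a table destined for a report or slide deck,
# not deployable software.
print(summary.to_string())
```

Porting a script like this to another language is an afternoon’s work, not an infrastructure migration.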
Another argument against open-source software is security risk. Because its source code can be freely modified, open-source software can supposedly pose organizational threats. This conjecture simply assumes that, because something is open, it poses an elevated risk relative to a paid alternative. This is a chimera. Open-source code provides a decentralized defense against security risk: because the source is open to all, fixes for potential malfeasance can arise anywhere. Moreover, the more people who use open-source software for all sorts of needs, the greater the incentive for users to preserve and protect the source code that drives open-source evolution. Proprietary software, by contrast, requires a small group of core developers to learn about and effectively counter new risks before they arise. Even granting that group perfect knowledge of risks, which is a near impossibility, a small group of developers working in a cloistered environment would struggle to bear the sheer time costs of the constant updates and redevelopment needed to counteract ever-evolving security threats.
Lastly, the challenge of learning open-source languages vis-à-vis their proprietary cousins represents a potential cost. Proprietary languages can indeed be quicker to pick up than open-source alternatives, especially for routine, heavily used analytical procedures. These procedures are run often enough, and by enough people, that proprietary languages offer quick, out-of-the-box solutions for common analytical tasks at the expense of the deeper, hard-won familiarity that comes from scripting analyses oneself. Trading procedural understanding for implementation simplicity, however, can be both a blessing and a curse.
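As a rough illustration of the trade-off, consider a task that a proprietary package typically wraps in a single canned procedure: a two-sample comparison. The open-source version below (a Python sketch on synthetic data) is nearly as short, but the analyst explicitly chooses the method and its assumptions.

```python
# Two-sample comparison in open-source Python. The data are synthetic;
# in a proprietary package this would often be one canned procedure.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
group_a = rng.normal(loc=10.0, scale=2.0, size=100)
group_b = rng.normal(loc=11.0, scale=2.0, size=100)

# The analyst picks the variant (here Welch's t-test, which does not
# assume equal variances) rather than accepting a menu default.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```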
Part of the challenge of learning open-source programming languages is how flexible they are. But this flexibility also gives the user more options for adapting open-source procedures to new problems: users can recombine pieces of existing methods into new analytical procedures. This fosters deeper thinking about how and why code works, and it serves a dual purpose, developing coding skill and new statistical methods at the same time. These developments can, in turn, be broadcast to the wider open-source community.
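A minimal sketch of that recombination follows (synthetic data; the helper name bootstrap_ci is hypothetical): a generic resampling loop combined with any estimator the analyst writes yields a new procedure, here a percentile bootstrap confidence interval.

```python
# Recombining existing pieces (resampling + an arbitrary estimator)
# into a new procedure: a percentile bootstrap confidence interval.
# A sketch of the pattern, not a production implementation.
import numpy as np

def bootstrap_ci(data, statistic, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for an arbitrary statistic."""
    rng = np.random.default_rng(seed)
    estimates = np.array([
        statistic(rng.choice(data, size=len(data), replace=True))
        for _ in range(n_resamples)
    ])
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lower, upper

rng = np.random.default_rng(1)
sample = rng.exponential(scale=3.0, size=200)  # synthetic skewed data

# The same machinery works for the median, a trimmed mean, or any
# estimator the analyst writes; no vendor procedure is required.
print(bootstrap_ci(sample, np.median))
```

The same few lines serve the median, a trimmed mean, or a brand-new estimator, which is precisely the kind of adaptation a fixed menu of vendor procedures forecloses.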
Proprietary analytics companies struggle with this kind of development dexterity because their processes are often over-engineered and cannot quickly or efficiently be exposed to a wider pool of potential developers. Proprietary software thus tends toward short-term thinking, ensuring ease of use and emphasizing status-quo procedures at the expense of longer-term incentives for users to understand more deeply how code, and the statistical processes driving it, can be applied and adapted.
All else equal, the perceived costs of open-source software are either overestimated or become benefits over a long enough timeframe. There are other arguments to be made, but a convincing litmus test is what business customers prefer, and evidence from the past six years does not favor proprietary software. According to Burtch Works’s 2019 survey of data-analytics professionals, SAS, the most popular proprietary statistical software, has been losing market share to both major open-source platforms, R and Python, over the past half-decade (with Python emerging as the winner). What’s especially striking is the movement toward open-source programming languages across years of experience: those with less than a decade of work experience are roughly five to 10 times more likely to prefer one of the open-source languages to SAS. The trend is even starker among future workers: 95 percent of college and graduate students prefer open-source to proprietary software. The recent movement toward open-source programming and away from proprietary software reflects organizations across many industries “voting with their resources” in favor of open-source solutions, in spite of, or perhaps even because of, its widely perceived costs.