Data Engineers Build. Data Scientists Analyze.
Imagine a data team has been tasked with building a model. It could be any kind of model, but let’s say it’s one that predicts customer churn. What concerns need to be addressed when getting started?
First, there are “design” considerations, said Javed Ahmed, a senior data scientist at bootcamp and training provider Metis. That includes things like what kind of algorithm will be used, how the prototype will look and what kind of evaluation framework will be required.
There are also, broadly speaking, “implementation” considerations — making sure the data pipeline is well-defined, collecting the data and making sure it’s stored and formatted in a way that makes it easy to analyze. If the model is going into a production codebase, that also means making it consistent with the company’s tech stack and making sure the code is as clean as possible.
Before any analysis can begin, “you’ve got to make sure that your customer information is correct,” said Ahmed, who helped build analytics applications for Amazon and the Federal Reserve before transitioning to data-related corporate training. “And that involves a lot of steps — updating the data, aggregating raw data in various ways, and even just getting it into a readable form in a database.”
Ahmed’s breakdown is, of course, second nature to data professionals, but it’s instructive for anyone else needing to grasp the central difference between data science and data engineering: design vs. implementation. Data scientists design the analytical framework; data engineers implement and maintain the plumbing that makes it possible.
Data Scientist vs. Data Engineer
Why are such technical distinctions important, even to data laypeople? Because few business professionals — and even fewer business leaders — can afford to be data laypeople anymore.
“If executives and managers don’t understand how data works, and they’re not familiar with the terminology and the underlying approach, they often treat what’s coming from the data side like a black box,” Ahmed said. “They may not fully appreciate what to look for in terms of how to evaluate results.”
What, Exactly, Do Data Scientists and Data Engineers Do?
The mainstreaming of data science and data engineering, when it became fashionable to label every business decision “data-driven,” is still a relatively recent phenomenon. But core principles of each have existed for decades. “The volume of data has really exploded, and the scale has increased, but most of the techniques and approaches are not new,” Ahmed said.
For instance, age-old statistical concepts like regression analysis, Bayesian inference and probability distribution form the bedrock of data science. The statistics component is one of three pillars of the discipline, explained Zach Miller, lead data scientist at CreditNinja, to Built In in March. “One is programming and computer science; one is linear algebra, stats, very math-heavy analytics; and then one is machine learning and algorithms,” he said.
Here’s our own simple definition: “[D]ata science is the extraction of actionable insights from raw data” (after that raw data is cleaned and used to build and train statistical and machine-learning models). Domain expertise is key to understanding how everything fits together, and developing domain knowledge should be a priority of any entry-level data scientist. Data scientists are also responsible for communicating the value of their analysis, oftentimes to non-technical stakeholders, in order to make sure their insights don’t gather dust. Familiarity with dashboards, slide decks and other visualization tools is key.
Pillars of data science
- Computer programming
- Statistics and linear algebra
- Machine learning and algorithms
Pillars of data engineering
- Big data storage and processing
- Data pipelines
- Model ETL (Extract, Transform, Load)
Data engineering, in a nutshell, means maintaining the infrastructure that allows data scientists to analyze data and build models. Though the title “data engineer” is relatively new, this role also has deep conceptual roots. What bedrock statistics are to data science, data modeling and system architecture are to data engineering.
System architecture tracks closely to infrastructure. Depending on set-up and size, an organization might have a dedicated infrastructure engineer devoted to big-data storage, streaming and processing platforms. Think Hadoop, Spark, Kafka, Azure, Amazon S3. Without such a role, that work falls under the data engineer’s purview.
Likewise, data modeling (charting how data is stored in a database) as we know it today reached maturity years ago, with the 2002 publication of Ralph Kimball’s The Data Warehouse Toolkit. Needless to say, engineering chops are a must. If you were to underline programming as an essential skill of data science, you’d underline, bold and italicize it for data engineers.
Sometimes a Fluid Situation
Since data science took off around the mid-aughts, the role has become fairly codified. It’s a given, for instance, that a data scientist should know Python, R or both for statistical analysis; be able to write SQL queries; and have some experience with machine learning frameworks such as TensorFlow or PyTorch.
But that’s not to say every company defines the role in the same way. Take perhaps the most notable example: ETL.
ETL stands for extract, transform and load. It refers to the process of pulling messy data from some source; cleaning, massaging and aggregating the formerly raw data; and inputting the newly transformed, much-more-presentable data into some new target destination, usually a data warehouse. (Note: Since the advent of tools like Stitch, the T and the L can sometimes be inverted as a streamlining measure.)
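The three steps can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the sample CSV, the customer_spend table and the function names are all hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: stray whitespace, inconsistent casing, a missing amount.
RAW_CSV = """customer_id,order_date,amount
 C001 ,2020-01-03,19.99
c001,2020-01-15,5.01
C002,2020-01-09,
C002,2020-02-01,40.00
"""


def extract(source):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(source)))


def transform(rows):
    """Transform: normalize IDs, drop rows with no amount, total spend per customer."""
    totals = {}
    for row in rows:
        customer_id = row["customer_id"].strip().upper()
        amount = row["amount"].strip()
        if not amount:
            continue  # skip records that cannot be aggregated
        totals[customer_id] = totals.get(customer_id, 0.0) + float(amount)
    return [(cid, round(total, 2)) for cid, total in sorted(totals.items())]


def load(records, conn):
    """Load: write the cleaned aggregates into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_spend ("
        "customer_id TEXT PRIMARY KEY, total_spend REAL)"
    )
    # INSERT OR REPLACE keeps a rerun from duplicating rows.
    conn.executemany("INSERT OR REPLACE INTO customer_spend VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(transform(extract(RAW_CSV)), conn)
result = conn.execute(
    "SELECT customer_id, total_spend FROM customer_spend ORDER BY customer_id"
).fetchall()
print(result)  # [('C001', 25.0), ('C002', 40.0)]
```

In the inverted (ELT) variant mentioned above, the raw rows would be loaded first and the cleanup would run afterward inside the warehouse, typically as SQL.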
ETL is more automated than it once was, but it still requires oversight. That’s traditionally been the domain of data engineers. In that sense, Ahmed, of Metis, is a traditionalist. He said having the ETL process owned by the data engineering team generally leads to a better outcome, especially if the pipeline isn’t a one-off.
“If you’re building a repeating data pipeline that’s going to continually execute jobs, and continually update data in a data warehouse, that’s probably something you don’t want managed by a data scientist, unless they have significant data engineering skills or time to devote to it,” he said.
But that’s not how it always plays out. Data scientists at Shopify, for example, are themselves responsible for ETL. “The data scientists are the ones that are most familiar with the work they’ll be doing, and in terms of the data sets they’ll be working with,” said Miqdad Jaffer, senior lead of data product management at Shopify.
The similarly data-forward Stitch Fix, which employs several dozen data scientists, was beating a similar drum as far back as 2016. “Engineers should not write ETL,” Jeff Magnusson, vice president of the clothing service’s data platform, stated in no uncertain terms.
His argument is worth quoting at length:
“For the love of everything sacred and holy in the profession, this should not be a dedicated or specialized role. There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume.
Instead, give people end-to-end ownership of the work they produce (autonomy). In the case of data scientists, that means ownership of the ETL. It also means ownership of the analysis of the data and the outcome of the data science.”
Company size and employee expertise level surely play a role in who does what in this regard. Organizations like Shopify and Stitch Fix have sizable data teams and are upfront about their data scientists’ programming chops. Smaller teams may have a tough time replicating such a workflow.
“Not all companies have the luxury of drawing really solid lines between these two functions,” Ahmed said. “There’s often overlap.”
Keeping Data Scientists and Data Engineers Aligned
Of course, overlap isn’t always easy. Whenever two functions are interdependent, there’s ample room for pain points to emerge.
Speaking of ETL, a data scientist might prefer, say, a slightly different aggregation method for their modeling purposes than what the engineering team has developed. But the engineering side might be hesitant to switch, depending on the difficulty of the change, Ahmed said.
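To see why the choice of aggregation can matter to a modeler, consider a small sketch. The customer IDs and monthly order counts below are invented for illustration, and the aggregate helper is hypothetical; the point is only that the same raw history yields very different feature values under a mean and a median.

```python
from statistics import mean, median

# Hypothetical monthly order counts per customer; the values are made up.
monthly_orders = {
    "C001": [2, 3, 2, 3],
    "C002": [1, 1, 1, 21],  # one unusually busy month
}


def aggregate(orders_by_customer, method):
    """Collapse each customer's history into a single feature value."""
    return {cid: method(values) for cid, values in orders_by_customer.items()}


by_mean = aggregate(monthly_orders, mean)
by_median = aggregate(monthly_orders, median)

for cid in monthly_orders:
    print(cid, "mean:", by_mean[cid], "median:", by_median[cid])
```

For the steady customer the two methods agree, while for the customer with one outlier month the mean is six times the median. A pipeline built around one method will hand the model a different picture than a pipeline built around the other.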
Another potential challenge: The engineer’s job of productionizing a model could be tricky depending on how the data scientist built it. Ahmed recalled working at an organization with a fellow data scientist who was highly experienced, but only used MATLAB, a language that still has some footing in science and engineering realms, but less so in commercial ones. Hardly any data engineers have experience with it. “That causes all sorts of headaches, because they don’t know how to integrate it into the tech stack,” he said.
But even being on the same page in terms of environment doesn’t preclude pitfalls if communication is lacking. Say a model is built in Python, with which data engineers are certainly familiar. The engineering side could potentially jump into the prototype and make changes that seem reasonable to them, “but might just make it harder for the original author to understand,” Ahmed said.
Another common challenge can crop up when data scientists train and query their models from two different sources: a warehouse and the production database. “I’ve personally spent weeks building out and prototyping impactful features that never made it to production because the data engineers didn’t have the bandwidth to productionize them,” wrote Max Boyd, a data science lead at Seattle machine learning studio Kaskada, in a recent VentureBeat guest post.
He points to feature stores as a solution, along with, more broadly, MLOps, a still-maturing framework that aims to bring the CI/CD-style automation of DevOps to machine learning.
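The core idea behind a feature store can be sketched as defining each feature once and reusing that definition for both training and serving. The toy registry below is an assumption-laden illustration, not the API of Kaskada or any real feature store product; every name in it (FEATURES, compute_features, the customer fields) is hypothetical.

```python
from datetime import date

# Hypothetical registry: each feature is defined exactly once, by name.
FEATURES = {}


def feature(name):
    """Decorator that registers a feature function under the given name."""
    def register(fn):
        FEATURES[name] = fn
        return fn
    return register


@feature("days_since_last_order")
def days_since_last_order(customer, as_of):
    return (as_of - customer["last_order"]).days


@feature("orders_per_month")
def orders_per_month(customer, as_of):
    months = max((as_of - customer["signup"]).days / 30, 1)
    return round(customer["order_count"] / months, 2)


def compute_features(customer, as_of):
    """Run every registered feature against one customer record."""
    return {name: fn(customer, as_of) for name, fn in FEATURES.items()}


record = {
    "signup": date(2020, 1, 1),
    "last_order": date(2020, 5, 21),
    "order_count": 10,
}
as_of = date(2020, 5, 30)

# Offline: build training rows from warehouse records (a one-row stand-in here).
training_set = [compute_features(row, as_of) for row in [record]]

# Online: compute the same features from a production record at request time.
serving_features = compute_features(dict(record), as_of)

print(training_set[0] == serving_features)  # True
```

Because both paths call the same definitions, a feature a data scientist prototypes against the warehouse does not need to be reimplemented by an engineer before it can be served.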
All said, it’s tough to make generalized, black-and-white prescriptions. Even the preferred ratio of data engineers to data scientists (two or three engineers per scientist, per O’Reilly) tends to fluctuate across organizations. “My sense is, have ownership separated, but keep people communicating a lot in terms of decisions being made,” Ahmed said.
He circles back to pipelines. Any repeating pipeline needs to be periodically re-evaluated. “You’d absolutely want to include both the data science and data engineering teams for a re-evaluation,” he said.
How to Get There
The roles of data scientist and data engineer are distinct, though with some overlap, so it follows that the path toward either profession takes different routes, though with some intersection.
In terms of convergence, SQL and Python — the most popular programming languages in use — are must-knows for both. But companies with highly scaled data science teams will likely prefer candidates who are also skilled in areas traditionally associated with data engineering (big data tools, data modeling, data warehousing) for managerial roles.
Data science degrees from research universities are more common than, say, five years ago. New York University and the University of Virginia, for instance, both offer a master’s in data science. But tech’s general willingness to value demonstrated learning on at least equal footing with diplomas extends to data science as well.
An ecosystem of bootcamps and MOOCs, many of which are taught through a Python lens, mushroomed alongside the rise of data science, circa 2010. Rahul Agarwal, senior data scientist at WalmartLabs, advised in a recent Built In contributor post that those remain viable options, especially for those with strong initiative. (Another key takeaway: Consider on-ramping via an analytics job.)
The bootcamp trend hasn’t hit data engineering quite to that extent — though some courses exist. Traditional software engineering is the more common route. Engineers who develop a taste and knack for data structures and distributed systems commonly find their way there. The job could be viewed in effect as a software engineering challenge at scale.
But aspiring data engineers should be mindful to exercise their analytics muscles some too. “They may already know technical aspects, like programming and databases, but they’ll want to understand how their outputs are going to be consumed,” Ahmed said.