Want to know the difference between a data scientist and a data engineer? First, imagine a data team has been tasked with building a data model. It could be any kind of model, but let’s say it’s one that predicts customer churn. While the data scientist designs the model’s framework and algorithms, the data engineer creates and maintains collection systems for the data used in the model.
Data Scientist vs. Data Engineer
When dealing with data, there are considerations of both “design” (by data scientists) and “implementation” (by data engineers), said Javed Ahmed, a senior data scientist at bootcamp and training provider Metis.
Why are such technical distinctions important, even to those not working directly with data? Because few business professionals, and even fewer business leaders, can afford not to know the difference.
“If executives and managers don’t understand how data works, and they’re not familiar with the terminology and the underlying approach, they often treat what’s coming from the data side like a black box,” Ahmed said. “They may not fully appreciate what to look for in terms of how to evaluate results.”
The Difference Between Data Science and Data Engineering
The mainstreaming of data science, data engineering and “data-driven” insights is still a relatively recent phenomenon, driven by the ever-growing volume of big data. But the core principles of each have existed for decades. “The volume of data has really exploded, and the scale has increased, but most of the techniques and approaches are not new,” Ahmed said.
For instance, age-old statistical concepts like regression analysis, Bayesian inference and probability distributions form the bedrock of data science. The statistics component is one of three pillars of the discipline, Zach Miller, lead data scientist at CreditNinja, explained to Built In. “One [pillar] is programming and computer science; one is linear algebra, stats, very math-heavy analytics; and then one is machine learning and algorithms,” he said.
What Is Data Science?
Pillars of data science
- Computer programming
- Statistics and linear algebra
- Machine learning and algorithms
Here’s our own simple definition: “[D]ata science is the extraction of actionable insights from raw data” — after that raw data is cleaned and used to build and train statistical and machine learning models.
Domain expertise is key to understanding how everything fits together, and developing domain knowledge should be a priority of any entry-level data scientist. Data scientists are also responsible for communicating the value of their analysis, oftentimes to non-technical stakeholders, in order to make sure their insights don’t gather dust. Familiarity with dashboards, slide decks and other visualization tools is key.
What Is Data Engineering?
Pillars of data engineering
- Big data storage and processing
- Data pipelines
- Model ETL (Extract, Transform, Load)
In a nutshell, data engineering involves maintaining the infrastructure that allows data scientists to analyze data and build models. Though the title “data engineer” is relatively new, this role also has deep conceptual roots. What bedrock statistics are to data science, data modeling and system architecture are to data engineering.
System architecture tracks closely to infrastructure. Depending on setup and size, an organization might have a dedicated infrastructure engineer devoted to big-data storage, streaming and processing platforms. Think Hadoop, Spark, Kafka or Azure. Where no such role exists, that work falls under the data engineer’s purview.
Likewise, data modeling — or charting how data is stored in a database — as we know it today reached maturity years ago, with the 2002 publication of Ralph Kimball’s The Data Warehouse Toolkit. Needless to say, engineering know-how, or chops, is a must. If you were to underline programming as an essential skill of data science, you’d underline, bold and italicize it for data engineers.
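To make “charting how data is stored” a little more concrete, here is a minimal sketch of a dimensional (star-schema) layout, the style of modeling Kimball’s book is known for. The table names, columns and sample rows are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3


def build_star_schema():
    """Create a tiny star schema: one fact table surrounded by two dimensions."""
    conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
    conn.executescript(
        """
        CREATE TABLE dim_customer (
            customer_key INTEGER PRIMARY KEY,
            name TEXT,
            region TEXT
        );
        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY,
            year INTEGER,
            month INTEGER
        );
        CREATE TABLE fact_orders (
            order_id INTEGER PRIMARY KEY,
            customer_key INTEGER REFERENCES dim_customer(customer_key),
            date_key INTEGER REFERENCES dim_date(date_key),
            amount REAL
        );
        """
    )
    # Hypothetical sample rows, for illustration only.
    conn.executemany(
        "INSERT INTO dim_customer VALUES (?, ?, ?)",
        [(1, "Ada", "East"), (2, "Grace", "West")],
    )
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [(20240115, 2024, 1), (20240210, 2024, 2)],
    )
    conn.executemany(
        "INSERT INTO fact_orders VALUES (?, ?, ?, ?)",
        [(1, 1, 20240115, 100.0), (2, 1, 20240210, 50.0), (3, 2, 20240115, 75.0)],
    )
    conn.commit()
    return conn


def revenue_by_region_and_month(conn):
    """Join the fact table to its dimensions and total the order amounts."""
    return conn.execute(
        """
        SELECT c.region, d.month, SUM(f.amount)
        FROM fact_orders AS f
        JOIN dim_customer AS c ON c.customer_key = f.customer_key
        JOIN dim_date AS d ON d.date_key = f.date_key
        GROUP BY c.region, d.month
        ORDER BY c.region, d.month
        """
    ).fetchall()


if __name__ == "__main__":
    print(revenue_by_region_and_month(build_star_schema()))
```

The point of the layout is that measurements (order amounts) live in one narrow fact table, while descriptive attributes (region, month) live in dimension tables that analysts can join in as needed.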
Data Scientist vs. Data Engineer: Skills, Roles and Responsibilities
What Is a Data Scientist?
A data scientist is responsible for analyzing data and extracting relevant insights and trends to inform business decisions. Data scientists also tend to build and use data models and machine learning algorithms to help find this type of information.
To become a data scientist, candidates are usually required to earn a bachelor’s degree in data science, computer science or a similar field, as well as hold several years of data analysis experience.
What Does a Data Scientist Do?
Since data science took off in the early 2000s, the role has become fairly codified. It’s a given, for instance, that a data scientist should know Python, R or both for statistical analysis; be able to write SQL queries; and have some experience with machine learning frameworks such as TensorFlow or PyTorch.
But that’s not to say every company defines the role in the same way. Take perhaps the most notable example: ETL.
ETL stands for extract, transform and load. It refers to the process of pulling messy data from some source; cleaning, massaging and aggregating the formerly raw data; and inputting the newly transformed, much-more-presentable data into some new target destination, usually a data warehouse. (Note: Since the advent of tools like Stitch, the T and the L can sometimes be inverted as a streamlining measure.)
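The three steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw CSV, the `customer_spend` table and the function names are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: inconsistent casing, stray whitespace, a missing amount.
RAW_CSV = """customer_id,plan,monthly_spend
101, Basic ,20.00
102,PREMIUM,55.50
103,basic,
101,Basic,22.00
"""


def extract(text):
    """Extract: read raw rows from the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean values, drop unusable rows, aggregate spend per customer."""
    totals = {}
    for row in rows:
        spend = row["monthly_spend"].strip()
        if not spend:  # skip rows with no amount
            continue
        customer_id = int(row["customer_id"])
        plan = row["plan"].strip().lower()
        previous_total = totals.get(customer_id, (plan, 0.0))[1]
        totals[customer_id] = (plan, previous_total + float(spend))
    return [(cid, plan, total) for cid, (plan, total) in sorted(totals.items())]


def load(conn, records):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_spend "
        "(customer_id INTEGER PRIMARY KEY, plan TEXT, total_spend REAL)"
    )
    # INSERT OR REPLACE keeps a rerun from duplicating rows.
    conn.executemany("INSERT OR REPLACE INTO customer_spend VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl(conn, text=RAW_CSV):
    """Run all three steps and return what ended up in the target table."""
    load(conn, transform(extract(text)))
    return conn.execute(
        "SELECT customer_id, plan, total_spend FROM customer_spend ORDER BY customer_id"
    ).fetchall()


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")  # stand-in for a data warehouse
    print(run_etl(warehouse))
```

In the inverted (ELT) variant mentioned in the note, the raw rows would be loaded first and the cleanup would run inside the warehouse, typically as SQL.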
ETL is more automated than it once was, but it still requires oversight. That’s traditionally been the domain of data engineers. In that sense, Ahmed is a traditionalist. He said having the ETL process owned by the data engineering team generally leads to a better outcome, especially if the pipeline isn’t a one-off.
“If you’re building a repeating data pipeline that’s going to continually execute jobs, and continually update data in a data warehouse, that’s probably something you don’t want managed by a data scientist, unless they have significant data engineering skills or time to devote to it,” Ahmed said.
But that’s not how it always plays out. Data scientists at Shopify, for example, are themselves responsible for ETL. “The data scientists are the ones that are most familiar with the work they’ll be doing, and in terms of the data sets they’ll be working with,” said Miqdad Jaffer, senior lead of data product management at Shopify.
The similarly data-forward Stitch Fix, which employs several dozen data scientists, was beating a similar drum as far back as 2016. “Engineers should not write ETL,” Jeff Magnusson, vice president of Stitch Fix, stated in no uncertain terms.
His argument is worth quoting at length:
“For the love of everything sacred and holy in the profession, this should not be a dedicated or specialized role. There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume.
Instead, give people end-to-end ownership of the work they produce (autonomy). In the case of data scientists, that means ownership of the ETL. It also means ownership of the analysis of the data and the outcome of the data science.”
What Is a Data Engineer?
A data engineer is responsible for building and maintaining system architectures that collect and process large amounts of data. These systems act as the home base from which data scientists draw their working data. Data engineers also help clean data and develop the data-pulling methods used in data models.
To become a data engineer, candidates are usually required to earn a bachelor’s degree in computer engineering, computer science or a similar field, as well as hold several years of experience in computer or software engineering, data analysis or project management.
What Does a Data Engineer Do?
Data engineers routinely maintain the systems through which data is collected and used. To keep the data modeling workflow efficient, though, data engineers must work closely with data scientists.
A data engineer’s vision for data production may conflict with how the model is actually constructed, which makes for a potential challenge. In the case of ETL practices, a data scientist might prefer a slightly different aggregation method for their modeling purposes than what the engineering team has developed. But the engineering side might be hesitant to switch, depending on the difficulty of the change, Ahmed said.
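A small example shows why the choice of aggregation method can matter to a model. The source does not say which methods are in dispute; mean versus median is an assumption made purely for illustration, and the figures are invented.

```python
from statistics import mean, median

# Hypothetical daily spend for one customer; the last value is an outlier.
daily_spend = [20.0, 22.0, 19.0, 21.0, 400.0]


def aggregate(values, method):
    """Collapse many raw values into the single feature a model would consume."""
    if method == "mean":
        return mean(values)
    if method == "median":
        return median(values)
    raise ValueError(f"unsupported aggregation method: {method}")


if __name__ == "__main__":
    # The pipeline might ship the mean, while the modeler might prefer the
    # median because it is less sensitive to the outlier.
    print("mean:", aggregate(daily_spend, "mean"))
    print("median:", aggregate(daily_spend, "median"))
```

The two methods produce very different features from the same rows, which is why switching one for the other is a real change to the pipeline and not a cosmetic one.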
But even being on the same page in terms of environment doesn’t preclude pitfalls if communication is lacking. Say a model is built in Python, with which data engineers are certainly familiar. The engineering side could potentially jump into the prototype and make changes that seem reasonable to them, “but might just make it harder for the original author to understand,” Ahmed said.
Another common challenge can crop up when data scientists train and query their models from two different sources: a warehouse and the production database. “I’ve personally spent weeks building out and prototyping impactful features that never made it to production because the data engineers didn’t have the bandwidth to productionize them,” wrote Max Boyd, a data science lead at Kaskada, in a VentureBeat guest post.
He points to feature stores as a solution, along with, more broadly, MLOps, a still-maturing framework that aims to bring the CI/CD-style automation of DevOps to machine learning.
All said, it’s tough to make generalized, black-and-white prescriptions. Even the preferred ratio of data engineers to data scientists (two or three engineers per scientist, per O’Reilly) tends to fluctuate across organizations.
“My sense is, have ownership separated, but keep people communicating a lot in terms of decisions being made,” Ahmed said.
Can a Data Scientist Become a Data Engineer?
The roles of data scientist and data engineer are distinct, though some of their tasks intersect, so it is possible to move from one occupation to the other.
In terms of convergence, SQL and Python are must-knows for both. SQL is one of the most popular languages for storing, manipulating and retrieving data, while Python is one of the most popular programming languages. But companies with highly scaled data science teams will likely prefer candidates who are also skilled in areas traditionally associated with data engineering (big data tools, data modeling, data warehousing) for managerial roles.
Company size and employee expertise level also play a role in who does what in regard to ETL and data model creation. Organizations like Shopify and Stitch Fix have sizable data teams and are upfront about their data scientists’ and engineers’ programming chops. However, smaller teams may have a tough time replicating such a workflow, so responsibilities occasionally merge across the two roles. “Not all companies have the luxury of drawing really solid lines between these two functions,” Ahmed said. “There’s often overlap.”
Despite the similarities, there are still some hurdles to keep in mind when transitioning from data scientist to data engineer, or vice versa.
An ecosystem of data science bootcamps, massive open online courses and university degrees has grown alongside the rise of data science itself. Rahul Agarwal, senior data scientist at Meta, advised in a Built In contributor piece that those remain viable options, especially for those with strong initiative. (Another key takeaway: Consider on-ramping via an analytics job.) But data engineering does not offer the same number of opportunities.
The bootcamp trend hasn’t hit data engineering quite to that extent — though some courses exist. Traditional software engineering is the more common route. Engineers who develop a taste and knack for data structures and distributed systems commonly find their way there. The job could be viewed in effect as a software engineering challenge at scale.
Just as aspiring data scientists are encouraged to know some areas of data engineering, aspiring data engineers should be mindful to exercise their analytics muscles too. “They may already know technical aspects, like programming and databases, but they’ll want to understand how their outputs are going to be consumed,” Ahmed said.
Data scientist or data engineer, both roles have become vital to creating the millions of data models used by businesses today.