It seems as though even companies — along with the job descriptions they post — are confused about what constitutes a data scientist versus a machine learning engineer. So, I’m here to provide some insight into why the two roles are separate entities and where they can overlap.
When I was first studying to become a data scientist, I was embarrassingly unaware of what a machine learning engineer does. I quickly came to realize that the field is similar to data science but different enough to require a unique set of skills. Long story short, data science involves researching, building and interpreting models, whereas machine learning involves the production of the models themselves. Now that I have been in the field for a few years, gaining experience in both disciplines, I have developed a clear outline for what constitutes a data science role versus one in machine learning engineering. So, read on to avoid making my same mistake!
Data Science Versus Machine Learning
- Data science, in its simplest terms, can be described as a field of automated statistics in the form of models that aid in classifying and predicting outcomes. data science involves researching, building, and interpreting models.
- Machine learning involves the production of the models themselves
So, is a data scientist just a statistician? Kind of. Data science, in its simplest terms, can be described as a field of automated statistics in the form of models that aid in classifying and predicting outcomes. The most important technical skills that are required to be a data scientist are:
- Python or R
- Jupyter Notebook
Let me expand on each of these a bit.
I believe most companies are looking for Python proficiency more than R. Some job descriptions list both, but most people you’ll be working with (i.e., software, data and machine learning engineers) won’t be familiar with R. Therefore, to be a more holistic data scientist, I think that Python will be more beneficial for you.
This may seem more like a data analyst tool at first. Although it is useful for that as well, it should still be a process you employ for data science. Most data sets are not given to you in the business setting (as opposed to academia), and you will have to make your own via SQL.
SQL has plenty of subtypes like PostgreSQL, MySQL, Microsoft SQL Server T-SQL and Oracle SQL. These are all similar forms of the same querying language that are hosted by different platforms. Because they’re so similar, having any of these in your toolkit is useful and can allow you to easily pick up different forms later on.
This is almost the exact opposite of a machine learning engineering tool. A Jupyter Notebook is a data scientist’s playground for both coding and modeling. It is research environment, if you will, that allows quick and easy Python coding that incorporates commenting out of code, the code itself, and a platform to build and test models from useful libraries like Sklearn, Pandas, and NumPy.
Overall, a data scientist can do many things, but the main functions of the role are the following:
What Does a Data Scientist Do?
- Meet with stakeholders to define the business problem.
- Pull data via SQL.
- Conduct EDA, feature engineering, model building and prediction via Python and Jupyter Notebook.
- Depending on the workplace, compile code to .py format and/or pickled model.
This resource from the University of California, Berkeley offers more useful information about the role, including payscale and job outlook.
Machine Learning Engineering
After the last point in the list above is where a machine learning engineer comes in. The main function of this role is to put that model into production. A data science model can be quite static sometimes, and an engineer can help to automatically train and evaluate it. They would then insert the predictions back into the data warehouse/SQL tables for your company. After that, a software engineer and UI/UX designer will put the predictions into a user interface for display if necessary.
As you can see, the whole process from defining the business problem to presenting its solution in a visible, easy-to-use format is not the sole responsibility of a data scientist, although some data scientists can do all of these roles themselves.
A machine learning engineer may also be called ML ops (short for machine learning operations). A summary of their workflow is something like this:
Machine Learning Workflow
- pkl_file of data science model
- Storage bucket (GCP — Google Cloud Composer)
- DAG (for scheduling the trainer and evaluator of the model)
- Airflow (visualizes the process — ML pipeline)
- Docker (containers and virtualization)
At first, perhaps data science and machine learning seem like interchangeable titles and fields. With a closer look, however, we can realize that machine learning combines principles of software engineering and data engineering more than it follows data science. For more information on the role, as well as visualizations and processes, check out this machine learning operations overview by Google.
Below, I will outline where the fields do and do not overlap.
Perhaps the biggest point of overlap between data science and machine learning is that they both touch the model. The main tools and principles that both fields share are:
- Concept of training and evaluating data
The comparable skills are primarily in programming and the languages each uses in the respective roles. Both positions engage in some form of engineering, whether that be a data scientist querying a database using SQL or a machine learning engineer using it to insert the suggestions or predictions from the model back into a newly labeled column/field.
Both fields require knowledge of Python (or R) and the ability to conduct version control, code sharing, and pull requests through GitHub.
A machine learning engineer may want to know how algorithms like XGBoost or random forest work and will need to look at the model’s hyperparameters for tuning in order to conduct research on memory and size constraints.
I already outlined some of these differences above, but some key features of both careers and their underlying academic philosophies are important to point out:
- Focuses on statistics and algorithms.
- Deals with unsupervised and supervised algorithms.
- Practices regression and classification.
- Interprets results.
- Presents and communicates results.
- Focuses on software engineering and programming.
- Handles automation, scaling, and scheduling.
- Incorporates model results into a table/warehouse/UI.
Not only can the two roles differ in the workplace, but in academia/education as well. There are different routes to becoming a data scientist and machine learning engineer. A data scientist might focus on statistics, mathematics, or actuarial science, whereas a machine learning engineer will have their main focus on software engineering. Some institutions do offer specifically machine learning as a certificate or degree. To learn more about becoming a data scientist online, you can read my advice here.
Data Science Versus Machine Learning
Although different people, companies and job descriptions have different ideas about what each career involves, I certainly believe that there is a significant distinction between the two positions. Some skills do indeed overlap, but in general, a data scientist focuses on statistics, model building, and interpretation of outcomes. The machine learning engineer will take that model, scale it, and deploy it into production.