It’s one of the eternal questions for any campaign ahead of an election: Who is likely to actually show up at the polls?
For data scientist Matthew Brems, trying to answer that question gave rise to another, about one of the most important challenges in machine learning: How can I build the most effective machine learning pipeline?
Ahead of the 2016 presidential election, Brems helped build a likely voter model for a consulting firm. It hinged on two types of data sets: voter registration and polling data culled from surveys. The firm conducted polls several times a week in different states across the country in order to feed the survey component.
Brems’ team also wrote a script to automate some cleaning of that polling data, which was then matched with the registration data, split into training and testing sets, and used to fit multiple models. Once his team validated the best option, the model could be deployed to make predictions for any state.
The project gets to the heart of machine learning pipelines: automation and reproducibility. Which steps can be automated to save time and resources? Answering the question is harder than it sounds.
“Automation is really helpful to organizations, but it’s not without its pitfalls,” said Brems, who now works as a data science instructor at General Assembly.
Broadly speaking, ML engineers try to avoid pitfalls by employing the kinds of agile workflows that migrated from DevOps and took root as MLOps.
“Think about a lean approach — having sufficiently good progress at ingestion, preprocessing, modeling and deployment, rather than being perfect at each stage,” said Joseph Nelson, co-founder of Roboflow and a fellow General Assembly data science instructor.
So what does that look like in practice? We asked five expert data pipeline builders to offer some pointers.
Best Practices for Building a Machine Learning Pipeline
- Clarify your concept. Getting this right can be harder than the implementation.
- Make sure data collection is scalable. An API can be a good way to do that.
- Ensure that your data input is consistent. Unexpected inputs can break or confuse your model.
- Look out for changes in your source data. If your model depends on data you don’t control, do spot checks to ensure you’re measuring what you think you are.
- Don’t reinvent the wheel. Someone else probably built a good foundation for your project.
- Deploy fast. Your model isn’t helping anyone until it’s out there.
- Unless mistakes have severe consequences. Obviously.
Problem Definition and Data Exploration
Clarify Your Concept
Arun Nemani, Senior Machine Learning Scientist at Tempus: For the ML pipeline build, the concept is much more challenging to nail than the implementation. From a technical perspective, there are a lot of open-source frameworks and tools to enable ML pipelines — MLflow, Kubeflow. The biggest challenge is to identify what requirements you want for the framework, today and in the future.
For example, do we need version-controlled data services, such as DVC, integrated, or will date-stamped folders with unique IDs suffice? Do we need automated ingestion with cloud services, or can we manage the workflow with Python scripts and notebooks on a single server? Clearly answering those current and future needs will greatly increase the value of the build.
Think Critically About Categorical Variables — Long Before Modeling
Nemani: The biggest impact on model robustness from categorical variables actually comes before any modeling steps, during data exploration.
Ask questions like: How was that categorical variable sourced? Automatically, or through data entry? How many unique entries are in that particular categorical column? If it’s too many, would this pose a problem for overfitting?
Should we even use specific categorical variables? For example, should we use categories such as patient race for healthcare? Does this improve or harm equitable ML models? Is that particular categorical variable measured after or before the machine learning target variables? Most categorical variables measured after the target would be a train wreck!
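A quick exploration pass in pandas can surface most of these issues before modeling. Here is a minimal sketch, assuming a pandas DataFrame loaded from a hypothetical file, that flags high-cardinality categorical columns of the kind Nemani warns about; the file name and threshold are placeholders.

```python
# Minimal exploration sketch: report the cardinality of each categorical
# column and flag columns that look more like identifiers than categories.
import pandas as pd

df = pd.read_csv("patients.csv")  # hypothetical input file

categorical_cols = df.select_dtypes(include=["object", "category"]).columns
for col in categorical_cols:
    n_unique = df[col].nunique()
    print(f"{col}: {n_unique} unique values")
    # Very high cardinality (e.g., free-text data entry) is a red flag for
    # overfitting and often signals the column was hand-entered.
    if n_unique > 0.5 * len(df):
        print(f"  warning: {col} looks closer to an identifier than a category")
```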
Ingestion and Pre-Processing
APIs to Automate and Scale Data Collection
Joseph Nelson is a co-founder of Roboflow, which creates computer vision infrastructure. For example, Roboflow user and General Assembly graduate Jamie Shaffer built an object detection model that counts salmon. She experimented with various preprocessing and augmentation options and trained a YOLOv5 model to track fish populations.
Nelson: If you’re building a model for a new domain, a common tactic is to collect some number of images of, in this case, a fish, to bootstrap a model. That helps even if it’s not the exact same fish that will exist in your inference conditions — if I just have some freshwater fish rather than salmon specifically, in this example.
“It behooves a team to collect images that are similar to those you’d see under deployed conditions.”
It behooves a team to collect images that are similar to those you’d see under deployed conditions. How can you scalably collect new data, both for building that first model and, more importantly, for the ongoing monitoring and improvement of the given model? That could include building an API to automatically collect images from cameras wherever they’re deployed.
That data ingestion piece is not only where things begin, but it’s incredibly relevant for the ongoing performance and improvement of a model.
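What such an ingestion API looks like varies by team; the sketch below is just one illustration, using FastAPI (an assumption, not something Nelson specifies), of an endpoint that deployed cameras could POST images to so each upload lands in a folder for later labeling, monitoring and retraining.

```python
# Hypothetical image-ingestion endpoint: cameras in the field POST images
# here, and each upload is stored with a camera ID and timestamp.
from datetime import datetime, timezone
from pathlib import Path

from fastapi import FastAPI, File, UploadFile

app = FastAPI()
STORAGE_DIR = Path("incoming_images")  # hypothetical landing zone
STORAGE_DIR.mkdir(exist_ok=True)

@app.post("/images")
async def ingest_image(camera_id: str, image: UploadFile = File(...)):
    # Timestamped filenames keep uploads from different cameras separate
    # and make it easy to sample recent data for monitoring.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    dest = STORAGE_DIR / f"{camera_id}_{stamp}_{image.filename}"
    dest.write_bytes(await image.read())
    return {"stored_as": dest.name}
```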
Data Consistency Is Critical for Cleaning
Matthew Brems, Lead Data Science Instructor at General Assembly: We were able to automate the data cleaning component of our voter model because everything took a similar form: Press 1 for Candidate A, press 2 for Candidate B, press 9 to repeat the question. Everything was predictable enough so that when we were writing these scripts, whether the polls had 10 questions or 20 questions, they still followed a similar format.
But if you’re working with a totally different data set every day, you’re limited in the amount of automation you can do. Consistency of data is pretty critical in being able to automate at least the cleaning part of it.
If you’re getting data from 20 different sources that are always changing, it becomes that much harder. Your pipeline is gonna break. But if data follows a similar format in an organization, that often presents an opportunity for automation.
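In practice, that consistency can be enforced with a small validation step in front of the automated cleaning script. The sketch below is a hypothetical example for poll files like the ones Brems describes; the column names and answer codes are assumptions.

```python
# Check that an incoming poll file follows the expected format before the
# automated cleaning script touches it.
import pandas as pd

EXPECTED_COLUMNS = {"respondent_id", "question", "answer_code"}
VALID_ANSWER_CODES = {1, 2, 9}  # 1 = Candidate A, 2 = Candidate B, 9 = repeat

def validate_poll_file(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Poll file {path} is missing columns: {missing}")
    bad_codes = set(df["answer_code"].unique()) - VALID_ANSWER_CODES
    if bad_codes:
        raise ValueError(f"Unexpected answer codes in {path}: {bad_codes}")
    return df
```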
Set Cadences
Brems: Organizations should ask their teams: “Is there something you’re doing every day? Every week? Every month?” Understand what employees are spending their time on and see if there’s a way it can be automated.
Any time you’re collecting new data that’s in a consistent format, there’s probably going to be some opportunity to automate.
Take recommender systems at streaming platforms. They’re constantly getting data and wanting to see whom they should push new content to. Maybe every Sunday, at 5 p.m., there’s a pipeline that’s triggered, automatically pulling in new information gathered over the past week. Then they refit their classification models or recommendation systems with the latest and greatest information.
Or take lifetime customer value forecasting. What’s the total value of a customer throughout the next, say, 24 months? As people continue to purchase new products, rerun the model with all the information that we gathered over the last month or week.
If something happens on a certain cadence — if an organization says, “We want to be able to refresh our models once a week, once a month, once a quarter, once a year” — pipelines can be helpful in doing that. Especially if it’s on a predictable cadence, the more you can automate that process, the better.
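A weekly refresh like the streaming example above can be as simple as a scheduled job. Here is a minimal sketch using the schedule package with a placeholder retrain() function; in production this is more often a cron entry or an orchestrator schedule in a tool like Airflow or Prefect.

```python
# Run a model refresh every Sunday at 5 p.m.
import time

import schedule  # pip install schedule

def retrain():
    # Placeholder for the real work: pull the last week of data, refit the
    # model, and redeploy it.
    print("refreshing model with the latest week of data")

schedule.every().sunday.at("17:00").do(retrain)

while True:
    schedule.run_pending()
    time.sleep(60)
```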
Feature Engineering and Stores
Automate Feature Engineering Where Possible, and Unit Test...
Brems: Feature engineering, broadly, is when you have columns or variables in your data set that you need to [transform into] new ones. It takes subject matter expertise, and it’s where a lot of organizations are going to find value in their data.
If you’re working with similar data sets, it’s something you can largely automate. In churn prediction, for example, an organization’s definition of churn doesn’t really change. You define it as somebody who hasn’t made a purchase in the last X months, then write a line in Python or R that gives them a one if they’ve churned and a zero if not. Similarly, the definition of a likely voter doesn’t change. So that just becomes one line of code in your script, added for every new feature you want to engineer.
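That one line of code might look like the following pandas sketch, assuming a hypothetical months_since_last_purchase column and a six-month churn window.

```python
# Engineer a churn flag: 1 if no purchase in the last six months, else 0.
import pandas as pd

df = pd.DataFrame({"months_since_last_purchase": [1, 8, 3, 14]})
df["churned"] = (df["months_since_last_purchase"] > 6).astype(int)
```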
Virgile Landeiro, Data Scientist at Tempus: Unit tests should at least validate your feature processing steps and data splitting steps — and ideally your model outputs as well.
Unit tests on feature processing and data splitters help ensure that you’re not leaking information from your training to your testing data set. This is vital in making sure your model will perform well on unseen data. Without unit tests, you might figure out you’re leaking information only a few months before product launch, and you’ll have to retrain your model and review performance again. In the worst-case scenario, information leakage will go undetected and your served model will perform poorly on unseen data.
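A minimal example of that kind of unit test, assuming a hypothetical group-based splitter, checks that no group (a patient, a customer, a voter) ever appears in both the training and testing sets.

```python
# Unit test against leakage: hold out whole groups, then assert that the
# train and test sides share no group IDs.
import numpy as np

def split_by_group(groups, test_fraction=0.2, seed=0):
    # Hypothetical splitter: assign entire groups, not individual rows,
    # to the test set.
    rng = np.random.default_rng(seed)
    unique = np.unique(groups)
    n_test = int(len(unique) * test_fraction)
    test_groups = set(rng.choice(unique, size=n_test, replace=False))
    test_mask = np.isin(groups, list(test_groups))
    return ~test_mask, test_mask

def test_no_group_leakage():
    groups = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
    train_mask, test_mask = split_by_group(groups)
    assert not set(groups[train_mask]) & set(groups[test_mask])
```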
...But Be Intentional About Feature Extraction
Brems: Feature extraction describes a broad group of statistical methods to reduce the number of variables in a model while still getting the best information available from all the different variables. If you fit a model with 1,000 variables versus a model with 10 variables, that 10-variable model will work significantly faster.
Feature extraction is something that people may automate, if you always know, for instance, “Hey, I only want 10 variables in my model.” You can apply feature extraction algorithms as a type of transformer that pre-processes your data before you fit a model to that data and start generating predictions from it.
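In scikit-learn terms, that transformer might be PCA dropped into a Pipeline ahead of the model, as in this sketch; the dataset and estimator are placeholders.

```python
# Feature extraction as a pre-processing transformer: reduce 1,000 input
# variables to 10 derived ones, then fit the model on those.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=1000, random_state=0)

pipeline = Pipeline([
    ("extract", PCA(n_components=10)),       # keep 10 derived variables
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)
```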
I don’t recommend feature extraction for all organizations. If a company has the computing power or enough data, then feature extraction may not be right for them. But it can have a lot of benefits. It’s going to be very case by case.
Without subject matter expertise, or without knowing that extraction tends to work really well on your data, it’s probably going to be more of a one-off, and you might not want to automate that part of the process.
Experiment, Then Codify
Nelson: When it comes to pre-processing, there’s experimentation and there’s codification.
Experimentation means making it really easy for data scientists to try out different preprocessing steps. Then, once you find what works well, how do you codify those steps?
Now you’re seeing organizations build things like feature stores — Uber’s internal product Michelangelo, for example, helps data scientists avoid reinventing the wheel when figuring out which features work best. If, say, a pricing team in Washington, D.C., finds it useful to do surge pricing at a specific time and place, the team can reuse that same set of pre-processing attributes without experimenting unnecessarily.
Pinterest built an internal product for pre-processing images in a repeatable way, which was the inspiration for Roboflow. The goal is to make it easy for computer vision teams to have feature stores, if you will — trying different features on a single set of images. Because if it worked well for identification in one [environment], the same steps may work well in another.
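For images, codifying those steps can be as simple as publishing one shared transform that every project imports instead of re-deriving its own. The sketch below uses torchvision as an illustration; the specific steps and sizes are assumptions, not Roboflow’s or Pinterest’s actual settings.

```python
# A single, shared pre-processing recipe, codified after experimentation.
from torchvision import transforms

SHARED_PREPROCESS = transforms.Compose([
    transforms.Resize((416, 416)),           # match the detector's input size
    transforms.ColorJitter(brightness=0.2),  # augmentation found useful in experiments
    transforms.ToTensor(),
])
```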
Beware of Source Data Changes
Brems: If your source data changes, that can cause big issues. In our likely voter model, we got data from the 50 different secretaries of state around the country. So if any one of them changed their data, the automation procedure we had put in place would break, because information would not end up in the right place. The worst scenario is when they change the columns but the data still looks similar. You may not even notice.
If I were working for an organization that wanted to automate some data cleaning, doing both automated and manual checks on the data is really important.
For example, we built a series of models that focused on Iowa. We were getting ready to send them to the client. And unbeknownst to us, the data we had been sent was for Ohio, not Iowa, and our system didn’t detect it. Had we sent that to the client, it would have looked bad. So be wary of those data changes and make sure there are checks in place, like summary statistics and some manual checks. That can be very low tech, like spot-checking 10 rows of data in a spreadsheet to make sure they match what’s expected.
That helps protect against changes in the source data, which are inevitable. That’s going to happen.
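An automated version of that spot check can be very small. The sketch below, with hypothetical column names, would have caught the Ohio-for-Iowa mix-up before anything went out the door.

```python
# Confirm the file contains the expected state and print summary statistics
# for a quick sanity check against the last known-good snapshot.
import pandas as pd

def sanity_check(path: str, expected_state: str = "IA") -> pd.DataFrame:
    df = pd.read_csv(path)
    states = set(df["state"].unique())
    if states != {expected_state}:
        raise ValueError(f"Expected only {expected_state}, found {states}")
    # Summary statistics as a quick drift check; eyeball them or diff them
    # against a stored snapshot.
    print(df.describe(include="all"))
    return df
```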
Andrew Candela, Engineer at Lambda School: We use dbt at Lambda School. That has testing functionality built in. In ETL, sometimes a key that you expect to be a primary key winds up with some duplication in one of the downstream tables. Suddenly you’re training on messed-up data.
dbt is helpful because you can set up automatic tests to check that a primary key is unique or that a field is never null. That’s the kind of stuff that would bite us all the time at [my previous role]. Maybe some product engineer decides to change the name of an event type, and suddenly I’m pulling in all nulls for [a field]. You need tools to help you QA and inspect data.
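In dbt those tests are declared in configuration; as a rough Python analogue (with hypothetical table and column names), the same two checks look like this.

```python
# Assert that a primary key column is never null and never duplicated.
import pandas as pd

def check_primary_key(df: pd.DataFrame, key: str) -> None:
    assert df[key].notna().all(), f"{key} contains nulls"
    assert df[key].is_unique, f"{key} has duplicate values"

events = pd.DataFrame({"event_id": [1, 2, 3], "event_type": ["click", "view", "click"]})
check_primary_key(events, "event_id")
```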
Model Comparison
For Data Versioning, Start Simple
Landeiro: A simple folder and file-naming scheme can go a long way to share data within a small team and ensure that you’ll be able to re-run models on old data in a couple of months — and that you are keeping track of changes in your data. You can consider using data versioning tools if your versions get out of hand. Some open-source tools, like DVC, offer a lot more flexibility when it comes to data versioning, but they also carry a learning curve that might not be worth it for small data science teams.
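A minimal version of that naming scheme, with the folder layout as an assumption, might look like this: each data pull gets a date-stamped folder with a short unique ID, so old model runs can point back to the exact snapshot they were trained on.

```python
# Create a date-stamped, uniquely identified folder for each data snapshot.
import uuid
from datetime import date
from pathlib import Path

def new_data_dir(root: str = "data") -> Path:
    run_id = uuid.uuid4().hex[:8]
    path = Path(root) / f"{date.today():%Y-%m-%d}_{run_id}"
    path.mkdir(parents=True, exist_ok=False)
    return path

snapshot_dir = new_data_dir()  # e.g. data/2021-03-07_1a2b3c4d
```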
For Orchestration, Consider Features and Flexibility
Landeiro: I’ve used MLflow (open source only) in the past for experiment tracking and thought it was a great tool. By just adding a few lines to your code, you can track metrics, parameters and artifacts, such as images. So you can easily track your best models and share a Kaggle-style leaderboard with your team.
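Those few lines might look like the following sketch; the parameter names, metric and artifact file are placeholders.

```python
# Log parameters, a metric and an artifact for one experiment run.
import mlflow

with mlflow.start_run():
    mlflow.log_param("model_type", "logistic_regression")
    mlflow.log_param("C", 1.0)
    mlflow.log_metric("val_auc", 0.87)
    mlflow.log_artifact("roc_curve.png")  # any local file, e.g. a saved plot
```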
I’ve enjoyed using Metaflow. Netflix open sourced it in 2019, and it provides similar features to MLflow Tracking for users that don’t need a GUI. It provides dependency management through Conda, abstractions to organize your research (tags and user spaces) and it integrates with AWS to automatically store your results and scale your jobs on AWS Batch. One downside of Metaflow — for now — is that it’s only compatible with AWS.
The choice is a matter of provided features (experiment tracking, data versioning, GUI), cloud compatibility, and whether or not it’s easily extensible.
Nemani: For hyperparameters, scikit-optimize provides a drop-in replacement for scikit-learn’s search objects, like GridSearchCV, that implements Bayesian optimization. Finding optimal parameters for neural network architectures is a bit more challenging — however, there are tools available, such as Keras Tuner, for automated hyperparameter tuning.
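A sketch of that drop-in swap, with an illustrative search space on a toy dataset, might look like this (assuming scikit-optimize is installed):

```python
# Replace GridSearchCV with scikit-optimize's BayesSearchCV; the rest of the
# scikit-learn code stays the same.
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from skopt import BayesSearchCV  # pip install scikit-optimize

X, y = load_breast_cancer(return_X_y=True)

search = BayesSearchCV(
    SVC(),
    {"C": (1e-3, 1e3, "log-uniform"), "gamma": (1e-4, 1e1, "log-uniform")},
    n_iter=20,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```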
Model Selection and Production
Adapt, Don’t Build Ground-Up
Nelson: Researchers are creating open-source model architectures that work exceptionally well for a number of different domains. There’s a joke that Kaggle is just XGBoost, because it’s won the most competitions — if you’re doing Kaggle, just grab an XGBoost model.
When doing image problems, like object detection, you’re probably going to grab something from the YOLO family of models. Or in text, a BERT-based transformer.
Often when modeling, you’re better off not building from scratch: start with an existing open-source, state-of-the-art architecture, get a baseline level of performance, and then fine-tune both the data and elements of that model to your domain.
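As a generic illustration of that fine-tuning pattern (using a torchvision ResNet rather than the YOLO or BERT examples above, and with the class count and learning rate as assumptions):

```python
# Start from a pretrained backbone, replace the final layer for your own
# classes, and fine-tune only the new head at first.
import torch.nn as nn
import torch.optim as optim
from torchvision import models

model = models.resnet18(weights="DEFAULT")  # pretrained=True in older torchvision
for param in model.parameters():
    param.requires_grad = False             # freeze the backbone

model.fc = nn.Linear(model.fc.in_features, 2)  # new head for a 2-class problem

optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)
# ...train only the new head on your domain data, then optionally unfreeze
# deeper layers for further fine-tuning.
```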
Landeiro: A lot of the code we use is open source. There are great Python packages out there no matter how complex your pipeline is; a few examples are Luigi, Snakemake, Metaflow and Prefect. If you need to write custom code, these packages are often written in a modular way that allows for it.
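As an illustration of that modularity, a minimal Metaflow flow is just a class of steps, and custom code slots into any of them; the step bodies below are placeholders.

```python
# A minimal Metaflow flow: run with `python flow.py run`.
from metaflow import FlowSpec, step

class TrainingFlow(FlowSpec):

    @step
    def start(self):
        self.records = list(range(100))  # stand-in for real data loading
        self.next(self.train)

    @step
    def train(self):
        self.model_summary = f"trained on {len(self.records)} records"
        self.next(self.end)

    @step
    def end(self):
        print(self.model_summary)

if __name__ == "__main__":
    TrainingFlow()
```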
Deploy, Deploy, Deploy...
Nelson: Machine learning, especially at big companies, is being pushed the wrong way. ML is a game of probabilistic output: you will likely never see 100 percent [effectiveness], and you will always be able to improve your model. That’s really challenging for traditional companies to grasp. Letting perfect be the enemy of the good is very easy, because the model can always be better.
“Letting perfect be the enemy of the good is very easy, because the model can always be better.”
The mistake comes in not deploying early enough, saying, “We could always try another model,” or, “We could always see about doing inference faster.” Get an initial model out there, while also having ways to handle error cases and improve model performance. Think less waterfall and more lean agile development.
Even if a model can only support 10 inferences a second, that’s fine. Create queues and only release it to some users. That way you can shorten the production lifecycle down to a single quarter. By quarter two, you’ll have a stream of data from your deployed conditions. That’s what good ML teams are doing.
Candela: Business needs, customer behavior and the data itself changes constantly. If you expect to deploy a model and then have it perform in production unaltered for greater than a month or so at a time, then you’re lucky to exist in a very static industry.
...Unless Getting It Right Is a Matter of Life or Death
Nemani: It’s crucial to determine the business problem you’re solving and whether deploying frequently serves that need. If you’re building a recommendation engine for content delivery, it may be useful to deploy quickly, gather production information about failure modes and adjust new models. At Tempus, however, we’re building disease models in oncology and cardiovascular disease that may impact the mortality of patients. In that scenario, it’s no longer about deploying frequently, but rather about making sure the models are robust prior to deployment.