As a society, we’re drawn to stories of overnight success and of mythical genius that achieves the impossible by building something from nothing. In reality, though, success is more often the result of slow, consistent experimentation and tinkering. Creating an environment that supports the iterative development of ideas and validates them in phases could be the key to unlocking the wins that might later be reported as “overnight” success.
At Transfix, we’ve followed this exact script to grow our business consistently year over year. We empower our data science team to be self-sufficient so they have room to experiment while also giving them the tools necessary to enable smoother collaboration when it’s time to take those ideas to the next level and integrate them into our broader processes.
When we launched our first forecasting model to production nearly five years ago, the positive impact it had on some of our business processes was immediately apparent. Using this model, we were able to generate contract pricing guidance for thousands of city-to-city lanes in minutes, a process that had previously taken days, while still allowing our team to make adjustments based on their knowledge of the customer and market.
Given these successes, we pushed additional models into service. Although we continued to reap benefits from them, the process of taking them from Python notebook to production took a lot longer than we wanted it to. Deployment also required spinning up custom infrastructure and careful oversight to complete model training. Making changes was cumbersome, and our existing tooling couldn’t be repurposed for future models. Meanwhile, our data science team had several new ideas brewing in their notebooks, leading to increasing frustration about the org’s ability to support the team in making their ideas a reality.
So, a couple of years ago, we set out to understand why things were taking so long. One of the things we had to internalize was that the ability to develop complex models in Python doesn’t translate to the ability to get those models into production and vice versa. Development and deployment are very different skill sets. Meanwhile, our engineering teams were deploying code to production multiple times a day without any support from Ops or SRE.
Empowering the Team
We wanted to see if we could bring this kind of freedom and speed of iteration to our data science team, so we started out with this statement in a Google doc:
“Data scientists at Transfix should be able to quickly and safely develop, deploy and validate new models in a production environment with very little engineering/infrastructure support.”
Given this goal, instead of dealing with machine learning (ML) frameworks and algorithms, we decided to focus our efforts on people and workflows. We built a framework and a set of strong recommendations that we collectively refer to as the Machine Learning Platform (MLP). Together, these established a tight collaboration loop between the data science and data engineering teams, as well as the engineering teams that are the end-users of the machine learning models.
And it’s worked. Since the MLP’s launch, our data science team has managed to deploy over a dozen new models to production with minimal involvement from our engineering and infrastructure teams.
We achieved this result by anchoring the MLP project to three key principles.
3 Principles for Empowering a Data Science Team
- Open access.
- Compose using familiar building blocks.
- Acknowledge the translation step and create a shared vocabulary.
1. Open Access
Typically, spinning up new services or launching new servers involves having a conversation with the infrastructure group and getting their buy-in. We removed this step and gave our data science team the freedom to spin up any compute or storage resource they needed using AWS SageMaker.
We designed the system so that everyone on the team has access to the same information: when the model was trained, which container image (and code) was used, how long it ran, the instance size used, compute resources consumed, scaling configuration for the inference endpoints and so on. The infrastructure team can use the SageMaker/AWS UI to audit usage periodically and make recommendations. The fact that it ties into our existing AWS account management and billing setup is a nice added bonus. We also make the observability stack accessible to everyone so we can work as a team when debugging issues.
By switching from gatekeeping to an open-by-default stance on infrastructure access, we saved time that had previously been spent figuring out infrastructure needs too early in the process. This also freed our data science team to focus on the ML models and experiment as much as they needed until they got a result they were happy with.
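To illustrate what that self-service can look like in practice, the sketch below assembles the request body for SageMaker’s CreateTrainingJob API entirely in Python. The image URI, role ARN, bucket layout and instance defaults are all hypothetical placeholders, not our actual configuration; a data scientist would fill in their own values and pass the result to `boto3.client("sagemaker").create_training_job(**config)`.

```python
from datetime import datetime, timezone

def build_training_job_config(model_name, image_uri, role_arn,
                              bucket, instance_type="ml.m5.xlarge"):
    """Assemble a CreateTrainingJob request body.

    All names (bucket layout, role, image) are illustrative; the data
    scientist picks the instance size and container themselves.
    """
    # Unique job name: model name plus a UTC timestamp.
    job_name = f"{model_name}-{datetime.now(timezone.utc):%Y%m%d-%H%M%S}"
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,     # container image built by CI
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {
            "S3OutputPath": f"s3://{bucket}/artifacts/{model_name}/",
        },
        "ResourceConfig": {
            "InstanceType": instance_type,  # chosen by the data scientist
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

config = build_training_job_config(
    "lane-pricing",
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/mlp:abc123f",
    "arn:aws:iam::123456789012:role/mlp-training",
    "mlp-models",
)
```

Because the whole request is plain data, it is easy to log, diff and audit later — which is what makes the open-access model workable.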
2. Compose Using Familiar Building Blocks
We required that all code be checked into GitHub, even during the experimentation phase. Our CI system (CircleCI) runs tests, builds a container image and pushes the image to a container registry (AWS ECR).

This workflow will look familiar to many engineers, but it is not common for data scientists to be thinking about continuous integration and containers. By adding version control, CI and automated container builds, we enabled collaboration both within the data science team and across other teams. Anyone could pull up the code in GitHub, share links when discussing changes, pull down the container image if needed and execute it in an environment that closely resembled production. The setup also let us track code lineage: a model prediction could point back to the training artifact and code used, so we could tell whether changes in predictions were due to potential model drift or to changes in the code.
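One lightweight way to make that lineage concrete is to stamp every training run with the git commit and container image that produced it and store that record next to the model artifact. The field names below are illustrative, not our actual schema:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class LineageRecord:
    """Metadata stored alongside a model artifact (illustrative schema)."""
    model_name: str
    git_sha: str        # commit the CI build was triggered from
    image_uri: str      # ECR image that ran training
    trained_at: str     # ISO-8601 timestamp of the training run
    artifact_path: str  # S3 location of the serialized model

record = LineageRecord(
    model_name="lane-pricing",
    git_sha="abc123f",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/mlp:abc123f",
    trained_at="2021-06-01T12:00:00Z",
    artifact_path="s3://mlp-models/artifacts/lane-pricing/model.tar.gz",
)

# Serialize next to the artifact; a prediction can then be traced back
# to the exact code and image that produced the model.
lineage_json = json.dumps(asdict(record), indent=2)
```

Tagging the container image with the same git SHA closes the loop: the record, the image and the commit all name each other.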
We use Airflow for workflow orchestration and scheduling, a relational database as our feature store and S3 to store offline model artifacts. To tie it all together, we built a command-line interface (CLI) and some custom glue code in Python to keep it accessible to our data science team.
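A CLI in that spirit might expose the common workflow steps as subcommands. This is a minimal sketch using Python’s standard `argparse`, with made-up command and flag names rather than our actual tool:

```python
import argparse

def build_parser():
    """A thin CLI over the platform's common steps (names are illustrative)."""
    parser = argparse.ArgumentParser(prog="mlp")
    sub = parser.add_subparsers(dest="command", required=True)

    # mlp train --model <name> [--instance-type <type>]
    train = sub.add_parser("train", help="launch a training job")
    train.add_argument("--model", required=True)
    train.add_argument("--instance-type", default="ml.m5.xlarge")

    # mlp deploy --model <name> --version <id>
    deploy = sub.add_parser("deploy", help="deploy a trained model version")
    deploy.add_argument("--model", required=True)
    deploy.add_argument("--version", required=True)

    return parser

# Example invocation: a data scientist kicks off training from their shell.
args = build_parser().parse_args(["train", "--model", "lane-pricing"])
```

Keeping the glue in plain Python means data scientists can read, extend and debug the tooling themselves rather than treating it as a black box.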
By using tools that our engineering teams were already familiar with, we were able to leverage their expertise and also get their buy-in for collaboration.
3. Acknowledge the Translation Step and Create a Shared Vocabulary
Things that might seem obvious to one group may not be clear to the others. For example, data science terms like feature engineering, training and inference, and engineering workflow terms like pull requests, deployments, monitoring, logging and exception tracking all had to be shared between teams for better collaboration.
Even with the best tools, processes and intentions, acknowledge that going from experimentation to production takes time. Similarly, collaboration between teams takes effort. Make sure you talk about and accept this overhead during the project planning phase, and allocate enough time for it.
Fix Your Production Problems!
We have been using this framework for the past 18 months, and we’re very happy with the results. We have big plans for the data org in the years ahead and are eager to see how this process scales and where we need to iterate further.
If you’re a data scientist or a machine learning engineer feeling frustrated or stuck in a constant experimentation stage instead of taking models to production, or you’re an engineer who wants to enable model integration into production workflows, we hope you can take away some ideas from our process and the approach we adopted. Or better yet, come work with us: help develop the next-generation digital freight platform and build a more robust supply chain!