How Data Scientists and Engineers Can Work Better Together
Most tech companies agree that data is king, but many are still struggling to integrate data science teams into engineering workflows.
Some toss data scientists into development sprints without educating either team on the timing and requirements of the other’s work. Others turn a blind eye as silos build up, and data scientists and engineers lob models back and forth over a metaphorical wall.
At data storage behemoth Pure Storage, there’s no time for those antics.
Pure Storage data scientists float from project to project, becoming full-fledged production team members and staying on until the release. Periodically, data team members reconvene to share takeaways from their various projects.
“When a data scientist is working on a project, they actually integrate with the rest of the team and build it end-to-end,” Farhan Abrol said. Abrol is head of machine learning products at Pure Storage and works on Pure1 Meta, an AI engine that uses IoT data to optimize storage environments.
“So data scientists aren’t just learning what needs to be produced in the model and then going off and doing it. They actually come back and say, ‘Here’s a model,’ and then they deal with it [by helping optimize it] for the rest of development,” he said.
At customer data management company Amperity, the scrappy startup environment means data scientists get to take on multiple roles in a production process.
“We don’t have a separation between who’s doing the research and who is productionizing,” Aria Haghighi, VP of data science at Amperity, said. “So the researcher is also the engineer who is bringing the system to production. Once we feel convicted we’re going to actually put something into the product, that person will go into the normal engineering cycle and break stuff up into smaller, fine-grained tasks.”
How to integrate data science and engineering workflows:
- Democratize your data engineering. Red tape in your data pipeline will throw off data scientists and engineers alike.
- Boost the importance of data science in the early stages of product development. Data can drive innovation, but only if you make room for it.
- Distinguish between research and implementation. One process is linear — the other is not.
- Educate data scientists and engineers on the variety of ways they can improve each other’s processes. With good collaboration, the development process can become a powerful feedback loop.
Democratize your data engineering
When data scientists have easy access to — and intimate knowledge of — the data at their disposal, it’s easier to build their research into production workflows. Conversely, when a company’s data infrastructure or tooling becomes convoluted, inefficiency abounds.
Consider an example in which data engineers set up a data warehouse, but data scientists have to ask for time-consuming reconfigurations each time they want to examine new variables. In that case, data scientists don’t have sufficient tools at their disposal.
Or, perhaps each data scientist goes their own way with tooling. Before long, data engineers are supporting a dozen different data stacks for a dozen different projects, bogging down workflows across the entire team.
At a small company, Haghighi said, managers can sidestep these problems by expanding the responsibilities of data scientists to include data engineering.
At Amperity, the first task of the data scientist working on their predictive lifetime spend model — which calculates a customer’s lifetime value to the company using a variety of variables — was to create the data pipelines that gave her the best possible view of the data. That approach ensured she had access to all the signals and sources she needed. It also prepared her to help with the model’s implementation.
“As a researcher myself, I’ve had the experience of going from a very academic setting to a very practical one and thinking that a lot of things weren’t my job, only to learn that actually, they are my job,” Haghighi said. “So I try to get data scientists who appreciate that they’re responsible for both productionizing and data engineering. These are all parts of the trade.”
At a large company, however, combining data science and engineering roles is often not feasible. In that scenario, organizations should let data engineers, data scientists and production engineers collaborate on a data interface that meets everyone’s needs, Abrol said.
At Pure Storage, that meant attaching a self-service layer to the company’s data infrastructure, allowing data scientists to extract raw data with minimal involvement from data engineers. It also lets developers easily convert the resulting models to production code.
To decide what the data interface needed, data engineers, data scientists and software engineers had ongoing conversations about what tools were most conducive to the production process as a whole.
“All you’re doing is defining interfaces that help people,” Abrol said. “The way to get that to scale is to have clear definitions and ownership for the interfaces, as well as a committee that decides the interfaces that’s not one-sided. You don’t want data engineers saying, ‘Look, here’s what we’re giving you, data scientists, have fun.’”
Data engineers shouldn’t feel pressured to build the interfaces right away, Abrol said. Let the platform grow organically over time.
“No teams should start by saying, ‘Let’s build an ironclad platform with perfect support for 15 types of data science connectors,’” he said. “Always build simple first.”
That may mean building the tooling for one project at a time, keeping a sharp eye out for patterns and duplications. As the organization ships more products and features, data engineers will get a better sense of what the platform needs, and data scientists can access data more independently.
“Make that a mandate of data engineers,” Abrol said, “as opposed to measuring them by how quickly they can get a particular project off the ground.”
Boost the importance of data in the early stages of product development
Often, a product manager’s interest in data science begins and ends with user metrics — after a product ships. That can lead to some significant missed opportunities.
“I wanted to expand the scope so that data science isn’t just examining how the product is doing, but also how the data the product is working with could be utilized in production, right in the features we’re generating,” Abrol said. “Data scientists can drive innovation if you bring them in early enough.”
Building data science into each product development process — in the earliest stages, no less — ensures teams aren’t leaving valuable insights on the table.
One way data scientists can help during early design sessions is educating engineering on the data that’s already available and the insights it can generate. That way, product teams are consistently considering new feature capabilities based on existing data, instead of relying on top-down ideas.
Data scientists can also track what data is missing and pass that information along to data engineers. For instance, “We could build a feature that does [X] if only we had [Y] data.” By looking toward future data needs, the team expands its product possibilities.
Bringing data scientists into the product development process earlier does not mean engineers should sit back and let the data team set the agenda for new products, however. One of the biggest benefits of a more integrated development process is the chance to get data and engineering on the same page, especially when it comes to engineering dependencies.
If a particular data model would require significant backend retooling, for example, it’s important that engineers and data scientists notice that early.
When Amperity was productionizing its predictive lifetime spend model, for example, the team planned to deploy the machine learning model inside an existing data infrastructure and then automatically build features from there using SQL. But because of the hundreds of nontraditional data sources the model drew on, the programmatically generated SQL couldn’t keep up. They needed a lower-level interface to work with the data more granularly.
If the data science team hadn’t caught the issue early, production would have lurched off schedule.
“I often find when I’ve been unhappy with how long it takes for something to get to code delivery, it’s usually because we failed at that stage,” Haghighi said. “Spotting that dependency early allowed us to ship that a lot faster than had we spent a bunch of time figuring out it wasn’t going to be acceptable.”
Distinguish between research and implementation
Product development is linear; research is not.
That disparity presents perhaps the trickiest challenge of integrating data science and engineering teams: weaving exploratory work into heavily structured development workflows.
Segmenting research goals into production sprints is one approach, but it creates problems when research moves faster or slower than expected, when research reveals that a feature is unworkable or when data scientists continue tweaking models when doing so is no longer time- or cost-effective.
A solution to that first problem, Abrol and Haghighi said, is to separate the research phase from the implementation phase and let data scientists work on their models offline before production begins. By front-loading the research, data scientists get the opportunity to explore additional possibilities, and engineers avoid putting time into models that don’t cut it.
For example, Amperity is working on a lookalike model — popularized by Facebook, where Haghighi used to work — that helps marketers identify people similar to their existing customers. The project’s single researcher got an early start, building models that weren’t in the company’s production codebase and collaborating with clients to validate the models.
“In this research phase, it’s very easy to go build something that may not be useful,” Haghighi said. “What we find is, rather than worrying about integrating this into the engineering workflow, make sure that during this research phase, you’re really connected to the actual customers and their use cases.”
Abrol echoed that sentiment. At the beginning of a research stage at Pure Storage, data scientists sit down with product managers and designers to clearly define what value the project should create for customers.
“First, list out the targets and the ideal user scenarios with the data that you have,” he said. “You really want to ask, ‘What is [the] experience I’m trying to drive?’”
Once the team sets goals and data scientists lay out some modeling options, start prototyping immediately, Abrol said. Creating a minimum viable product sets a baseline for a given model’s accuracy and removes some uncertainty that could slow the process later on. From there, use sprints to narrow your modeling options from five to three, from three to two, and so on.
“I do try to have that structure there as much as possible: Here’s the 20 ideas we want to try, here’s my relative progress on these metrics from a couple of days ago to today,” Haghighi said. “When you actually step into engineering, you’ve already been forced to break down ideas and things you want to try into features.”
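The sprint-by-sprint narrowing described above can be pictured as a simple ranking step against the prototype's baseline. The sketch below is only an illustration: the function name, the candidate names and every score are made up, and neither company described its process in code.

```python
def narrow_candidates(scores, baseline, keep):
    """Return the names of the `keep` best candidates that beat the baseline.

    scores   -- dict mapping a candidate model name to its validation score
                (higher is better; the metric itself is an assumption)
    baseline -- score of the minimum viable prototype
    keep     -- how many candidates survive this sprint
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Drop anything that does not improve on the prototype.
    beating = {name: s for name, s in scores.items() if s > baseline}
    # Rank the survivors from best to worst and keep the top few.
    ranked = sorted(beating, key=beating.get, reverse=True)
    return ranked[:keep]


# Hypothetical sprint 1: five options narrowed to three.
sprint_1 = {"a": 0.71, "b": 0.78, "c": 0.74, "d": 0.69, "e": 0.80}
survivors = narrow_candidates(sprint_1, baseline=0.70, keep=3)

# Hypothetical sprint 2: the three survivors, re-scored, narrowed to two.
sprint_2 = {"e": 0.82, "b": 0.79, "c": 0.81}
finalists = narrow_candidates(sprint_2, baseline=0.70, keep=2)
```

Keeping the baseline in the comparison means an option that never beats the prototype is dropped even when there is room for it in the next round.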
But be careful, Haghighi warned: Prototyping early can lead product managers, technical leads — even marketers — to think models are more airtight than they actually are. The key, unsurprisingly, is clear expectation-setting.
“People might say, ‘Oh, this model has a notebook where it’s performing really well. That means it’s ready to go.’ I try to be clear that no, this really exists to remove a lot of uncertainty efficiently. There’s all the normal engineering work that still has to happen to make this a reality,” Haghighi said.
Expectation-setting is important after a model ships, as well. Data scientists may notice a model could achieve a 1 percent gain with some backend changes, but those changes likely aren’t worth the engineers’ time. Set some boundaries around changing models in production — whether that’s a benchmark for gain or A/B testing to ensure the change would contribute to a given business metric.
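A "benchmark for gain" can be as simple as an agreed threshold that a candidate model must clear before engineers spend time on it. The sketch below assumes a single higher-is-better metric; the function name, the metric values and the 2 percent threshold are hypothetical, not figures from Pure Storage or Amperity.

```python
def should_ship(production_metric, candidate_metric, min_relative_gain=0.02):
    """Decide whether a candidate model clears the agreed gain benchmark.

    production_metric -- score of the model currently in production
    candidate_metric  -- score of the proposed replacement
    min_relative_gain -- smallest relative improvement worth the engineering
                         effort (0.02 means 2 percent; an assumed value)
    """
    if production_metric <= 0:
        raise ValueError("production_metric must be positive")
    relative_gain = (candidate_metric - production_metric) / production_metric
    return relative_gain >= min_relative_gain


# A roughly 1 percent gain does not clear a 2 percent benchmark.
small_gain = should_ship(0.80, 0.808)   # False

# A 5 percent gain does.
large_gain = should_ship(0.80, 0.84)    # True
```

The threshold is a team decision rather than a technical one, so it makes sense for data scientists and engineers to agree on it before a model ships.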
Encourage data scientists and engineers to make each other better
Strong collaboration between data science and engineering teams doesn’t just make products better — it can make workflows better, as well.
Just as a product generates customer data, a development process generates data about pipeline management, failure incidents, optimal infrastructures and more. Engineers can do themselves a favor, Abrol said, by handing that information over to data scientists.
“Organize an engineering productivity project where the data team goes in and helps figure out where the pipelines are broken or how to optimize a production parameter,” Abrol said. “Big companies like Facebook have entire teams dedicated to this, but it’s a way for small teams to feel like data science is helping engineers do their jobs better.”
Similarly, engineers can apply their system monitoring skills to data science models to help data scientists catch and fix problems faster.
“In data science, debugging is really, really hard,” Haghighi said. “Optimizing how models get productionized — and the best practices for monitoring and observability — is just really underappreciated and, frankly, in its nascency. I’ve been really impressed by the work our engineers have done to create an early warning system.”
In short: By drawing on each other’s strengths, data scientists and engineers can turn a tricky joint workflow into a refined feedback loop that continually reveals ways to improve.