How Duolingo Builds Its Data Science Methodology
For business leaders and other powers that be within an organization, data science can be a mysterious, almost-magical tool: They don’t necessarily understand it, but in it they see the possibility of answering any question they have about their users, business, revenue or product.
So when a request lands on the desk of a data science team, it can sometimes betray how little its author understands about whether it is viable at all.
Data science methodology is the practice of crafting a project that answers the needs of customers and colleagues alike, and it requires a deep understanding of a request and the motivations behind it. Following a rigorously defined methodology, a well-run data science team will craft a project plan that defines and collects the data it needs, then prepares and models that data to surface the insights it contains. Finally, the team must be ready to deploy the requested tool or feature with the expectation that feedback will probably require some post-production tweaks or updates.
This, of course, is your average data scientist’s bread and butter. But what does this process actually look like in practice?
With more than 300 million users completing more than 7 billion exercises each month, language-learning platform Duolingo offers an example of data science methodology in action. Not only do the company’s enormous databases inform tweaks to Duolingo’s user experience and underlying infrastructure all the time, but the company’s data science teams conduct regular research into everything from optimizing reminder notifications to theories on how to improve teaching practices and outcomes for learners of indigenous languages.
Duolingo’s data science methodology underpins much of this work. To learn more about the nuts and bolts of how a project moves from an amorphous idea to a usable tool or valuable insight, Lead Data Scientist Erin Gustafson — one of RE•WORK’s Top 30 Women Aiding AI Advancement back in 2019 — took us through her team’s best practices.
Erin Gustafson, Lead Data Scientist at Duolingo
What are your team’s best practices when designing your data science methodology for a new project?
Our number one best practice is a project kickoff process that we’ve been honing over time. Most of our projects go through this process, which involves drafting a kickoff document and scheduling a meeting with key stakeholders to discuss the plan. We’ve found that both phases of this process add a ton of value.
At the doc phase, data scientists work with their managers and team leads to define the goals, requirements, key stakeholders, technical approach and timeline for the project. This phase forces us to do the important foundational thinking for a project so we can make sure we have the data we need — more than once, the kickoff process has helped us realize we don’t — and that the project has high ROI.
In the kickoff meeting, the data scientist talks through the plan and any areas that need further alignment with cross-functional stakeholders. The cross-functional nature of this meeting is really important because the success of a data science project is not solely determined by how well the technical approach is executed — success is also driven by the impact that the work has on the product or business more generally. Including product managers, engineers, learning scientists and others in the meeting ensures that we’re asking the right questions and plan to answer them appropriately.
This brings me to another best practice or key principle for deciding on a data science methodology: Don’t let perfect be the enemy of good. As a small data science team in a fast-moving company, we don’t often have the luxury of spending months on a single project. This means that we think iteratively about data science projects and often agree to start with a minimum viable product model that can deliver “good enough” insights or predictions, and level up the approach later once we’ve demonstrated the value of the model. This allows us to move quickly and take on more projects.
What’s an example of your methodology in action?
A recent initiative that encapsulates a lot of our best practices was a revamp of our learner forecasting methodology. For the last couple of years, we’ve relied on a methodology that gave us a fairly accurate forecast (even during COVID times) but required a ton of overhead to update and maintain. We decided to take a step back at the beginning of this year to take stock of our approach and consider alternatives.
We began by going through our typical kickoff process. This ended up being invaluable because the requirements of this project were fairly complex. We wanted to find a new methodology that would be easier to maintain, more flexible so we could add functionality as our business matures, more robust from a statistical perspective and at least as accurate as the legacy approach. What’s more, we also needed to make sure we were satisfying the growing needs of our stakeholders in marketing, finance and product. The kickoff process made sure that we were clear on what success looked like and that we had buy-in from our stakeholders about the revamp.
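One way to make a requirement like "at least as accurate as the legacy approach" concrete is a backtest that scores both forecasts against the same held-out actuals. The sketch below is only an illustration of that idea, not Duolingo's method: the function names, the choice of mean absolute percentage error (MAPE) as the metric and every number in it are assumptions made up for the example.

```python
def mape(actual, forecast):
    """Mean absolute percentage error; assumes no actual value is zero."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)


def at_least_as_accurate(actual, legacy, candidate, tolerance=0.0):
    """True if the candidate's MAPE is no worse than the legacy MAPE.

    `tolerance` is a hypothetical allowance for accepting a slightly
    worse score in exchange for easier maintenance.
    """
    return mape(actual, candidate) <= mape(actual, legacy) + tolerance


# Invented backtest data: four periods of actuals and two competing forecasts.
actual = [100.0, 110.0, 120.0, 130.0]
legacy_forecast = [90.0, 115.0, 126.0, 117.0]
candidate_forecast = [98.0, 112.0, 117.0, 133.0]

legacy_error = mape(actual, legacy_forecast)
candidate_error = mape(actual, candidate_forecast)
meets_requirement = at_least_as_accurate(actual, legacy_forecast, candidate_forecast)

print(f"legacy MAPE: {legacy_error:.4f}")
print(f"candidate MAPE: {candidate_error:.4f}")
print(f"meets accuracy requirement: {meets_requirement}")
```

Writing the success criterion down as a check like this is one way a kickoff document could pin down what "as accurate" means before any modeling starts; the metric and tolerance would be decisions for the stakeholders.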
How has your data science methodology process evolved over time?
A recent addition to our process is after-action reviews. This is a practice that our engineering organization has used in the past to reflect on lessons learned from past projects. After-action reviews often involve a cross-functional group similar to the one in our kickoff meetings, and they give us an opportunity to identify aspects of our process or technical approach that worked well, fell short or could be improved. We’ve started to incorporate this into the standard lifecycle of a data science project. For example, we recently wrapped work on an MVP model, reflected on the project as a team in an after-action review and immediately applied those learnings in a kickoff doc for the next iteration on the model. These two processes in tandem have helped us work smarter.
What are some common ways in which a faulty methodology can compromise a data science project?
A process that does not ensure a well-defined goal for the project can cause a myriad of problems. For example, not being aligned on goals could mean that the data scientist doesn’t understand the use case for the model they’re building. Success for a model looks different depending on whether you hope to draw strong inferences from it or to generate accurate predictions. Understanding how your work will be used is an important part of choosing your technical approach. A strong process for kicking off data science projects ensures that data scientists and their key stakeholders get on the same page early in the lifecycle of a project.
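The difference between inference and prediction shows up in how the same model is judged. The sketch below is an illustration under stated assumptions, not anything described by Duolingo: it fits a one-variable least-squares line to invented data, then scores it two ways. For inference, it looks at the slope and its standard error; for prediction, it looks at the error on points held out of the fit. All names and numbers are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def slope_standard_error(xs, ys, slope, intercept):
    """Standard error of the slope, used when the goal is inference."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(rss / (n - 2) / sxx)


def mean_absolute_error(xs, ys, slope, intercept):
    """Average absolute miss, used when the goal is prediction."""
    return sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)) / len(xs)


# Invented data: six points to fit on, two held out for evaluation.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
test_x = [7.0, 8.0]
test_y = [14.2, 15.8]

slope, intercept = fit_line(train_x, train_y)

# Inference question: is the effect of x clearly different from zero?
slope_se = slope_standard_error(train_x, train_y, slope, intercept)
effect_is_clear = abs(slope) > 2 * slope_se  # rough rule of thumb, not a formal test

# Prediction question: how far off are we on data the model has not seen?
holdout_mae = mean_absolute_error(test_x, test_y, slope, intercept)

print(f"slope = {slope:.3f} (standard error {slope_se:.3f})")
print(f"held-out mean absolute error = {holdout_mae:.3f}")
```

A model could do well on one of these measures and poorly on the other, which is why agreeing on the use case up front affects the choice of technical approach.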