Staff Product Manager, AI Eval Platform

Posted 3 Days Ago
Hiring Remotely in United States
Remote
230K-311K Annually
Expert/Leader
Artificial Intelligence • Cloud • Consumer Web • Productivity • Software • App development • Data Privacy
Dropbox isn’t just a workplace—it’s a living lab for more enlightened ways of working.
The Role
Lead the AI Evaluations product framework at Dropbox, focusing on metrics, model performance, and user satisfaction for AI features. Collaborate with AI, data, and research teams to define success metrics and operationalize evaluation tools.
Summary Generated by Built In
Role Description

As a Staff Product Manager within the Dash organization, you will play a crucial role in shaping how we measure and evaluate our AI-powered assistant and features. You will lead AI Evaluations (Evals): the systems, metrics, and processes that measure the quality and reliability of AI-powered features across Dropbox. In this role, you’ll define how we evaluate model performance, accuracy, and user satisfaction across diverse AI surfaces such as Dash, search, summarization, and intelligent organization. You will own a core platform that enables every product team at Dropbox to launch new AI features with confidence, equipped with the tools to measure their success both online and offline.

You’ll collaborate closely with Applied AI, Data Science, and Research to design frameworks that ensure our AI features are helpful, safe, and high-quality. This includes everything from defining success metrics for model improvements to building scalable pipelines that assess qualitative and quantitative signals.

This role sits at the intersection of AI systems, data rigor, and product judgment — ideal for a PM who loves turning ambiguity into measurable progress and ensuring that every AI interaction meets a bar of excellence.

Responsibilities
  • Define and drive the roadmap for Dropbox’s AI Evaluation Framework, covering both quantitative metrics and human-in-the-loop systems.
  • Set the strategic vision and north-star framework for how Dropbox measures AI performance, establishing unified principles for quality, correctness, relevance, and reliability across Dash and other AI features.
  • Own, end to end, the offline scoring pipelines, online instrumentation, dashboards, APIs, and LLM-as-Judge components used by all product teams.
  • Build and scale a self-serve measurement platform that enables any Dropbox team to launch features, run experiments, and measure performance with minimal friction.
  • Collaborate cross-functionally with ML, product, engineering, research, and data science to operationalize evaluation pipelines, design rubrics, and ensure metrics are valid, reproducible, and reliable.
  • Establish and maintain company-wide evaluation standards by defining rubrics and guidelines and extending scorer taxonomies, which become the foundation for AI quality measurement and benchmarking.
  • Integrate measurement systems into the product lifecycle by partnering with PMs and engineering to ensure evaluation and feedback loops are embedded from ideation through launch and iteration.
  • Communicate results, insights, and trade-offs to senior leadership, influencing product decisions and roadmap prioritization through clear storytelling backed by rigorous data.
Requirements
  • 10+ years of experience building measurement, analytics, or evaluation platforms, ideally in an ML/AI context (e.g., experimentation platforms, metrics infrastructure, evaluation pipelines), with an understanding of the end-to-end AI development lifecycle, from model training to deployment and monitoring.
  • Experience designing and deploying evaluation frameworks and pipelines, e.g., offline vs. online evaluation, metric definition and calibration, and human + model adjudication where needed.
  • Deep understanding of ML evaluation, metrics, and statistics, e.g., AUC, precision/recall, calibration, bias detection, variance, and error analysis.
  • Technical fluency and the ability to partner with engineers and data scientists. You are comfortable reasoning about pipelines, APIs, performance, scale, latency, and system trade-offs, can engage in deep technical discussions, and can translate complex technical concepts into clear product requirements.
  • Strong cross-functional collaboration skills. You will need to work with PMs, researchers, engineers, data teams, labeling teams, and senior leaders.
  • Exceptional written and verbal communication skills, with a demonstrated ability to create clear, structured product documents and effectively communicate vision, trade-offs, and progress to stakeholders at all levels, including executives.
  • A bias, fairness, and robustness mindset: experience with (or sensitivity to) designing evaluations with fairness, adversarial robustness, and edge cases in mind.
Preferred Qualifications
  • Experience developing or implementing LLM-based evaluation frameworks in a RAG (Retrieval-Augmented Generation) context, including using LLM-as-Judge for online evaluations.
  • Hands-on experience with prompt evaluation, rubric design, human-in-the-loop evaluation, and adversarial test design.
  • Familiarity with experimentation at scale, including test design and measurement, e.g., A/B testing systems, causal inference, and counterfactual measurement.
  • 5+ years of experience building self-service internal platforms, ML infrastructure, SDKs, or APIs.
  • Experience building platforms or internal tools for technical users and developers as well as non-technical audiences.
  • PhD or advanced degree in a quantitative field (CS, ML, statistics, etc.).
Compensation
US Zone 1
$229,500-$310,500 USD
US Zone 2
$206,600-$279,500 USD
US Zone 3
$183,600-$248,400 USD

Top Skills

AI
Data Science
ML

The Company
HQ: San Francisco, CA
2,500 Employees
Year Founded: 2007

What We Do

We're a global community of bold visionaries and resourceful doers who are shaping the future of Dropbox—and with it the future of work. Our Virtual First model combines the flexibility of a distributed workplace with the power of human connection, making space for both meaningful work and meaningful relationships. With our start-up mindset and enterprise-level opportunities, you can be who you are and grow into who you’re meant to be. Here, you can own your impact to make work more intuitive, joyful, and human—for you as a Dropboxer and for hundreds of millions of people worldwide. If you're ready to push boundaries—and yourself—Dropbox is ready for you.

Why Work With Us

We believe people do their best work when empowered with autonomy and harmony, and we understand there’s no substitute for human connection. Our Virtual First model combines the flexibility of remote work with the power of in-person collaboration to create the best of both worlds: a distributed workplace, anchored in community.

Dropbox Offices

Remote Workspace

Employees work remotely.

While remote work is the primary experience for our employees, we also prioritize opportunities for quarterly in-person collaboration knowing that connection is vital to a thriving workforce. We focus on how we work, not where we work.

Typical time on-site: None
HQ: San Francisco, CA
Japan
CO
Canada
Singapore
Mexico
Poland
Austin, TX
Boston, MA
Chicago, IL
Dublin, IE
United Kingdom
Los Angeles, CA
New York, NY
Seattle, WA

Similar Jobs

Dropbox

Director Social Media and Content Strategy

Artificial Intelligence • Cloud • Consumer Web • Productivity • Software • App development • Data Privacy
In-Office or Remote
Select, KY, USA
189K-256K Annually

Dropbox

Software Engineer

Artificial Intelligence • Cloud • Consumer Web • Productivity • Software • App development • Data Privacy
In-Office or Remote
Select, KY, USA
217K-293K Annually

Dropbox

Senior Data Scientist

Artificial Intelligence • Cloud • Consumer Web • Productivity • Software • App development • Data Privacy
In-Office or Remote
Select, KY, USA
168K-228K Annually

Dropbox

Product Manager

Artificial Intelligence • Cloud • Consumer Web • Productivity • Software • App development • Data Privacy
Remote
United States
230K-311K Annually
