Staff Platform Reliability Engineer

Reposted 4 Days Ago
Hiring Remotely in US
Remote or Hybrid
185K-230K Annually
Senior level
Artificial Intelligence • Machine Learning
Unleash data science, one innovation at a time.
The Role
Own and modernize Domino's Tempest scale-testing platform; build repeatable automated validation, sizing guidance, and cloud-scale test automation; partner with platform teams to enable multi-cloud scale testing and improve test reliability and reporting.
Summary Generated by Built In

Who we are

At Domino, we build software that helps the largest, AI-driven organizations build and operate advanced data science and AI solutions at scale. Our platform integrates a streamlined model development environment, MLOps capabilities, and novel features for collaboration, reuse, and reproducibility — all of which make data science teams more productive, reduce time to value, and ensure compliance. Our customers — like Johnson & Johnson, GSK, Bristol Myers, UBS, FINRA and the US Navy — are using our software to solve some of the most important challenges in the world, such as developing new medicines, securing our financial markets, or protecting our country. Backed by Sequoia Capital, Coatue Management, NVIDIA, Snowflake and other leading investors, we have been in business for a decade but are still a small team operating with the spirit of a startup. Especially in the world of AI today, we believe that the future is still being invented — and we want to be the ones building it. For more information, visit www.domino.ai

What we are building

The Automation Team at Domino acts as a force multiplier for engineering, building the tools and systems that enable teams to ship code confidently and consistently. A core part of this mission is Tempest, an in-house platform that orchestrates realistic, long-duration workloads against live Kubernetes clusters and validates the results against real observability data. Today, when scale testing surfaces a bottleneck, a resource misconfiguration, or a regression in system behavior, the team can identify and report the issue — but we need someone who can take the next step: profiling services, tracing root causes through Prometheus and New Relic data, and partnering with platform engineers to drive durable fixes. Focused on iteration and continuous improvement, the team looks for targeted enhancements that create outsized impact, and this role will close the gap between detection and resolution at the infrastructure level.

What your impact will be

In your first year, you will:

  • Serve as the technical owner of Tempest, Domino's scale and reliability platform, ensuring it remains reliable, extensible, and aligned with evolving infrastructure needs
  • Diagnose and drive resolution of performance bottlenecks and resource misconfigurations surfaced by scale testing — working directly with platform and infrastructure teams to ship fixes, not just file tickets
  • Deliver accurate, data-driven sizing recommendations for customer-facing documentation based on rigorous empirical testing across deployment sizes
  • Strengthen observability across scale testing by improving Prometheus and New Relic instrumentation, making it faster to pinpoint root causes during and after multi-day load runs
  • Establish and operationalize scale testing on cloud platforms, ensuring appropriate sizing and configuration guidance for this increasingly divergent product line
  • Partner with platform teams to enable effective scale and reliability testing across additional cloud providers, helping position Domino for future multi-cloud success
  • Increase the efficiency and leverage of a small team by building infrastructure automation that scales operationally as the product and customer base grow

What we look for in this role

  • Background in SRE, platform engineering, or infrastructure with hands-on experience operating and troubleshooting distributed systems in production Kubernetes environments
  • Strong proficiency in Python and comfort working in a large, modular codebase that spans orchestration, infrastructure automation, and systems integration
  • Experience with observability stacks (Prometheus, Grafana, New Relic, or similar) — writing queries, building dashboards, and using metrics to diagnose performance and reliability issues at the systems level
  • Demonstrated ability to go beyond detection to resolution: profiling services, identifying resource bottlenecks, and working with engineering teams to ship durable fixes
  • Familiarity with performance and load testing methodologies (e.g., Locust, k6, or similar) as part of a broader infrastructure or reliability practice
  • Clear ownership mindset — self-directed, accountable, and able to communicate priorities and status effectively in a remote, async environment

What we value

  • We value a growth mindset: high-performing, creative individuals who dig into problems and see opportunities for success
  • We believe in individuals who seek and speak the truth and can be their whole selves at work
  • We value everyone who believes improvement is always possible. At Domino, everything is a work in progress; we can do better at everything
  • We emphasize an environment of teaching and learning to equip employees with the tools needed to be successful in their function and the company
  • We strongly believe in the value of growing a diverse team and encourage people of all backgrounds, genders, ethnicities, abilities, and sexual orientations to apply

#LI-Remote

The annual US base salary range for this role is listed below. For sales roles, the range provided is the role's On Target Earnings ("OTE") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role. This salary range will be narrowed during the interview process based on a number of factors, including the candidate's experience, qualifications, and location. Additional benefits for this role may include: equity, company bonus or sales commissions/bonuses; 401(k) plan; medical, dental, and vision benefits; and wellness stipends.

Compensation Range
$185,000 - $230,000 USD

Top Skills

CI Systems
Cloud Platforms
Cloud-Native Tooling
End-To-End Frameworks
Kubernetes
Multi-Cloud
Performance/Load Testing Frameworks
Python
Tempest

The Company
HQ: San Francisco, CA
200 Employees
Year Founded: 2013

What We Do

Domino Data Lab powers model-driven businesses with its leading Enterprise AI platform trusted by over 20% of the Fortune 100. Domino accelerates the development and deployment of data science work while increasing collaboration and governance. With Domino, enterprises worldwide can develop better medicines, grow more productive crops, build better cars, and much more. Founded in 2013, Domino is backed by Coatue Management, Great Hill Partners, Highland Capital, Sequoia Capital and other leading investors. For more information, visit www.domino.ai

Why Work With Us

We’re looking for sharp, scrappy people who crave a high degree of ownership, are laser-focused on personal growth, and can stick the landing between high standards and low ego. In our fast-paced environment, you’ll find all the white space and opportunity you need to thrive.

Domino Data Lab Offices

Hybrid Workspace

Employees engage in a combination of remote and on-site work.

Typical time on-site: Flexible
HQ: San Francisco, CA
London, UK
Argentina (Remote Hub)

Similar Jobs

Domino Data Lab

Staff Software Engineer

Artificial Intelligence • Machine Learning
Easy Apply
Remote or Hybrid
US
200 Employees
200K-250K Annually

Domino Data Lab

Content Marketing Manager

Artificial Intelligence • Machine Learning
Easy Apply
Remote or Hybrid
US
200 Employees
110K-120K Annually

Domino Data Lab

Revenue Accountant

Artificial Intelligence • Machine Learning
Easy Apply
Remote or Hybrid
Washington, CA, USA
200 Employees
100K-120K Annually

Domino Data Lab

Principal Product Manager

Artificial Intelligence • Machine Learning
Easy Apply
Remote or Hybrid
US
200 Employees
250K-300K Annually
