Staff Site Reliability Engineer

Posted Yesterday
Hiring Remotely in US
Remote or Hybrid
200K-230K Annually
Senior level
Artificial Intelligence • Machine Learning
Unleash data science, one innovation at a time.
The Role
Lead development of AI-assisted reliability tooling, own incident response end-to-end, improve observability and SLO/SLI frameworks, scale single-tenant SaaS operations, mentor engineers, and reduce recurring operational toil through engineering and automation.
Summary Generated by Built In

Who we are 

At Domino, we build software that helps the largest AI-driven organizations build and operate advanced data science and AI solutions at scale. Our platform integrates a streamlined model development environment, MLOps capabilities, and novel features for collaboration, reuse, and reproducibility, all of which make data science teams more productive, reduce time to value, and ensure compliance. Our customers, including Johnson & Johnson, GSK, Bristol Myers, UBS, FINRA and the US Navy, are using our software to solve some of the most important challenges in the world, such as developing new medicines, securing our financial markets, or protecting our country. Backed by Sequoia Capital, Coatue Management, NVIDIA, Snowflake, NetApp and other leading investors, we have been in business for a decade but are still a small team operating with the spirit of a startup. Especially in the world of AI today, we believe that the future is still being invented, and we want to be the ones building it. For more information, visit www.domino.ai

What we are building

As our infrastructure and customer footprint grow, we're investing in a new kind of SRE practice where the people who respond to incidents also build the systems that make future incidents shorter, rarer, and less painful. We're developing AI-assisted tooling that helps our support and engineering teams diagnose problems faster, learn from outages more deeply, and automate away the toil that slows everyone down. This role sits at the center of that: equal parts hands-on operator, software engineer, and technical leader. If you believe that operational experience and engineering craft make each other stronger, you'll feel right at home here.

What your impact will be 

  • Lead the development of Domino's internal AI-assisted reliability tooling, including systems that analyze tickets, logs, traces, and documentation to help teams resolve outages faster with less recurring toil
  • Improve the observability coverage and signal quality for our most critical customer-facing systems, so engineers have better signals to work with throughout the development and support lifecycle
  • Own incident response end-to-end, from detection to remediation, and leave each problem space better documented, better understood, and less likely to recur
  • Guide the development of customer- and user-facing observability tools within our products
  • Define and mature SLO/SLI frameworks for priority services, turning abstract reliability goals into measurable, actionable standards
  • Scale cloud operations practices for Domino’s single-tenant SaaS offering, and work with engineering teams to improve the reliability and repeatability of customer deployments and upgrades
  • Mentor other engineers and shape how SRE is practiced at Domino, including incident response workflows, operational readiness expectations, and post-incident learning culture

What we look for in this role

  • Deep experience in Site Reliability Engineering, platform engineering, or a software engineering role with genuine, hands-on operational ownership
  • Fluency with Kubernetes, Linux, cloud platforms, and observability tooling, and the ability to use them to investigate complex, real-world production problems
  • A strong ability to perceive and close reliability gaps in technical products, tools and processes
  • Strong software engineering skills in Python or Go, with a track record of building internal tools or services that people actually rely on
  • Comfort leading technically ambiguous work and influencing direction across teams without needing direct authority to get things done
  • A history of improving reliability through engineering and automation, not just putting out fires manually
  • Strong communication skills and real experience mentoring engineers or shaping technical decision-making on your team
  • Sound judgment about AI/LLM tooling: you know where it genuinely helps in operational workflows and where it adds noise instead of signal
  • Bonus: Experience with LLM-based systems, retrieval workflows, SaaS platform operations, or building tooling for support or developer teams

What we value

  • We strongly believe in the value of growing a diverse team and encourage people of all backgrounds, genders, ethnicities, abilities, and sexual orientations to apply
  • We value a growth mindset: high-performing, creative individuals who dig into problems and see opportunities for success
  • We believe in individuals who seek the truth, speak the truth, and can be their whole selves at work.
  • We value people who believe that improvement is always possible. At Domino, everything is a work in progress; we can do better at everything.
  • We emphasize an environment of teaching and learning to equip employees with the tools they need to be successful in their function and at the company.

The annual US base salary range for this role is listed below. For sales roles, the range provided is the role's On Target Earnings ("OTE") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role. This salary range will be narrowed during the interview process based on a number of factors, including the candidate's experience, qualifications, and location. Additional benefits for this role may include: equity, company bonus or sales commissions/bonuses; 401(k) plan; medical, dental, and vision benefits; and wellness stipends.

Compensation Range
$200,000-$230,000 USD

Skills Required

  • Deep experience in Site Reliability Engineering, platform engineering, or software engineering with hands-on operational ownership
  • Fluency with Kubernetes
  • Fluency with Linux
  • Fluency with cloud platforms
  • Experience with observability tooling and analyzing logs/traces to investigate production issues
  • Strong software engineering skills in Python or Go
  • Ownership of incident response end-to-end and improving post-incident documentation and processes
  • Ability to define and mature SLO/SLI frameworks for priority services
  • History of improving reliability through engineering and automation (not only manual firefighting)
  • Strong communication skills and experience mentoring engineers or shaping technical decision-making
  • Sound judgment about AI/LLM tooling and where it helps operational workflows
  • Experience with LLM-based systems, retrieval workflows, SaaS platform operations, or building tooling for support/developer teams

Domino Data Lab Compensation & Benefits Highlights

  • Healthcare Strength: Medical, dental, and vision plans include a 100% premium-covered option for employees and families, alongside mental-health coverage and a Calm subscription.
  • Leave & Time Off Breadth: Flexible (“unlimited”) PTO, paid holidays, and remote-work support are emphasized, offering substantial flexibility in time away and work location.
  • Wellbeing & Lifestyle Benefits: Perks such as a $500 home-office stipend for new hires, a $500 annual fitness reimbursement, and commuter benefits add meaningful day-to-day support.

The Company
HQ: San Francisco, CA
200 Employees
Year Founded: 2013

What We Do

Domino Data Lab powers model-driven businesses with its leading Enterprise AI platform trusted by over 20% of the Fortune 100. Domino accelerates the development and deployment of data science work while increasing collaboration and governance. With Domino, enterprises worldwide can develop better medicines, grow more productive crops, build better cars, and much more. Founded in 2013, Domino is backed by Coatue Management, Great Hill Partners, Highland Capital, Sequoia Capital and other leading investors. For more information, visit www.domino.ai

Why Work With Us

We’re looking for sharp, scrappy people who crave a high degree of ownership, are laser-focused on personal growth, and can stick the landing between high standards and low ego. In our fast-paced environment, you’ll find all the white space and opportunity you need to thrive.

Domino Data Lab Offices

Hybrid Workspace

Employees engage in a combination of remote and on-site work.

Typical time on-site: Flexible
HQ: San Francisco, CA
London, UK
Argentina (Remote Hub)
