Staff Software Engineer, ML Infrastructure

Reposted 20 Days Ago
San Francisco, CA, USA
Hybrid
300K-430K Annually
Senior level
Artificial Intelligence • Software
The Role
Own and build distributed training and inference platforms for LLMs and multimodal models, implement production training algorithms and inference optimizations, design multi-provider routing and failover, build evaluation/experimentation infrastructure, lead cross-functional initiatives, mentor engineers, and improve latency and cost efficiency.
Summary Generated by Built In

About Decagon

Decagon is the leading conversational AI platform empowering every brand to deliver concierge customer experiences.

Our technology enables industry-defining enterprises like Avis Budget Group, Block’s Cash App and Square, Chime, Oura Health, and Hunter Douglas to deploy AI agents that power personalized, deeply satisfying interactions across voice, chat, email, SMS, and every other channel.

We’re building a future where customer experience moves beyond support tickets and hold music toward faster resolutions, richer conversations, and deeper relationships. We’re proud to be backed by world-class investors who share that vision, including a16z, Accel, Bain Capital Ventures, Coatue, and Index Ventures, along with many others.

We’re an in-office company, driven by a shared commitment to excellence and velocity. Our values — Just Get It Done, Invent What Customers Want, Winner’s Mindset, and The Polymath Principle — shape how we work and grow as a team.

About the Team

The ML Infrastructure team builds the systems that power every stage of Decagon's model lifecycle. We own the platforms for model training, the infrastructure for model evaluation and experimentation, and the routing layer that manages inference across multiple providers.

We work at the intersection of research and production: translating cutting-edge ML models into reliable, scalable systems that run in customer environments. We collaborate closely with Research, Infrastructure, and Product teams to ensure models train efficiently, serve reliably, and deliver exceptional user experiences.

The team values technical rigor, pragmatic decision-making, and building systems that others love to use.

About the Role

We're hiring a Staff ML Infrastructure Engineer to own the platforms powering Decagon's model training and inference. You'll build distributed training systems, design inference architecture across multiple providers, and create the frameworks that let our Research and Product teams ship faster.

This role is for someone who thrives on technical depth, can lead multi-quarter initiatives, and wants to shape the long-term architecture of our ML stack.


In this role, you will
  • Design and build distributed training platforms for LLM and multimodal fine-tuning and post-training at scale

  • Integrate state-of-the-art training algorithms into production pipelines

  • Own inference architecture and multi-provider routing, including failover and optimization

  • Lead initiatives to improve latency and cost efficiency across the training and serving stack

  • Build evaluation and experimentation infrastructure that enables rapid, reliable iteration

  • Drive technical direction, mentor engineers, and establish best practices for ML infrastructure


Your background looks something like this
  • 10+ years building ML infrastructure or production systems at scale

  • Deep experience with distributed training: multi-node GPU clusters, fault tolerance, and optimization

  • Strong understanding of LLM inference: latency optimization, provider tradeoffs, and serving architecture

  • Proven track record leading complex, multi-quarter technical projects


Compensation

$200K – $400K + equity
This range reflects the expected compensation for this role. Compensation within the range is determined based on experience, skills, and the scope of responsibilities, with flexibility for candidates who demonstrate exceptional impact.
In addition to base salary, we offer competitive equity. Final compensation may vary based on location within the United States.

Benefits

We proudly offer the following benefits for our full-time employees:

  • Take-what-you-need vacation policy (subject to local requirements; UK employees receive 25 days of statutory leave)

  • Medical, Dental, and Vision benefits for you and your family

  • Life Insurance and Disability Benefits

  • Retirement Plan (e.g., 401(k), pension)

  • Parental Leave

  • Fertility and family building benefits through Carrot

  • Daily lunches and snacks in the office to keep you at your best

These benefits are described in more detail in Decagon’s policies, may vary by location, and can change at any time according to applicable compensation and benefits plans.

The Company
HQ: San Francisco, California
49 Employees

What We Do

Trusted by world-class companies, Decagon is the most advanced AI platform for customer support.
