Director, SRE

Reposted 8 Days Ago
4 Locations
In-Office
Senior level
Artificial Intelligence • Software
The Role
The Director of SRE will build and lead the Site Reliability Engineering team, focusing on ensuring maximum performance of GPU infrastructure through automation, monitoring, incident management, and effective customer support.
Summary Generated by Built In
About Fluidstack

We build and operate high-performance GPU clusters so the most ambitious teams can move fast, stay focused, and scale without friction. Our clusters power top AI labs, governments, and enterprises. Our customers include Mistral, Poolside, Black Forest Labs, Meta, and more.

Our team is highly motivated and focused on providing a world-class supercomputing experience. We put our customers first in everything we do, working hard not just to win the sale but to win repeat business and customer referrals.

We hold ourselves and each other to high standards. We expect you to care deeply about the work you do, the products you build, and the experience our customers have in every interaction with us.

You must work hard, take ownership from inception to delivery, and approach every problem with an open mind and a positive attitude. We value effectiveness, competence, and a growth mindset.

About the Role

The Director, SRE will build our Site Reliability Engineering team from scratch, creating a team responsible for guaranteeing the maximum availability and performance of our GPU infrastructure.

This role involves building reliability into our Slurm and Kubernetes platforms from the ground up. You will work directly with customers every day to support workload installation, monitoring, and debugging.

Key responsibilities include implementing systems to detect and drain broken nodes across Fluidstack-operated infrastructure. You will collaborate closely with the Infrastructure team to develop provisioning and configuration automation using Infrastructure as Code and DevOps best practices.

Focus
  • Build comprehensive monitoring with active and passive health checks

  • Define SLIs and SLOs for our managed Slurm + Kubernetes clusters

  • Create actionable alerts that wake people up only when necessary

  • Write runbooks that anyone can follow at 3am

  • Implement Infrastructure as Code for all cluster deployments

  • Prepare disaster recovery plans

  • Reduce toil through aggressive automation

  • Design and implement incident management processes

  • Drive postmortems that prevent repeat failures

  • Mentor engineers on SRE principles and practices

  • Implement and improve CI/CD processes

About You
  • 5+ years of SRE experience, including exposure to architecture and design

  • You've scaled infrastructure at a fast-growing company

  • You have experience with GPU workloads and HPC environments

  • You've managed Kubernetes or Slurm clusters in production

  • You write code to solve operational problems

  • You think in systems, not individual servers

  • You've automated yourself out of repetitive tasks

  • You can debug complex distributed systems under pressure

  • You've worked directly with demanding enterprise customers

  • You measure everything and make data-driven decisions

  • You've been on-call and improved the experience for others

  • You can explain complex systems simply

Nice to haves
  • Multi-region or multi-cloud deployments

  • Contributions to open source infrastructure tools

  • Familiarity with high-throughput network topologies for storage backplanes (e.g., RoCE, RDMA, InfiniBand)

  • Excitement about working with cutting-edge AI training and inference hardware and networks

  • Experience with bare metal automation

Benefits
  • Competitive total compensation package (cash + equity)

  • Retirement or pension plan, in line with local norms

  • Health, dental, and vision insurance

  • Generous PTO policy, in line with local norms

Fluidstack is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans’ status, or any other characteristic protected by law. Fluidstack will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.

Top Skills

CI/CD
GPU
Infrastructure as Code
Kubernetes
Slurm

The Company
HQ: London
30 Employees
Year Founded: 2017

What We Do

Instantly reserve dedicated clusters of NVIDIA H200s and GB200s for any scale to supercharge your training and inference workflows.
