Site Reliability Engineer — GPU Infrastructure

Reposted 25 Days Ago
Hiring Remotely in San Francisco, CA
In-Office or Remote
Mid level
Artificial Intelligence • Generative AI
The Role
Lead GPU cluster design and operations, manage Kubernetes, implement Infrastructure-as-Code, and develop observability stacks for high-performance AI models.

We are Genmo, a research lab dedicated to building open, state-of-the-art models for video generation towards unlocking the right brain of AGI. Join us in shaping the future of AI and pushing the boundaries of what's possible in video generation.

This is a contract-to-hire position.

What You’ll Do
  • Own the design and day‑to‑day operation of GPU clusters that train and serve frontier generative models.

  • Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi‑cluster federation.

  • Define and implement Infrastructure‑as‑Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux.

  • Build CI/CD pipelines, automated testing, and rollout strategies for infra changes.

  • Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM.

  • Optimize high‑performance networking (InfiniBand/RDMA) and debug performance bottlenecks.

  • Run and continuously improve the 24×7 on‑call rotation; lead post‑incident reviews.

  • Partner with researchers and engineers, communicate crisply, and ship with a high‑ownership mindset.

Minimum Qualifications
  • BS/MS/PhD in CS, EE, or a related field.

  • 3+ years of SRE/DevOps experience in production; 2+ years managing large Kubernetes fleets.

  • Expert‑level Kubernetes experience.

  • Hands‑on experience with containerized GPU stacks (nvidia‑container‑toolkit, GPU Operator).

  • Experience with GPU schedulers such as Slurm or Kueue.

  • Proficient in Python, Bash, and IaC tools (Terraform, Helm, Ansible).

  • Track record of shipping and operating large‑scale infrastructure with high reliability and clear communication.

Nice to Have
  • Multi‑cluster / multi‑cloud (AWS, GCP, Azure, bare‑metal) production experience.

  • Familiarity with CI/CD tooling (GitHub Actions, BuildKit).

  • Prior work with distributed training, model‑serving patterns, or other ML/GPU workloads.

Machine‑learning depth is a plus—not a prerequisite. We’ll help you level up if needed.

Genmo is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Genmo, Inc. is an E-Verify company and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.

Top Skills

Ansible
Argo CD
Bash
eBPF
Flux
GitOps
Grafana
Helm
InfiniBand
Kubernetes
NVIDIA DCGM
OpenTelemetry
Prometheus
Python
RDMA
Terraform

The Company
San Francisco, CA
50 Employees

What We Do

Enabling the next billion AI video creators with Genmo
