Site Reliability Engineer - AI Infrastructure

Posted 2 Days Ago
Hiring Remotely in San Francisco, CA, USA
In-Office or Remote
Senior level
Artificial Intelligence • Cloud • Information Technology • Software
The Role
The Site Reliability Engineer will provision and manage Kubernetes clusters, build automation tools, debug customer issues, and improve infrastructure reliability.

Location: Global Remote / San Francisco · Full-Time

About Andromeda

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers.

We began with a single managed cluster — but it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that makes the world’s AI infrastructure more accessible.

Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.

Our long-term vision is to build the liquidity layer for global AI compute: a marketplace that moves the infrastructure and workloads powering AGI much as the world's financial markets move capital.

We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

What You’ll Do
  • Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers.

  • Build automation and tooling to streamline cluster deployments and integrations.

  • Debug customer issues across networking, storage, scheduling, and system layers.

  • Improve reliability and scalability of both training and inference infrastructure.

  • Design and implement monitoring, alerting, and observability for critical systems.

  • Collaborate with engineering and product teams to plan and deliver infrastructure for new services.

  • Participate in on-call and incident response, leading postmortems and reliability improvements.

What We’re Looking For

  • 5+ years of experience in SRE, DevOps, or infrastructure engineering roles.

  • Strong Linux systems and networking fundamentals.

  • Deep experience with Kubernetes and container orchestration at scale.

  • Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.).

  • Strong automation and scripting skills (Python, Go, or Bash).

  • Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.).

  • Track record of operating production systems and leading incident response.

Nice to Have
  • Exposure to ML/AI infrastructure or GPU-based systems (CUDA, Slurm, Triton, etc.).

  • Familiarity with high-performance networking (InfiniBand, NVLink) or distributed storage (VAST, Weka, Ceph).

  • Customer-facing support or consulting experience.

Why You’ll Love It Here

This is a builder’s role. You’ll have ownership and autonomy to shape how our systems run, working directly with customers and providers while building the foundation for reliable, scalable AI infrastructure.

Top Skills

Ansible
Bash
Datadog
Go
Grafana
Helm
Kubernetes
Loki
Prometheus
Python
Terraform
The Company
HQ: San Francisco, California
17 Employees
