Customer Reliability Engineer

Andromeda (andromeda.ai)

Reposted 5 Days Ago

8 Locations

In-Office or Remote

Senior level

Artificial Intelligence • Cloud • Information Technology • Software

The Role

The Customer Reliability Engineer triages and debugs customer issues on GPU clusters at the Linux and Kubernetes layer, pushes external providers to fix faults, and builds the monitoring and scripts that catch problems early.

Customer Reliability Engineer

Location: Remote/SF-Hybrid · Full-Time

About Andromeda

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers.

We began with a single managed cluster, but it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that make the world’s AI infrastructure more accessible.

Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.

Our long-term vision is to build the liquidity layer for global AI compute. We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

The Role

Our customers run large AI training and inference workloads on GPU clusters we source from providers worldwide. When a node goes dark or a job dies eight hours into a run, the Customer Reliability Engineer is who they hear from, and who gets it sorted.

The job has three parts. You triage incoming issues and debug them at the Linux and Kubernetes layer. You work provider-side to figure out whose fault something actually is and push external providers to fix it. And you build the monitoring and scripts that catch problems before a customer has to tell us.

You need to be comfortable in a Linux shell and know how Kubernetes works. You don't need GPU or HPC experience. Most people pick that up here.

What You’ll Do

Triage and fix customer issues

○ Own issues start to finish: reproduce, diagnose, fix or escalate, close the loop

○ Debug at the Linux layer: processes, networking, storage, kernel logs, resource contention, systemd, journald

○ Dig into Kubernetes problems like pods stuck pending or crash-looping, node conditions, scheduling failures, resource limits

○ Work GPU failures: driver and device-plugin issues, XID errors, thermal throttling, nodes that need cordoning or draining, jobs failing across multiple nodes

○ Escalate when you're past your depth, with the evidence already gathered

Handle incidents

○ Take part in a 24/7 on-call rotation

○ First response on alerts and customer-reported outages: assess impact, set severity, pull in the right people

○ Keep customers updated during incidents. Clear status, honest unknowns, no silence

○ Write up what happened, then turn it into a runbook, an alert, or a fix so it costs less next time

Push providers to resolution

○ Work out whether a fault is provider-side, ours, or the customer's before it gets handed anywhere

○ Open tickets with compute providers and chase them down rather than waiting

○ Track recurring provider failures and flag the patterns to the people making sourcing decisions

Build the tooling

○ Write Python or Bash to automate the checks you'd otherwise run by hand

○ Build and improve monitoring: cluster and node health checks, GPU telemetry, dashboards, alerts that fire on real problems

○ Keep runbooks and customer docs current as you go

What We’re Looking For

● Real Linux troubleshooting ability from the command line. You can work a problem through logs, processes, networking, and disk without a script to follow

● Working knowledge of Kubernetes: pods, nodes, deployments, services, scheduling, and how to investigate when one of those breaks

● Can write a script in Python or Bash to automate something repetitive

● Strong writing. You can explain a technical problem to a frustrated customer clearly and without condescension

● Good judgment under pressure. You know what to check first, when to escalate, and how to keep people informed while you're still working it out

● Willing to join a 24/7 on-call rotation

Strong Candidates May Have

● Hands-on time with NVIDIA GPUs in production: drivers, CUDA, DCGM, the Kubernetes device plugin

● Experience with high-performance networking (InfiniBand, RoCE) or NCCL

● Experience with HPC or batch schedulers like Slurm

● A previous customer-facing technical role: support engineering, TAM, solutions, professional services

● Knowledge of Prometheus, Grafana, Datadog, or similar

● Infrastructure as Code (IaC): Terraform, Ansible, or Helm

● Genuine interest in AI infrastructure and how big training jobs behave

Why You’ll Love It Here

● High-growth environment: Get in early at a company at the center of the AI infrastructure boom

● Ownership: You’ll get to build this function from the ground up

● Competitive compensation, plus meaningful equity

● Comprehensive benefits: coverage for you and your dependents, including healthcare, dental, and vision, plus 401(k) and unlimited PTO

Andromeda Cluster is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Skills Required

5+ years of experience in SRE, DevOps, or infrastructure engineering roles
Strong Linux systems and networking fundamentals
Deep experience with Kubernetes and container orchestration at scale
Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.)
Strong automation and scripting skills (Python, Go, or Bash)
Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.)
Track record of operating production systems and leading incident response

The Company

HQ: San Francisco, California

17 Employees
