At HUD, we’re building the future of how companies and individuals train and evaluate AI. We believe that in the near future, most post-training data used to align and improve LLMs will flow through HUD.
We build a platform and developer tools that let teams create post-training data through RL environments and run reinforcement fine-tuning (RFT) reliably, reproducibly, and at scale.
We’re trusted by foundation labs, Fortune 500s, and fast-growing startups.
We’re also a high-caliber team: former founders, published ML researchers, Olympiad medalists, and engineers who have built products with real adoption. We run lean, move fast, and hold an extremely high bar.
The Role
We run a platform + SDK/dev tools for creating RL environments/post-training data and running reinforcement fine-tuning at scale. A key part of that experience is our infra and developer sandboxes: fast, reliable, observable, Dockerized compute environments with massive parallelization.
We’re looking for an infrastructure owner who is obsessed with performance and reliability—someone who treats shaving seconds off sandbox lifecycle and runtime performance as a sport.
You’ll own DevOps, infrastructure, and architecture decisions as we hit our next order of scale.
You are an infrastructure owner, not a dashboard watcher
You don’t wait for tickets—you proactively find bottlenecks, measure them, fix them, and prove the gains. You ship improvements that compound.
You care about tail latencies and failure modes
You think in SLOs, load patterns, saturation curves, and blast radius. You design for the real world: retries, backpressure, partial failures, and noisy neighbors.
You love performance
You enjoy turning “slow and expensive” into “fast and efficient.” You benchmark, profile, tune, and iterate.
You can operate autonomously
You are comfortable making high-stakes engineering decisions with good judgment, and communicating tradeoffs clearly to the team.
You'll own and evolve HUD’s infrastructure so it is:
Extremely performant (fast sandbox provisioning, fast cold starts, low tail latency, high throughput)
Extremely reliable (predictable behavior, graceful failure, robust scaling, low operational risk)
Operationally excellent (systems that scale, clear SLOs, deep observability, incident readiness, cost discipline)
Secure and compliant (SOC 2-aligned practices, strong security posture by default)
Developer sandbox infrastructure
Own our AWS + EKS-based sandbox platform that runs Dockerized workloads for customers and internal teams.
Optimize sandbox lifecycle end-to-end: provisioning, scheduling, image pulls, startup, execution, teardown, and caching.
Design for massive parallelism while maintaining reliability, fairness, and predictable performance.
Kubernetes + AWS excellence
Evolve our cluster architecture: node groups, autoscaling strategies, spot/on-demand mixes, scheduling policies, and workload isolation.
Build safe-by-default patterns: quotas, resource limits, network policies, pod security, secrets management, and guardrails.
Improve cluster resiliency and operational ergonomics (upgrades, rollouts, disaster recovery, fail-safes).
Cross-stack DevOps ownership
Address infrastructure bottlenecks as we scale.
Improve developer experience for internal teams: safer deploys, better CI/CD, smoother local/dev workflows, faster iteration.
Provide architectural input and raise the infra maturity of the team via docs, patterns, and coaching.
Interface with our backend/workers (Railway), frontend (Vercel/Next.js), and data (Supabase/Postgres) to ensure the whole system is cohesive.
Performance engineering and ruthless measurement
Establish “infra product metrics” and instrument everything: P50/P95/P99 sandbox startup times, queue times, job success rates, noisy-neighbor rates, image pull latencies, cluster saturation, and cost-per-run.
Build benchmarking harnesses for sandboxes and workloads to track regressions and validate improvements.
Treat efficiency as a first-class metric: optimize utilization without sacrificing latency or reliability.
Observability + incident readiness
Implement gold-standard observability across logs/metrics/traces with actionable dashboards and alerting tied to SLOs.
Create runbooks, incident processes, and a postmortem culture that meaningfully improve the system after every incident.
Deep AWS experience, including operating production systems at scale (networking, IAM, compute, storage, observability, cost).
Strong Kubernetes/EKS experience: cluster design, workload isolation, autoscaling (cluster + pod), upgrades, reliability practices.
Excellent Docker + container runtime knowledge: image optimization, build pipelines, caching strategies, and runtime security considerations.
Systems-level competence: Linux fundamentals, networking, performance debugging, resource contention, concurrency basics.
Infrastructure automation: strong ability to implement infrastructure as code (Terraform/CDK/CloudFormation) and repeatable environments.
Observability expertise: metrics/logging/tracing design, SLOs/SLIs, alerting that avoids noise and catches real issues.
Security + compliance mindset: experience working in SOC 2-aligned environments; ability to implement least privilege, auditability, and operational controls.
Strong engineering communication: can write clear docs, propose designs, and upskill the team.
Experience building ephemeral compute / sandbox / job execution platforms (multi-tenant, Dockerized workloads, queueing, isolation).
Proven wins reducing cold start / startup time and improving P95/P99 latency for infra-critical paths.
Deep familiarity with:
Karpenter / Cluster Autoscaler, HPA/VPA, pod scheduling strategies, priority classes, taints/tolerations, topology spread constraints
Container performance: image layering, registry optimization, pull-through caches, snapshotters, prewarming strategies
Service mesh / networking (where appropriate), network policies, ingress design, egress controls
Experience migrating from mixed hosting providers into a more cohesive platform architecture.
Experience with CI/CD at high velocity (safe deploys, progressive delivery, canaries, rollbacks).
Experience with GPU infrastructure and orchestration (if applicable to workloads).
Security depth beyond basics: threat modeling, hardening, secure supply chain for containers, audit-readiness workflows.
Ability to contribute across the stack:
Python (our SDK and backend systems) and Next.js/TypeScript, enough to collaborate effectively with other engineers.
Strong fluency with AI coding tools (using them to accelerate debugging, automation, and implementation without sacrificing correctness).
Sandbox startup times drop dramatically and stay low as load increases.
Reliability improves: fewer failed runs, better isolation, clearer error modes, faster recovery.
Costs become intentional and explainable, with clear cost-per-run and utilization targets.
Internal teams feel the difference: faster iteration, fewer footguns, smoother deployments.
The organization gains durable infra patterns, not just one-off fixes.
Hard problems with real impact: your work directly shapes the product experience for cutting-edge AI teams.
High-caliber peers: teammates who value clarity, rigor, and craft.
Meaningful ownership: you’ll own critical infrastructure and set the standard for how we operate at scale.
Locations: San Francisco / Singapore
Type: Full-time, In-Person
Visa/Relocation: Available for strong candidates (US/Singapore)
Compensation: $160,000-$280,000 salary, meaningful equity, full healthcare, daily team meals.
What We Do
The all-in-one platform for evaluating computer-use and browser-use AI agents.