Platform Engineer

Posted 2 Days Ago
San Francisco, CA, USA
In-Office
Senior level
Artificial Intelligence • Information Technology • Software
The Role
Own reliability, scale, performance, and developer experience for HUD's core infrastructure. Build and maintain AWS/Terraform/Kubernetes-based systems, CI/CD, observability, and deployment workflows. Drive incident response, cost optimization, autoscaling, and backend/design improvements; write automation and internal tools to improve reliability and developer productivity.
Summary Generated by Built In
About HUD

HUD is building infrastructure to create RL training data and evals for frontier AI agents, along with the HUD marketplace for selling them to frontier labs. Our platform is used by frontier labs, Fortune 500 companies, and startups. We’ve raised $16M from top VCs and were YC W25.

About the role

We’re looking for a platform engineer who can own the reliability, scale, performance, and developer experience of HUD’s core infrastructure and backend systems.

This is not a pure infrastructure role. The right person has strong production infra experience, but also thinks like a backend engineer: they can reason about service architecture, queues, databases, APIs, deployment safety, performance bottlenecks, and how product requirements translate into resilient systems. You’ll work across AWS, Kubernetes, Terraform, CI/CD, observability, and backend services to make HUD faster, more reliable, cheaper to run, and easier for engineers to build on.

Responsibilities
  • Own production uptime, latency, provisioning speed, infrastructure cost, and incident response for core platform services

  • Build and maintain AWS infrastructure with Terraform, Kubernetes/EKS, Helm, Docker, EC2, CodeBuild, ECR, S3, IAM, networking, and secrets management

  • Design and improve backend and platform systems for scale, including capacity planning, autoscaling, queueing, backpressure, cleanup jobs, retries, and rollback paths

  • Define and improve dashboards, alerts, logs, traces, SLOs, runbooks, and on-call workflows so failures are detected, debugged, and resolved quickly

  • Build reliable CI/CD, release automation, environment management, and deployment workflows that improve developer productivity and reduce production risk

  • Write clean, maintainable code where needed to automate systems, improve backend services, and create internal tooling

Experience

You may be a good fit if you:

  • Have owned production cloud infrastructure for a high-availability, user-facing platform, with responsibility for uptime, performance, deployment safety, and cost

  • Have deep experience with AWS infrastructure and containerized systems; experience with tools like Terraform, Kubernetes/EKS, Docker, EC2, CodeBuild, ECR, S3, IAM, load balancers, networking, and secrets management is strongly preferred

  • Have built or operated CI/CD, environment management, release automation, observability, alerting, and incident response systems

  • Have strong backend engineering judgment and can reason about service architecture, APIs, databases, async systems, queues, scaling limits, and production failure modes

  • Can write clean, maintainable code and apply strong software engineering judgment across product architecture, infrastructure, backend systems, and developer workflows

Strong candidates may also have:

  • Experience operating infrastructure for data-heavy, ML/AI, workflow, marketplace, developer-tools, or enterprise platforms

  • Experience designing systems for bursty workloads, long-running jobs, sandboxed execution, distributed workers, or high-concurrency services

  • Experience reducing cloud spend through better architecture, autoscaling, workload placement, caching, cleanup systems, or observability

  • Experience building internal platforms or tools that make engineers faster without hiding too much complexity

We prioritize technical aptitude, ownership, and learning potential over years of experience.

Team & company details
  • Team Size: ~15 people currently, mostly full-time in-person, but some remote.

  • Our team: Our team includes 4 International Olympiad medalists (IOI, ILO, IPhO), serial AI startup founders, and researchers with publications at ICLR, NeurIPS, etc.

  • Company stage: We have 8 figures in funding and high revenue growth. We’re scaling profitably and quickly to meet very strong demand.

Logistics
  • Employment: Full-time.

  • Location: On-site in the San Francisco Bay Area.

  • Visa Sponsorship: We provide relocation and visa support for strong full-time candidates moving to the US.

  • Timeline: Applications are rolling. The process is 2 technical interviews and a 1-week work trial.

What we offer
  • Competitive compensation based on experience and location

  • 100% covered top-of-the-line medical, dental, and vision from Blue Shield of CA

  • Lunch and dinner when you’re in the office

  • Company-wide holiday break (Christmas Eve to New Year’s Day) on top of PTO and paid holidays

  • Other perks including an Equinox membership, 401k, and commuter benefits

  • Unlimited* access to tokens for ChatGPT, Claude Code, Cursor, etc. *By unlimited, we mean no one on our token usage leaderboard has ever hit a limit. So we have no idea what the limit is.

Due to high volume, we may not actively respond to every application, but feel free to contact us at [email protected] or elsewhere if we missed your application!

Skills Required

  • Owned production cloud infrastructure for high-availability, user-facing platforms with responsibility for uptime, performance, deployment safety, and cost
  • Deep experience with AWS infrastructure and containerized systems (Terraform, Kubernetes/EKS, Docker, EC2, CodeBuild, ECR, S3, IAM, load balancers, networking, secrets management)
  • Built or operated CI/CD, environment management, release automation, observability, alerting, and incident response systems
  • Strong backend engineering judgment, with the ability to reason about service architecture, APIs, databases, async systems, queues, scaling limits, and production failure modes
  • Ability to write clean, maintainable code to automate systems, improve backend services, and create internal tooling
  • Experience operating infrastructure for data-heavy, ML/AI, workflow, marketplace, developer-tools, or enterprise platforms
  • Experience designing systems for bursty workloads, long-running jobs, sandboxed execution, distributed workers, or high-concurrency services
  • Experience reducing cloud spend through architecture, autoscaling, workload placement, caching, cleanup systems, or observability
  • Experience building internal platforms or tools to improve engineers' productivity

The Company
HQ: San Francisco, CA
10 Employees
Year Founded: 2025

What We Do

The all-in-one platform for evaluations on computer use and browser use AI agents.
