Machine Learning Infrastructure Engineer

Posted 2 Days Ago
San Jose, CA, USA
In-Office
140K-165K Annually
Mid level
Big Data • Information Technology
The Role
Build and operate ML infrastructure for LLMs and agent systems: model gateways, routing, serving, telemetry, observability, evaluation, cost-aware routing, dashboards, and APIs/SDKs to enable reliable production AI at scale.

Astera Labs (NASDAQ: ALAB) provides rack-scale AI infrastructure through purpose-built connectivity solutions. By collaborating with hyperscalers and ecosystem partners, Astera Labs enables organizations to unlock the full potential of modern AI. Astera Labs’ Intelligent Connectivity Platform integrates CXL®, Ethernet, NVLink, PCIe®, and UALink™ semiconductor-based technologies with the company’s COSMOS software suite to unify diverse components into cohesive, flexible systems that deliver end-to-end scale-up and scale-out connectivity. The company’s custom connectivity solutions business complements its standards-based portfolio, enabling customers to deploy tailored architectures that meet their unique infrastructure requirements. Discover more at www.asteralabs.com.



Location: San Jose, CA
Experience: 1–5 years
Team: Applied AI

The role

We’re hiring a Machine Learning Infrastructure Engineer to build the runtime, platform, and operational backbone for modern AI systems. This role is for someone who wants to work on the systems behind the systems: model access layers, routing, serving paths, telemetry, observability, evaluation infrastructure, and the controls needed to make fast-moving AI work reliable in practice.


This is a platform role, but not in the old sense. The work is tightly coupled to how modern AI systems are actually built and used: multiple model providers, agent runtimes, skill and tool layers, inference telemetry, cost-aware routing, AI spend visibility, and governance that is strong enough for real internal adoption.


What you’ll do
  • Build and improve internal AI infrastructure for LLM applications, agents, retrieval systems, and model-backed engineering workflows.
  • Own inference deployment paths across managed and self-serve environments, including access control, monitoring, and operational reliability.
  • Build platform layers such as model gateways, routing, runtime integrations, telemetry, and controls for safe execution at scale.
  • Develop AI Ops capabilities across evaluation, release readiness, observability, incident triage, regression detection, and cost monitoring.
  • Build dashboards, tracing, logging, and alerting for production AI systems, including spend and usage visibility across tools and teams.
  • Improve performance and unit economics through routing, caching, batching, failover, and latency/cost optimization.
  • Create reusable APIs, SDKs, and platform abstractions that make AI systems easier to deploy, evaluate, govern, and operate.

What we’re looking for
  • 1–5 years of experience in software engineering, ML infrastructure, MLOps, platform engineering, or related backend/infrastructure roles.
  • Strong Python skills and strong systems instincts.
  • Experience with AWS or GCP and real production service ownership.
  • Familiarity with inference deployments, model APIs, gateways, serving systems, or runtime infrastructure for LLM/ML workloads.
  • Experience with observability, telemetry, reliability engineering, and incident response.
  • Understanding of eval systems, release workflows, retrieval-backed systems, and debugging non-deterministic AI behavior.
  • Ability to translate messy platform needs into scalable internal infrastructure.

What strong candidates often look like

They have built or operated systems where latency, routing, cost, telemetry, and reliability actually matter. They understand that modern AI infrastructure is not just about getting a model endpoint running. It is about building the runtime, visibility, controls, and developer experience that let an applied AI team move fast without losing quality or trust.


Why this role is interesting

The team is building AI-ready infrastructure in the most literal sense: observability, access control, AI spend tracking, secure managed platforms, skill/tool infrastructure, and telemetry that spans requests, tools, models, and outcomes. If you want to work on the platform layer that makes modern agentic systems possible — and do it in a setting where the downstream users are serious engineers with high expectations — this is that role.


The base pay range for this role is $140,000 to $165,000.

We know that creativity and innovation happen more often when teams include diverse ideas, backgrounds, and experiences, and we actively encourage everyone with relevant experience to apply, including people of color, LGBTQ+ and non-binary people, veterans, parents, and individuals with disabilities.


The Company
HQ: Santa Clara, CA
148 Employees
Year Founded: 2017

What We Do

Astera Labs Inc., a fabless semiconductor company headquartered in the heart of California’s Silicon Valley, is a leader in purpose-built connectivity solutions for data-centric systems throughout the data center. Partnering with leading processor vendors, cloud service providers, seasoned investors, and world-class manufacturing companies, Astera Labs is helping customers remove performance bottlenecks in data-intensive systems that are limiting the true potential of applications such as artificial intelligence and machine learning. The company’s product portfolio includes system-aware semiconductor integrated circuits, boards, and services to enable robust CXL, PCIe, and Ethernet connectivity.
