ML Systems Engineer

Posted 24 Days Ago
Menlo Park, CA, USA
In-Office
300K-400K Annually
Expert/Leader
Artificial Intelligence • Hardware • Information Technology • Robotics
From bits to atoms.
The Role
The ML Systems Engineer will design and manage efficient training and inference systems, optimize hardware utilization, and collaborate with researchers on RL loop integration, enhancing scientific discovery.
Summary Generated by Built In
About Periodic Labs

The most important scientific discoveries of our time won't happen in a traditional lab. We're an AI and physical sciences company building state-of-the-art models to accelerate breakthroughs across materials, energy, and beyond. Backed by world-class investors and growing rapidly, we operate at the pace the frontier requires. Our team brings deep expertise, genuine ownership, and an insatiable drive to push the boundaries of what's scientifically possible.

About the Role

You will own the systems layer that makes our frontier model training and inference fast, efficient, and tightly coupled to the RL feedback loop that drives scientific discovery.

This is neither a pure infrastructure role nor a pure research role; it sits exactly at their intersection. You will go deep into the stack (scheduling, kernels, RDMA, weight synchronization, and communication primitives) while working shoulder-to-shoulder with researchers to co-design algorithms and infrastructure.

The RL loop is central to how Periodic Labs works: models propose experiments, experiments generate data, and data feeds back into training. The speed and reliability of that loop directly multiply the pace of scientific discovery. You will own the infrastructure that makes it fast.

What You'll Do
  • Build rack- and topology-aware scheduling for GB-series GPUs across Ray, Slurm, and Kubernetes, minimizing latency and maximizing utilization across heterogeneous cluster configurations

  • Build online and offline profilers that surface bottlenecks across the training and inference stack and translate findings into actionable optimizations

  • Implement direct S3 checkpoint streaming to eliminate I/O bottlenecks in large-scale training runs

  • Run methodical benchmarking to identify optimal RL training configurations across model sizes, batch strategies, and hardware topologies

  • Write and optimize communication and GPU kernels to extract maximum throughput from the hardware

  • Design and implement zero-copy RDMA weight synchronization between training and inference to keep the RL loop tight and low-latency

  • Build fast sandbox execution environments that allow rapid rollout of model-generated actions and return of rewards without blocking the training pipeline

  • Engage directly with the SGLang, Megatron, and Ray communities — contributing upstream, influencing roadmaps, and pulling in improvements that benefit Periodic Labs’ workloads

  • Work in close collaboration with RL and pretraining researchers to co-design algorithms and infrastructure: you will shape what is possible at the research level by knowing what is achievable at the systems level, and vice versa

The net result: high-throughput, fault-tolerant training and inference systems tightly coupled with a low-latency RL feedback loop that accelerates scientific discovery at every turn.

You Might Thrive in This Role if You Have Experience With
  • Large-scale inference infrastructure: load balancing, traffic shifting, scheduling, and serving architecture at production scale

  • Low-level systems programming: RDMA, NVLink, kernel-level work, and network stack optimization

  • GPU cluster scheduling and orchestration across Ray, Slurm, or Kubernetes, with awareness of rack topology and hardware locality

  • Writing and optimizing CUDA kernels, communication primitives, or distributed training collective operations

  • Profiling and benchmarking distributed ML systems to identify and eliminate bottlenecks across compute, memory, and network

  • Checkpoint management and streaming at scale, including direct cloud storage integration

  • Building or contributing to open source ML infrastructure projects (e.g., SGLang, Megatron-LM, vLLM, Ray)

  • Working directly with ML researchers on algorithm-infrastructure co-design — you understand the research well enough to make systems decisions that serve it

Mechanics

Minimum education: Bachelor’s degree or an equivalent combination of education and training or experience

Location: Our lab is located in Menlo Park, and we prefer folks to be located in Menlo Park or San Francisco, but we can be flexible based on the role.

Compensation: The annual compensation range for this role is $300,000-$400,000

Visa sponsorship: Yes, we sponsor visas and will do everything we can to assist in this process with our legal support.

We’re building a team of the world’s best — the scientists, engineers, and problem-solvers who don’t just follow the frontier, they define it. If you’re driven to bring AI to life in the physical world and make discoveries that have never been made before, you belong here.

Skills Required

  • Experience with large-scale inference infrastructure and production-level serving architecture
  • Expertise in low-level systems programming and optimization
  • Proficiency in GPU cluster scheduling and orchestration
  • Ability to write and optimize CUDA kernels
  • Experience in profiling and benchmarking distributed ML systems
  • Familiarity with checkpoint management and cloud storage integration
  • Experience contributing to open source ML infrastructure projects
  • Experience in ML algorithm-infrastructure co-design

The Company
32 Employees
Year Founded: 2025

What We Do

We're building AI scientists and the autonomous laboratories for them to operate.
