Periodic Labs

ML Systems Engineer

Menlo Park, CA, USA

In-Office

300K-400K Annually

Expert/Leader

Artificial Intelligence • Hardware • Information Technology • Robotics

From bits to atoms.

The Role

The ML Systems Engineer will design and manage efficient training and inference systems, optimize hardware utilization, and collaborate with researchers on RL loop integration, enhancing scientific discovery.

Summary Generated by Built In

About Periodic Labs

We're an AI and physical sciences company building state-of-the-art models to accelerate breakthroughs across materials, energy, and beyond. Backed by world-class investors and growing rapidly, we operate at the pace the frontier requires. Our team brings deep expertise, genuine ownership, and an insatiable drive to push the boundaries of what's scientifically possible.

About the Role

You’ll work alongside some of the world’s leading ML systems engineers, including leaders behind Megatron-LM, SGLang, Liger Kernel, TorchRec, CleanRL, TorchRL, and JAX-MD.

We’re looking for exceptional ML Systems Engineers to build the agentic infrastructure powering our large-scale training, inference, and reinforcement learning. You’ll own critical pieces of the ML systems stack to maximize performance, scalability, reliability, and productivity for both engineers and AI agents.

What You'll Do

Build and optimize large-scale training and reinforcement learning infrastructure while ensuring its correctness
Develop high-performance inference and serving systems
Design distributed runtimes and scheduling systems for complex ML workloads
Build secure, large-scale sandboxing and execution environments
Optimize memory, GPU kernels, and communication for maximum throughput and end-to-end efficiency
Improve scalability, reliability, and efficiency across the ML systems stack

What We're Looking For

Strong systems programming and performance engineering skills
Experience building high-performance ML infrastructure at scale
Ability to own complex technical problems end-to-end
Strong coding ability and engineering judgment, including the ability to work effectively with AI agents to design, implement, test, and debug complex systems
High ownership, fast execution, and a passion for pushing the frontier of AI systems and accelerating scientific discovery

You should have deep expertise in at least one of the following:

Training: Strong experience building, debugging, and optimizing large-scale training systems with Megatron-LM. Familiarity with TorchTitan, FSDP, veRL, Slime, or other distributed training systems is a plus.
Distributed Runtime: Strong experience with Ray. Familiarity with Monarch or other distributed execution frameworks is a plus.
Inference: Strong experience with SGLang. Familiarity with vLLM, TensorRT-LLM, or production LLM serving systems is a plus.
Sandboxing: Strong experience with secure execution environments, containers, virtualization, or code sandboxing.
GPU Kernels: Strong experience with CUDA, Triton, CUTLASS, CuTe, or custom GPU kernel development.
GPU Communication: Strong experience with NCCL, NVLink, InfiniBand, RDMA, GPUDirect RDMA, or large-scale communication optimization.

Mechanics

Minimum education: Bachelor’s degree or similar experience

Location: Menlo Park, CA (Soon: San Francisco, too)

Compensation: $250,000-$350,000 base + equity

Visa sponsorship: Yes, we sponsor visas and will do everything we can to assist in this process with our legal support.

We’re building a team of the world’s best — the scientists, engineers, and problem-solvers who don’t just follow the frontier, they define it. If you’re driven to bring AI to life in the physical world and make discoveries that have never been made before, you belong here.

Skills Required

Experience with large-scale inference infrastructure and production-level serving architecture
Expertise in low-level systems programming and optimization
Proficiency in GPU cluster scheduling and orchestration
Ability to write and optimize CUDA kernels
Experience in profiling and benchmarking distributed ML systems
Familiarity with checkpoint management and cloud storage integration
Experience contributing to open source ML infrastructure projects
Experience in ML algorithm-infrastructure co-design
