Rhoda AI

Research Scientist / Engineer - Training Systems

Mountain View, CA, USA

In-Office

Expert/Leader

Artificial Intelligence • Computer Vision • Hardware • Robotics

The Role

Lead the design and implementation of large-scale multimodal training systems. Diagnose and optimize compute, communication, and memory bottlenecks across thousands of GPUs; define parallelism strategies; build observability and regression detection tools; and collaborate with researchers and infrastructure teams to improve distributed efficiency and scaling for robotics world models.

At Rhoda AI, we’re building the next generation of generalist intelligent robots. We own the full robotics stack, from high-performance hardware and robot systems to the infrastructure and state-of-the-art foundation world models that control our robots. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling long-tail edge cases, made possible by our cutting-edge research and end-to-end system design. We've raised over $400M and are investing aggressively in model research, infrastructure, hardware development, and manufacturing scale-up to make generalist robotics a reality.

We're looking for a Staff / Principal ML Training Systems Engineer to own training systems performance end-to-end. You will define how our models train at scale — driving efficiency, scalability, and correctness across large-scale multimodal training. This is a core systems role, not infrastructure support. Your work directly determines how efficiently we use compute, how well models scale across thousands of GPUs, and how quickly research can iterate.

What You'll Do

Own training performance end-to-end

Diagnose and improve performance of large-scale multimodal training (vision, video, proprioception, actions, language)
Build systematic performance attribution: step-time decomposition (compute vs communication vs input pipeline), scaling curves across cluster sizes, and bottleneck identification and prioritization
Drive measurable gains in:
- Distributed efficiency (comm/compute overlap, bucketization, topology-aware mapping, parallelism strategies)
- Compute efficiency (kernel hotspots, operator fusion, attention optimization, framework/runtime overhead)
- Memory efficiency (activation checkpointing, sequence packing/bucketing, fragmentation reduction)

Design training systems (not just tune them)

Define and evolve parallelism strategies: data / tensor / pipeline / sharding / hybrid approaches
Improve execution efficiency through communication scheduling and overlap, graph capture and execution optimization, and runtime-level improvements
Contribute to and extend training frameworks where needed

Make performance observable and measurable

Establish source-of-truth performance metrics: step-time breakdowns, MFU / throughput / scaling efficiency
Build tools to identify bottlenecks quickly, track performance across model families, and compare scaling behavior across configurations
Develop regression detection: microbenchmarks, performance baselines, and automated detection of efficiency regressions

Partner deeply with researchers

Work side-by-side with research scientists and research engineers — no silos
Translate model innovations into scalable, efficient implementations
Advise on training tradeoffs for robotics world models: long-horizon sequences, rollout/evaluation cadence, multimodal and variable-length data

Collaborate on cluster-level efficiency

Work with infrastructure/SRE teams to improve utilization across large distributed jobs, understand how network and collective performance affect training, and refine topology-aware job placement and scaling behavior

What We're Looking For

Proven track record improving large-scale distributed training performance
Deep hands-on experience with modern ML stacks (PyTorch required; JAX a plus)
Strong understanding of data / tensor / pipeline parallelism, sharded training (FSDP / ZeRO-style), communication patterns and overlap strategies, and scaling behavior across large GPU clusters
Strong systems intuition — ability to reason across compute, communication, and memory bottlenecks
Exceptional debugging and measurement ability: turn "training is slow" into clear bottlenecks, experiments, and validated improvements
High ownership mindset and comfort in a fast-moving environment

Nice to Have (But Not Required)

GPU kernel or compiler-level experience (CUDA, Triton, graph capture, operator fusion)
Experience with multimodal or video training (variable-length sequences, packing/bucketing)
Experience working on large-scale training frameworks or distributed runtimes
Familiarity with cluster topology, networking, and large-scale scheduling effects

Why This Role

Direct leverage on research velocity — every efficiency gain you make accelerates model iteration across the entire research team
Own the scalability and performance of large-scale multimodal training for real-world embodied intelligence, not static benchmarks
Improvements you make compound across every training run the company executes — high ownership, high impact, small elite team

The Company

73 Employees

Year Founded: 2024

What We Do

Rhoda AI builds robot foundation models that learn from internet-scale video to enable manipulation-capable robots to generalize in real-world industrial environments. Using a Direct Video Action architecture and its FutureVision intelligence layer, Rhoda focuses on turnkey deployments in manufacturing, logistics, and e-commerce—aiming to move robots out of controlled labs and into reliable, adaptive production settings.