Research Engineer - Training Platform

Posted Yesterday
Mountain View, CA, USA
In-Office
Mid level
Artificial Intelligence • Computer Vision • Hardware • Robotics
The Role
Build and maintain large-scale training orchestration and experiment management systems for distributed GPU model training. Implement observability, scheduling, artifact management, and tooling to optimize research iteration, cluster utilization, and reliability while collaborating with research and infra teams.
Summary Generated by Built In

At Rhoda AI, we’re building the next generation of generalist intelligent robots. We own the full robotics stack, from high-performance hardware and robot systems to the infrastructure and state-of-the-art foundation world models that control our robots. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling long-tail edge cases, made possible by our cutting-edge research and end-to-end system design. We've raised over $400M and are investing aggressively in model research, infrastructure, hardware development, and manufacturing scale-up to make generalist robotics a reality.

We're looking for a Research Engineer to build and maintain the training platform that powers our model development — experiment orchestration, job management, observability, and the tooling that lets researchers move from idea to result as fast as possible.

What You'll Do

  • Build and maintain training orchestration systems for large-scale distributed model training across GPU clusters

  • Develop experiment management tooling: job configuration, tracking, reproducibility, and artifact management

  • Build observability infrastructure for training runs: loss curves, compute utilization, gradient statistics, and anomaly detection

  • Optimize and automate the research iteration loop from experiment launch to results analysis

  • Manage job scheduling and cluster utilization for efficient use of GPU compute

  • Build internal tooling and interfaces that help researchers move faster

  • Collaborate with training systems, data infrastructure, and research teams to support their platform needs

What We're Looking For

  • Strong software engineering skills with experience in MLOps or ML platform engineering

  • Familiarity with distributed training frameworks (PyTorch DDP, FSDP, DeepSpeed, Megatron, or similar)

  • Experience building experiment tracking, reproducibility, and artifact management systems

  • Comfortable managing and operating GPU cluster environments (Slurm, Kubernetes, or similar)

  • Strong reliability engineering instincts: monitoring, alerting, and failure recovery

Nice to Have (But Not Required)

  • Experience with training orchestration tools (Slurm, Ray, Kubernetes, or similar schedulers)

  • Familiarity with experiment tracking tools (Weights & Biases, MLflow, or custom solutions)

  • Experience supporting large model training pipelines (LLMs, VLMs, or video models)

  • Understanding of parallelism strategies and how they affect training efficiency and debugging

  • Experience with cloud-based training infrastructure (AWS, GCP, or Azure)

Why This Role

  • Your platform is the daily tool every researcher and engineer uses to train models

  • Improvements to training velocity and reliability compound across every experiment the team runs

  • High visibility with direct feedback from researchers and ML engineers

  • Build systems that scale from today's models to future frontier training runs

The Company
73 Employees
Year Founded: 2024

What We Do

Rhoda AI builds robot foundation models that learn from internet-scale video to enable manipulation-capable robots to generalize in real-world industrial environments. Using a Direct Video Action architecture and its FutureVision intelligence layer, Rhoda focuses on turnkey deployments in manufacturing, logistics, and e-commerce—aiming to move robots out of controlled labs and into reliable, adaptive production settings.
