Inference Optimization ML Engineer

Posted Yesterday
Mountain View, CA, USA
In-Office
Mid level
Artificial Intelligence • Computer Vision • Hardware • Robotics
The Role
Optimize inference performance of large multimodal foundation models across cloud and on-robot targets. Diagnose bottlenecks, apply quantization/pruning/distillation, tune kernels (CUDA/Triton), build benchmarking and regression detection, and translate research models into deployment-ready implementations.

At Rhoda AI, we’re building the next generation of generalist intelligent robots. We own the full robotics stack, from high-performance hardware and robot systems to the infrastructure and state-of-the-art foundation world models that control our robots. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling long-tail edge cases, made possible by our cutting-edge research and end-to-end system design. We've raised over $400M and are investing aggressively in model research, infrastructure, hardware development, and manufacturing scale-up to make generalist robotics a reality.

We're looking for an Inference Optimization MLE to help build and operate the systems that make our foundation models run fast and efficiently in production. You'll be responsible for squeezing maximum performance out of large multimodal models across cloud and on-robot deployment targets. You will work closely with research and robotics teams to close the gap between training and real-world deployment.

What You'll Do

  • Own inference performance end-to-end — diagnose and improve latency, throughput, and efficiency of large foundation models in production

  • Build systematic performance attribution: latency decomposition (compute vs. memory bandwidth vs. I/O), bottleneck identification, and prioritization across model families

  • Apply and develop optimization techniques including quantization, pruning, distillation, operator fusion, and model compilation (e.g., TensorRT, torch.compile, XLA)

  • Optimize attention mechanisms, KV caching, and memory layouts for large multimodal models (vision, video, language, proprioception)

  • Work with kernel-level tooling (e.g., CUDA, Triton) to identify hotspots and implement or tune custom kernels where needed

  • Build benchmarking and regression detection infrastructure: latency baselines, throughput curves, and automated detection of performance regressions across model versions

  • Collaborate closely with research engineers to translate model innovations into optimized, deployment-ready implementations

What We're Looking For

  • 3+ years of experience in inference optimization, ML systems, or a closely related field

  • Deep hands-on experience with modern ML stacks (PyTorch required; JAX a plus)

  • Strong understanding of compute, memory bandwidth, and I/O bottlenecks in large model inference

  • Experience with model optimization techniques: quantization (INT8/FP8/AWQ), distillation, pruning, and compilation

  • Familiarity with inference serving frameworks and runtimes (e.g., Triton Inference Server, TensorRT, vLLM, TorchServe)

  • Exceptional debugging and measurement ability: turn "inference is slow" into clear bottlenecks, experiments, and validated improvements

  • High ownership mindset and comfort in a fast-moving environment

Nice to Have (But Not Required)

  • GPU kernel or compiler-level experience (CUDA, Triton, graph capture, operator fusion)

  • Experience with multimodal or video model inference (variable-length sequences, packing/bucketing)

  • Familiarity with edge/cloud hybrid deployment patterns and on-robot inference constraints

  • Experience with speculative decoding, continuous batching, or other LLM serving optimizations

  • Background in streaming or low-latency systems relevant to real-time robot control

Why This Role

  • Direct leverage on research velocity and real-world robot performance — every efficiency gain you make accelerates model iteration and tightens the loop between model and robot behavior

  • Own the optimization layer that determines how quickly and efficiently our foundation models run in the real world — high ownership, high impact, small elite team


The Company
73 Employees
Year Founded: 2024

What We Do

Rhoda AI builds robot foundation models that learn from internet-scale video to enable manipulation-capable robots to generalize in real-world industrial environments. Using a Direct Video Action architecture and its FutureVision intelligence layer, Rhoda focuses on turnkey deployments in manufacturing, logistics, and e-commerce—aiming to move robots out of controlled labs and into reliable, adaptive production settings.
