Machine Learning / Reinforcement Learning Infrastructure Engineer

Posted Yesterday
Boston, MA, USA
In-Office
Mid level
Artificial Intelligence • Computer Vision • Hardware • Logistics • Machine Learning • Robotics • Automation
The Era of Superhuman Robotics.

Eka Robotics

Eka Robotics is on a mission to build intelligence for the physical world: robots that are fast, general, and reliable. Our approach, grounded in physics, unlocks superhuman capabilities. We are defining the frontier of robotics research and deployment.

Our team consists of pioneers in robotics and machine learning. We are now hiring to scale our R&D effort. We are looking for hands-on individuals who are excited to help shape the future of robotics.

The Role

We are looking for a Machine Learning / Reinforcement Learning Infrastructure Engineer to shape our training infrastructure. In this role, you will design, implement, and maintain the large-scale model training systems that power our next generation of robot learning.

We believe that world-class infrastructure is the foundation for moving research into production. You will focus on building an exceptional developer experience, creating intuitive and efficient tooling that our engineers and scientists love to use. Your work will directly accelerate our research cycles, making it effortless to test new ideas and scale successful experiments into production training runs. You will work closely with researchers to ensure our infrastructure scales seamlessly from prototyping to large-scale distributed training.

This is a hands-on, high-impact role at the intersection of machine learning, software engineering, and scalable infrastructure.

Responsibilities

  • Own Training Infrastructure: Design, implement, and maintain robust systems for large-scale model training, including job orchestration, scheduling, checkpointing, and experiment tracking.

  • Developer Experience & Tooling: Build streamlined, intuitive abstractions for launching, monitoring, debugging, and reproducing experiments, minimizing friction and maximizing productivity for our research teams.

  • Scale Distributed Training: Work closely with researchers to reliably scale reinforcement learning and machine learning pipelines across compute clusters.

  • Resource Management: Ensure efficient allocation and utilization of cloud-based compute resources while building the foundational systems needed for future scaling.

  • Collaborate with Researchers: Partner with the research team to understand their needs, build infrastructure that supports cutting-edge methods, guide best practices for training at scale, and contribute to core JAX model and training code.

Minimum Qualifications

  • Education: BS, MS, or higher in Computer Science, Computer Engineering, Machine Learning, or a related technical field.

  • Software Engineering: Strong software engineering fundamentals with a proven track record of building ML training infrastructure, internal developer platforms, or scalable systems.

  • Deep Learning Frameworks: Hands-on experience with large-scale training using JAX (preferred), PyTorch, or TensorFlow.

  • Distributed Systems: Familiarity with distributed training, multi-host setups, data pipelines, and managing workloads on cloud platforms or orchestration systems (e.g., Kubernetes, SLURM, GCP, AWS).

  • Communication & Ownership: Strong cross-functional communication skills, a deep ownership mindset, and a passion for building tools that improve the developer experience.

  • Infrastructure & DevOps: Experience building automated testing pipelines, CI/CD for ML workflows, and custom logging/telemetry stacks.

Preferred Qualifications

  • Domain Experience: Background in robotics, reinforcement learning, or other machine learning systems.

  • Systems Design: Experience designing abstractions that balance researcher flexibility with system reliability.

The Company
20 Employees
