ML Platform Engineer

Reposted 15 Days Ago
Austin, TX, USA
In-Office
Mid level
Information Technology • Robotics
The Role
The ML Platform Engineer will build scalable architecture for ML training workloads on Kubernetes, optimize system performance, and collaborate with ML teams to improve efficiency.
About the team

The ML Platform team at Avride builds the infrastructure that powers large-scale ML training and data processing for autonomous driving. We sit between Cloud Platform and ML engineers, turning low-level compute, storage, and networking primitives into an ML platform that teams actually use — scalable orchestration, distributed compute, and production-grade tooling for the full model lifecycle.


About the role

As an ML Platform Engineer at Avride, you'll own critical pieces of the ML stack: workflow orchestration, distributed execution, resource governance, and performance. You will shape how ML teams across the company run experiments and train models at scale. You will build the abstractions and services that make training workloads reliable, cost-efficient, and fast, helping ML teams run on Kubernetes with strong reliability and an excellent developer experience.


What you will do
  • Build and scale our ML compute platform on Kubernetes, using Argo Workflows for training, evaluation, and data processing orchestration
  • Design and implement core platform capabilities, including a Ray-based internal SDK for distributed execution, and multi-tenant resource governance — scheduling, priorities, quotas, and policy enforcement across GPU, CPU, memory, and IO
  • Improve end-to-end training throughput and platform efficiency by optimizing data access patterns, caching, and removing bottlenecks in storage, network, and resource contention
  • Work directly with ML teams to debug complex workload issues, drive root-cause analysis, and turn recurring problems into platform-level fixes
  • Evaluate, integrate and extend open-source tooling (Argo Workflows, Ray, Kubernetes ecosystem) to meet evolving platform needs

What you will need
  • Strong proficiency in Python or Go; C++ is a plus
  • Track record of designing and building scalable, maintainable systems and services
  • Experience operating production services end-to-end: APIs, reliability practices, observability
  • Deep knowledge of Kubernetes: how scheduling, resource management, controllers, and pod lifecycle actually behave under pressure
  • Solid Linux and systems debugging skills: performance investigation, networking, storage/IO
  • Ability to troubleshoot complex production issues across logs, metrics, and traces and drive them to resolution

Nice to have
  • Experience with Argo Workflows, Ray, MLflow, or comparable distributed ML tooling
  • Hands-on experience building or operating large-scale ML training systems: GPU scheduling, distributed training, training data pipelines
  • Track record of optimizing resource usage and performance in distributed environments

Candidates are required to be authorized to work in the U.S. The employer is not offering relocation sponsorship, and remote work options are not available.

Avride is an equal opportunity employer and is committed to providing reasonable accommodations to qualified applicants and employees with disabilities to ensure they have equal access to employment opportunities. Avride complies with the Americans with Disabilities Act (ADA). If you need a reasonable accommodation to assist with the application or hiring process, or to perform the essential functions of a job, please email [email protected].

The Company
HQ: Austin, TX
236 Employees
Year Founded: 2020

What We Do

Avride is a leading developer in the autonomous vehicle and delivery robot industry. Our dynamic team, composed of a few hundred engineers, develops and operates autonomous cars and delivery robots across the globe, shaping the future of mobility and logistics. At Avride, we are committed to making the roads safer and more accessible for everyone. At the core of our philosophy is the belief in the transformative power of technology. Every product we develop, every test we conduct, and every service we launch is anchored in our vision of creating a safer and more sustainable world with the help of cutting-edge technologies and breakthrough solutions.
