Flexion Robotics

Machine Learning Engineer

Zürich, CHE

In-Office

Mid level

Artificial Intelligence • Robotics • Software

The Role

The ML Infra Engineer will develop infrastructure for training large-scale AI models, focusing on data pipelines, distributed training optimization, and system architecture. Responsibilities include designing GPU clusters and evaluating cloud services.

Summary Generated by Built In

About Flexion

At Flexion, we're building the intelligence layer powering the next generation of humanoid robots. Our mission is to accelerate the transition from fragile prototypes to real-world humanoid deployment. We were founded by leading scientists in robot reinforcement learning (ex-Nvidia, ex-ETH Zürich) and are backed by leading international VC firms. In just months, we’ve gone from our first line of code to deploying real humanoid capabilities.

The Role

This is a senior ML engineering role, building out our core compute and data platforms. We’re building the brain for humanoid robots, which involves training large-scale foundation models with vast amounts of data. You'll architect the pipelines that move data from simulators and robots into model training, optimize training workloads, and create the platforms that help our AI engineers train, evaluate, and iterate fast.

You'll join Flexion's experienced Infrastructure team (ex-Google, Meta, Amazon) and take significant ownership of the systems behind our data collection, training and experimentation workflows: from strategic infrastructure decisions, cluster orchestration and distributed training optimization to data platforms, CI, and experiment tooling. This is a senior, on-site role at our Zürich office.

Key Responsibilities

Architect training data platforms and pipelines: build the storage, processing, and serving layers that handle the full data lifecycle, from simulator output and robot telemetry to training datasets. This includes building infrastructure with object storage (S3), parallel filesystems (Lustre), and common data formats (Parquet, WebDataset, LeRobot). Use distributed processing frameworks (Ray, Spark) to transform and validate data at scale.
Optimize distributed training: work with our AI engineers to scale workloads across multi-node GPU clusters, profiling and improving throughput, device utilization, and communication efficiency. This includes optimizing our distributed IsaacLab-based sim-to-real training.
Evaluate and adopt new platforms and technologies: compare cloud providers, GPUaaS platforms, and emerging tooling, owning the decisions on what we adopt as we grow our compute footprint.

Requirements

Professional experience building infrastructure, tooling, or platforms for large-scale ML training workloads.
Strong experience with ML data infrastructure: distributed processing, object storage, metadata/catalog systems, dataset versioning, streaming, shuffling, caching, and high-throughput dataloading.
Hands-on experience training, scaling, or supporting distributed ML workloads, with understanding of DDP, FSDP, NCCL, checkpointing, fault tolerance, and training performance bottlenecks.
Experience with cloud infrastructure such as AWS, GCP, or similar, including compute, networking, storage, and cost/performance tradeoffs.
Experience with job scheduling or orchestration systems such as Slurm, Kubernetes, Ray, or similar.
Proficiency in Python and working knowledge of PyTorch.
Ownership mindset: comfortable making architectural decisions, setting direction, and delivering independently in a fast-moving environment.

Nice to have

Experience with job scheduling and orchestration tools: Slurm, Kubernetes, or both.
Familiarity with common data formats (Parquet, WebDataset, LeRobot).
Familiarity with robotics simulation environments (IsaacLab, IsaacGym, MuJoCo).
Experience with infrastructure-as-code and configuration management (Terraform, Ansible).
Familiarity with experiment tracking platforms (Weights & Biases, MLflow).

Benefits

Competitive compensation package
A front-row seat at one of Europe’s most ambitious robotics companies
An energetic, collaborative team with a bias for action

Skills Required

3+ years of experience with large-scale deep learning systems
Hands-on experience training large models in multi-node GPU setups
Strong experience with at least one cloud platform (AWS or GCP)
Experience with job scheduling tools (Slurm, Kubernetes)
Experience building data pipelines and managing storage
Proficiency in Python

The Company

HQ: Zürich

18 Employees

What We Do

We at Flexion Robotics (flexion.ai) are a young company in Zürich working on the next generation of humanoid robot software to enable robots to perform useful tasks autonomously. We work dynamically and move fast. The team is still fairly small, and every new employee at this stage will have significant ownership of their projects.