Senior Distributed Systems Engineer

ifm

Reposted 4 Days Ago

Sunnyvale, CA, USA

In-Office

200K-400K Annually

Senior level

Information Technology • Automation • Manufacturing

The Role

Design and optimize communication stacks and runtime for large-scale distributed training (including MoE and hybrid parallelism). Improve collective performance, reduce latency, ensure fault-tolerant execution, debug NCCL/UCX/RDMA issues, and co-design topology-aware orchestration for 1,000+ GPU clusters.

Summary Generated by Built In

About the Institute of Foundation Models

The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability are co-designed across model architecture, communication systems, runtime, and hardware topology.

This role sits at the core of that effort — driving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.

The Mission

We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads.

This is not a network operations role. This is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design.

· Design and optimize expert-parallel and hybrid-parallel communication patterns

· Drive high-performance hierarchical collectives for MoE workloads

· Co-design runtime orchestration with communication topology awareness

· Reduce tail latency and improve determinism across thousands of GPUs

· Architect fault-tolerant distributed execution under real-world cluster failures

Core Technical Scope

· Communication-compute overlap and topology-aware collective optimization

· Deep debugging of NCCL, RDMA, and custom communication layers

· Hybrid expert parallel strategies in modern large-scale MoE systems

· Elastic and resilient distributed job orchestration concepts

· Congestion analysis and routing optimization across InfiniBand/RoCE fabrics

· Microbenchmarking and performance modeling for communication-heavy workloads

Expected Technical Depth

· Hybrid expert parallel communication for Mixture-of-Experts training

· Scaling behavior under network pressure

· Distributed orchestration for elastic, large-scale training

· Fault detection and recovery in distributed GPU workloads

· Cross-layer bottlenecks: GPU ↔ NIC ↔ PCIe ↔ NVSwitch ↔ Fabric ↔ Scheduler

Required Background

· Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)

· Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA

· Deep familiarity with NCCL and/or UCX internals

· Strong systems programming ability (C/C++, Rust, or Go)

· Strong familiarity with modern model training frameworks such as PyTorch

· Ability to troubleshoot and profile training performance issues related to communication bottlenecks

· Ability to translate research ideas into production-grade optimizations

· Experience debugging distributed hangs, desynchronization, and performance regressions

What We Mean by "Hardcore"

· You can explain why communication degrades at scale and how to fix it

· You have improved real cluster throughput via communication redesign

· You can trace a distributed hang across ranks and identify the root cause

· You are comfortable working at the boundary between hardware and runtime

Application Requirements

· Include a link to your GitHub (required)

· Provide links to relevant distributed systems, HPC, or large-scale training projects

· Include a list of publications and/or public technical reports (if applicable)

· Describe the hardest distributed debugging problem you solved

· Include measurable performance improvements you have delivered

Academic Qualifications

Master’s degree, or Bachelor’s degree plus 1 year of relevant experience.

Visa Sponsorship

This position is eligible for visa sponsorship.

Benefits Include

· Comprehensive medical, dental, and vision benefits

· Bonus

· 401(k) plan

· Generous paid time off, sick leave, and holidays

· Paid parental leave

· Employee Assistance Program

· Life insurance and disability

The Company

HQ: Essen

3,924 Employees

Year Founded: 1969

What We Do

First a passion, then an idea transformed into success – when it comes to pioneering automation and digitalisation technology, the ifm group is the ideal partner. Since its foundation in 1969, ifm has developed, produced and sold sensors, controllers, software and systems for industrial automation and for SAP-based solutions for supply chain management and shop floor integration worldwide. As one of the pioneers of Industry 4.0, ifm develops and implements consistent solutions to digitalise the entire value chain “from sensor to ERP”. Today, the second-generation family-run ifm group has more than 8,750 employees and is one of the worldwide market leaders. The group combines the internationality and innovative strength of a growing group of companies with the flexibility and close customer contact of a medium-sized company.