Senior Distributed Systems Engineer

Posted 10 Days Ago
Sunnyvale, CA, USA
In-Office
200K-400K Annually
Senior level
Information Technology • Automation • Manufacturing
The Role
Design and optimize communication stacks and runtime for large-scale distributed training (including MoE and hybrid parallelism). Improve collective performance, reduce latency, ensure fault-tolerant execution, debug NCCL/UCX/RDMA issues, and co-design topology-aware orchestration for 1,000+ GPU clusters.
Summary Generated by Built In
About the Institute of Foundation Models
The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability are co-designed across model architecture, communication systems, runtime, and hardware topology.
This role sits at the core of that effort — driving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.
 
The Mission
We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads.
This is not a network operations role. This is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design.
·       Design and optimize expert-parallel and hybrid-parallel communication patterns
·       Drive high-performance hierarchical collectives for MoE workloads
·       Co-design runtime orchestration with communication topology awareness
·       Reduce tail latency and improve determinism across thousands of GPUs
·       Architect fault-tolerant distributed execution under real-world cluster failures
Core Technical Scope
·       Communication-compute overlap and topology-aware collective optimization
·       Deep debugging of NCCL, RDMA, and custom communication layers
·       Hybrid expert parallel strategies in modern large-scale MoE systems
·       Elastic and resilient distributed job orchestration concepts
·       Congestion analysis and routing optimization across InfiniBand/RoCE fabrics
·       Microbenchmarking and performance modeling for communication-heavy workloads
Expected Technical Depth
·       Hybrid expert parallel communication for Mixture-of-Experts training
·       Scaling behavior under network pressure
·       Distributed orchestration for elastic, large-scale training
·       Fault detection and recovery in distributed GPU workloads
·       Cross-layer bottlenecks: GPU ↔ NIC ↔ PCIe ↔ NVSwitch ↔ Fabric ↔ Scheduler
Required Background
·       Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)
·       Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA
·       Deep familiarity with NCCL and/or UCX internals
·       Strong systems programming ability (C/C++, Rust, or Go)
·       Strong familiarity with modern model training frameworks such as PyTorch
·       Ability to troubleshoot and profile training performance issues related to communication bottlenecks
·       Ability to translate research ideas into production-grade optimizations
·       Experience debugging distributed hangs, desynchronization, and performance regressions
What We Mean by "Hardcore"
·       You can explain why communication degrades at scale and how to fix it
·       You have improved real cluster throughput via communication redesign
·       You can trace a distributed hang across ranks and identify the root cause
·       You are comfortable working at the boundary between hardware and runtime
Application Requirements
·       Include a link to your GitHub (required)
·       Provide links to relevant distributed systems, HPC, or large-scale training projects
·       Include a list of publications and/or public technical reports (if applicable)
·       Describe the hardest distributed debugging problem you solved
·       Include measurable performance improvements you have delivered
Academic Qualifications
Master’s degree, or Bachelor’s degree plus 1 year of relevant experience.

Visa Sponsorship
This position is eligible for visa sponsorship.
 
Benefits Include
·       Comprehensive medical, dental, and vision benefits
·       Bonus
·       401K Plan
·       Generous paid time off, sick leave and holidays
·       Paid Parental Leave
·       Employee Assistance Program
·       Life insurance and disability

The Company
HQ: Essen
3,924 Employees
Year Founded: 1969

What We Do

First a passion, then an idea transformed into success – when it comes to pioneering automation and digitalisation technology, the ifm group is the ideal partner. Since its foundation in 1969, ifm has developed, produced and sold sensors, controllers, software and systems for industrial automation and for SAP-based solutions for supply chain management and shop floor integration worldwide. As one of the pioneers of Industry 4.0, ifm develops and implements consistent solutions to digitalise the entire value chain “from sensor to ERP”. Today, the second-generation family-run ifm group has more than 8,750 employees and is one of the worldwide market leaders. The group combines the internationality and innovative strength of a growing group of companies with the flexibility and close customer contact of a medium-sized company.
