Senior Performance Engineer - Pre-training (f/m/d)

Reposted 20 Days Ago
Heidelberg, Baden-Württemberg, DEU
Hybrid
Senior level
Artificial Intelligence • Information Technology • Internet of Things
The Role
Engineer and optimize large-scale foundation model pretraining pipelines. Profile end-to-end training, eliminate system and kernel bottlenecks, design and tune distributed parallelism strategies, and collaborate with researchers to make architectures hardware-efficient for high-throughput LLM training on GPU clusters.
Our Mission

Aleph Alpha is one of the few companies in Europe doing serious foundation model pre-training. Our customers - in finance, manufacturing, public administration - need models that understand German, meet European regulatory requirements, and work reliably in high-stakes settings. We're building that in Heidelberg.

We are hiring a Performance Engineer to grow our pre-training efficiency team. If you are excited about making models fast, this is the role for you!

Team Culture

At Aleph Alpha, we foster a culture built on ownership, autonomy, and empowerment. Teams and individual contributors are trusted to take responsibility for their work and drive meaningful impact. We maintain a flat organizational structure with efficient, supportive management that enables quick decision‑making, open communication, and a strong sense of shared purpose.

About the role:

You will engineer the systems required to train foundation models at scale. Your objective is to maximize hardware utilization and training throughput on our large-scale GPU clusters (thousands of NVIDIA Blackwell GPUs). You will work at the intersection of deep learning frameworks, distributed systems, and GPU microarchitecture, eliminating bottlenecks from the Python layer down to the GPU kernel.

This role is for Aleph Alpha Research GmbH.

Your responsibilities:

  • End-to-End Optimization: Profile training loops using PyTorch Profiler, Nsight Systems and Nsight Compute to identify system- and kernel-level bottlenecks in order to maximize model throughput.

  • Distributed Strategy and Topology: Configure and tune composite parallelism strategies (e.g. TP, DP, HSDP/FSDP, EP), optimizing load balance, minimizing critical-path bottlenecks, and managing communication-to-computation trade-offs for large-scale LLM training.

  • Hardware-Aware Modeling: Partner with AI Researchers to define model architectures for hardware efficiency without compromising convergence.

Your Profile

Basic Qualifications

  • Are proficient in Python and the PyTorch library.

  • Have a strong engineering background in parallel and/or distributed systems, with a proven track record of excellence.

  • Have hands-on experience with modern machine learning techniques (especially large language models and their life cycle).

  • Deeply understand the CUDA programming model.

  • Have experience in distributed programming with APIs like NCCL or MPI.

  • Have experience analysing profiling traces with tools such as PyTorch Profiler and NVIDIA Nsight.

  • Please note this role requires regular on-site collaboration in Heidelberg as a member of the Training Efficiency Team.

Preferred Qualifications

  • Contributions to modern distributed training frameworks (e.g., TorchTitan, Megatron-LM, DeepSpeed).

  • Familiarity with low-precision training formats (MXFP4, MXFP8) and their impact on numerical stability and throughput.

  • A deep understanding of NCCL communication primitives, NVSHMEM, or CUDA IPC, and their performance characteristics.

  • A proven track record of implementing and optimising modern transformer-based model training.

  • A proven track record working on the NVIDIA Blackwell architecture.

Compensation and Benefits
  • Become part of an AI revolution!

  • 30 days of paid vacation

  • Access to a variety of fitness & wellness offerings via Wellhub

  • Mental health support through nilo.health

  • Substantially subsidized company pension plan for your future security

  • Subsidized Germany-wide transportation ticket

  • Budget for additional technical equipment

  • Flexible working hours for better work-life balance and hybrid working model

  • Virtual Stock Option Plan

  • JobRad® Bike Lease


The Company
Baden-Württemberg
254 Employees
Year Founded: 2019

What We Do

We are an AI research and application company that researches, develops, and operationalises large-scale AI models for language, image data, and strategy, thereby contributing to securing Europe's digital sovereignty.
