Senior AI Researcher - Pre-training (f/m/d)

Aleph Alpha

Sorry, this job was removed at 10:50 a.m. (UTC) on Wednesday, Jul 29, 2026

Heidelberg, Baden-Württemberg, DEU

Hybrid

Senior level

Artificial Intelligence • Information Technology • Internet of Things

The Role

As a Senior AI Researcher, you will optimize training recipes, design architecture, and scale models for foundation model pre-training, ensuring model quality and stability.

Summary Generated by Built In

Our Mission

Aleph Alpha is one of the few companies in Europe doing serious foundation model pre-training. Our customers — in finance, manufacturing, and public administration — need models that understand German, meet European regulatory requirements, and work reliably in high-stakes settings. We’re building that in Heidelberg.

We are hiring a Senior AI Researcher to join our Pre-training team and advance the architecture and training of our next generation of foundation models. If you are excited about designing inference-efficient architectures, optimising training recipes that scale reliably, and training models on a large-scale cluster (thousands of NVIDIA Blackwell GPUs), we would love to hear from you.

Team Culture

We foster a culture built on ownership, autonomy, and empowerment. Teams and individual contributors are trusted to take responsibility for their work and drive meaningful impact. We maintain a flat organisational structure with efficient, supportive management that enables quick decision-making, open communication, and a strong sense of shared purpose. We collaborate closely on complex technical problems, working in pairs or using mob programming to resolve challenging issues.

About the Role

As a Senior AI Researcher in Pre-training (f/m/d), you will own the critical technical levers that determine the success of our next-generation models: architecture, optimization, stability, and scaling.

Working at the high-leverage intersection of research and engineering, you will translate mathematical reasoning and empirical observations into principled training decisions, from small-scale proxy experiments to multi-thousand-GPU runs.

We are looking for an expert who can combine rigorous experimental design with high-quality production code, directly influencing model quality, run reliability, and the efficiency of the models we ship.

Your Responsibilities

Recipe & Architecture Optimization: Own core elements of the training recipe (optimizers, schedules, initialization) and design PyTorch-based architectural improvements to maximize convergence, stability, and training efficiency.
Scaling Strategy & Predictability: Develop hyperparameter scaling laws and scale-up methodologies, using small-scale proxy experiments to reliably predict multi-thousand-GPU behavior and de-risk major training decisions.
Stability, Diagnostics & Debugging: Investigate complex convergence issues (loss spikes, divergence) and resolve hard-to-reproduce distributed system failures like communication bottlenecks, race conditions, and synchronization errors.
System-Model Co-Design: Partner with Compute Performance, Data, Evaluation, and Post-Training teams to align the model lifecycle with hardware constraints, memory bandwidth, and communication topologies.

Core Qualifications

You are proficient in Python and deeply familiar with PyTorch-based training workflows.
You have a strong track record in machine learning research and software engineering, demonstrated through shipped models, impactful open-source contributions, or published research.
You have a strong mathematical foundation and are comfortable reasoning formally about optimisation, scaling behaviour, and training dynamics.
You deeply understand transformer training dynamics, optimisation, and the behaviour of large distributed training jobs.
You can design rigorous experiments, reason clearly from noisy results, and translate empirical observations into robust training decisions.
You have hands-on experience pre-training large models (e.g., 7B+ parameters) on substantial infrastructure (e.g., 100+ GPU clusters).
You apply strong software engineering practices, including writing maintainable, well-tested code and supporting reproducible experimentation workflows.
You are able to implement complex model architectures efficiently and reliably and to debug complex issues across model code, training dynamics, and distributed systems.
You collaborate effectively within a research and engineering team and communicate clearly about your work across Pre-training and the broader AAR/AA organization.
You are able to work in Germany and collaborate regularly on site in Heidelberg as part of the Pre-training team.

Preferred Qualifications

(We encourage you to apply even if you don't check every box!)

Large-Scale Training: Hands-on experience training LLMs or multimodal models on large GPU clusters using distributed frameworks (e.g., Megatron-LM, DeepSpeed, torchtitan).
Predictive Scaling: Familiarity with scaling laws, hyperparameter transfer, or methods for predicting large-scale training behavior from smaller proxy runs.
Stability & Performance: Experience profiling distributed jobs and diagnosing training anomalies like loss spikes, numerical instability, or optimizer pathologies.
Advanced Architectures: Exposure to sparse training approaches (e.g., Mixture-of-Experts) and an understanding of their routing and systems trade-offs.
Track Record of Impact: Demonstrated research excellence through top-tier publications (NeurIPS, ICML, ICLR), impactful open-source contributions, or significant shipped technical work.
Systems Curiosity: Low-level kernel optimization is not required, but we highly value a strong curiosity about the hardware and systems constraints that shape scale.

What we offer

Become part of an AI revolution!
30 days of paid vacation
Access to a variety of fitness & wellness offerings via Wellhub
Mental health support through nilo.health
Substantially subsidized company pension plan for your future security
Subsidized Germany-wide transportation ticket
Budget for additional technical equipment
Flexible working hours for better work-life balance and hybrid working model
JobRad® Bike Lease

Skills Required

Proficient in Python and familiar with PyTorch-based training workflows
Strong track record in machine learning research and software engineering
Strong mathematical foundations and ability to reason about optimisation and scaling behaviour
Understanding of transformer training dynamics and large distributed training jobs
Experience designing rigorous experiments and translating empirical observations into robust training decisions
Strong software engineering practices, including maintainable code and reproducible workflows
Ability to implement complex model architectures and debug issues in distributed systems
Ability to collaborate effectively in a research team and communicate work clearly
Ability to work in Germany and collaborate on site

The Company

HQ: Heidelberg

254 Employees

Year Founded: 2019

What We Do

We are an AI research and application company that researches, develops and operationalises large-scale AI models for language, image data and strategy, thereby contributing to securing Europe's digital sovereignty.