Senior AI Researcher - Pre-training (f/m/d)

Heidelberg, Baden-Württemberg, DEU
Hybrid
Senior level
Artificial Intelligence • Information Technology • Internet of Things
The Role
As a Senior AI Researcher, you will optimize training recipes, design architecture, and scale models for foundation model pre-training, ensuring model quality and stability.

Our Mission

Aleph Alpha is one of the few companies in Europe doing serious foundation model pre-training. Our customers — in finance, manufacturing, and public administration — need models that understand German, meet European regulatory requirements, and work reliably in high-stakes settings. We’re building that in Heidelberg.

We are hiring a Senior AI Researcher to join our Pre-training team and advance the architecture and training of our next generation of foundation models. If you are excited about designing inference-efficient architectures, optimising training recipes that scale reliably, and training models on a large-scale cluster (thousands of NVIDIA Blackwell GPUs), we would love to hear from you.

Team Culture

We foster a culture built on ownership, autonomy, and empowerment. Teams and individual contributors are trusted to take responsibility for their work and drive meaningful impact. We maintain a flat organisational structure with efficient, supportive management that enables quick decision-making, open communication, and a strong sense of shared purpose. We collaborate closely on complex technical problems, working in pairs or using mob programming to resolve challenging issues.

About the Role

As a Senior AI Researcher in Pre-training, you will work on the core technical problems that determine whether large-scale pre-training succeeds: architecture, optimisation, stability, and scaling up.

You will work at the intersection of model architecture, training dynamics, and large-scale distributed training, translating empirical observations into principled training decisions. From small-scale proxy experiments to multi-thousand-GPU runs, you will ensure our models converge as expected and scale efficiently.

We are looking for someone who combines significant research experience with strong engineering ability. You should be comfortable reasoning mathematically about training behaviour, designing rigorous experiments, and maintaining a high-quality production codebase.

This is a high-leverage role: the training decisions you make directly determine model quality, run reliability, inference efficiency, and how quickly we can improve the next generation of models. You’ll have direct influence on the models we ship.

Your Responsibilities

  • Training Recipe Optimisation: Own and improve core elements of the training recipe, including optimiser settings, learning rate schedules, initialisation, regularisation, and other choices that materially affect convergence, stability, and final model quality.

  • Scaling Strategy and Hyperparameter Transfer: Develop and validate scaling strategies for models and training recipes, including hyperparameter scaling, scale-up methodology, and empirical scaling laws. You will use carefully designed experiments to predict large-scale behaviour from smaller runs and reduce uncertainty in major training decisions.

  • Model Architecture Development: Design, implement, and evaluate architectural improvements in PyTorch, with a focus on training stability, scalability, efficiency in training and inference, and overall model performance.

  • Training Stability and Diagnostics: Investigate and resolve convergence issues such as loss spikes, divergence, optimiser pathologies, or numerical instability, and develop diagnostics that improve visibility into training health.

  • System-Model Co-Design: Collaborate with the Compute Performance, Data, Evaluation, and Post-Training teams to ensure full pipeline alignment across the model lifecycle, while satisfying performance requirements and hardware constraints (e.g., memory bandwidth and communication topology).

  • Distributed Training Debugging: Diagnose and resolve complex failures in large-scale distributed runs, including communication failures, race conditions, synchronisation issues, and other hard-to-reproduce problems.

Core Qualifications

  • You are proficient in Python and deeply familiar with PyTorch-based training workflows.

  • You have a strong track record in machine learning research and software engineering, demonstrated through shipped models, impactful open-source contributions, or published research.

  • You have a strong mathematical foundation and are comfortable reasoning formally about optimisation, scaling behaviour, and training dynamics.

  • You deeply understand transformer training dynamics, optimisation, and the behaviour of large distributed training jobs.

  • You can design rigorous experiments, reason clearly from noisy results, and translate empirical observations into robust training decisions.

  • You apply strong software engineering practices, including writing maintainable, well-tested code and supporting reproducible experimentation workflows.

  • You are able to implement complex model architectures efficiently and reliably and to debug complex issues across model code, training dynamics, and distributed systems.

  • You collaborate effectively within a research and engineering team and communicate clearly about your work across Pre-training and the broader AAR/AA organisation.

  • You are able to work in Germany and collaborate regularly on site in Heidelberg as part of the Pre-training team.

Preferred Qualifications

  • You have experience training large language models (LLMs) or multimodal models on large GPU clusters.

  • You have experience with distributed training frameworks such as torchtitan, Megatron-LM, or DeepSpeed.

  • You have experience with scaling laws, hyperparameter transfer, or other methods for predicting large-scale training behaviour from smaller experiments.

  • You have experience diagnosing and improving training stability in large runs, including divergence, numerical instability, or optimiser pathologies.

  • You have experience profiling, debugging, or improving the performance of large distributed training jobs.

  • You are familiar with sparse training approaches such as Mixture-of-Experts and the associated systems and routing trade-offs.

  • You have a track record of research excellence demonstrated through publications in top-tier conferences (e.g. NeurIPS, ICML, ICLR), impactful open-source contributions, or other significant technical work.

  • We do not require prior experience in low-level kernel optimisation for this role, but we value curiosity about the hardware and systems constraints that shape model design and training at scale.

What we offer

  • Become part of an AI revolution!

  • 30 days of paid vacation

  • Access to a variety of fitness & wellness offerings via Wellhub

  • Mental health support through nilo.health

  • Substantially subsidized company pension plan for your future security

  • Subsidized Germany-wide transportation ticket

  • Budget for additional technical equipment

  • Flexible working hours and a hybrid working model for better work-life balance

  • Virtual Stock Option Plan

  • JobRad® Bike Lease

Skills Required

  • Proficient in Python and familiar with PyTorch-based training workflows
  • Strong track record in machine learning research and software engineering
  • Strong mathematical foundations and ability to reason about optimisation and scaling behaviour
  • Understanding of transformer training dynamics and large distributed training jobs
  • Experience designing rigorous experiments and translating empirical observations into robust training decisions
  • Strong software engineering practices, including maintainable code and reproducible workflows
  • Ability to implement complex model architectures and debug issues in distributed systems
  • Ability to collaborate effectively in a research team and communicate work clearly
  • Ability to work in Germany and collaborate on site

The Company
Baden-Württemberg
254 Employees
Year Founded: 2019

What We Do

We are an AI research and application company that researches, develops and operationalises large-scale AI models for language, image data and strategy, thereby helping to secure Europe's digital sovereignty.
