Research Engineer (LLM Training and Performance)

9 Locations
In-Office
Senior level
Software
The Role
As a Research Engineer, you will enhance LLM training performance, manage the training stack, and optimize multi-node pipelines for large-scale machine learning models.
Summary Generated by Built In

At JetBrains, code is our passion. Ever since we started back in 2000, we have been striving to make the strongest, most effective developer tools on earth. By automating routine checks and corrections, our tools speed up production, freeing developers to grow, discover, and create.

We’re looking for a Research Engineer to own the training stack and model architecture for our Mellum LLM family. The brief is easy to state and hard to deliver: make training faster, cheaper, and more stable at large scale. You’ll profile, design, and implement changes across the training pipeline, from model architecture down to custom GPU kernels, as needed.

As part of our team, you will:
  • Be responsible for improving end-to-end performance for multi-node LLM pre-training and post-training pipelines.
  • Profile hotspots (Nsight Systems/Compute, NVTX) and fix them using compute/comm overlap, kernel fusion, scheduling, etc.
  • Design and evaluate architecture choices (depth/width, attention variants including GQA/MQA/MLA/Flash-style, RoPE scaling/NTK, and MoE routing and load-balancing).
  • Implement custom ops (Triton and/or CUDA C++), integrate via PyTorch extensions, and upstream when possible.
  • Push memory/perf levers: FSDP/ZeRO, activation checkpointing, FP8/TE, tensor/pipeline/sequence/expert parallelism, NCCL tuning.
  • Harden large runs by building elastic and fault-tolerant training setups, ensuring robust checkpointing, strengthening reproducibility, and improving resilience to preemption.
  • Keep the data path fast using streaming and sharded data loaders and tokenizer pipelines, as well as improve overall throughput and cache efficiency.
  • Define the right metrics, build dashboards, and deliver steady improvements.
  • Run both pre-training and post-training (including SFT, RLHF, and GRPO-style methods) efficiently across sizable clusters.
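To make the FSDP/ZeRO bullet above concrete, here is a rough back-of-the-envelope sketch (ours, not from the posting) of per-GPU model-state memory under the different ZeRO sharding stages, assuming mixed-precision Adam: fp16 params (2 B) and grads (2 B) plus fp32 master weights, momentum, and variance (12 B) per parameter. Activations, buffers, and fragmentation are ignored.

```python
def zero_mem_per_gpu_gib(n_params: int, n_gpus: int, stage: int) -> float:
    """Approximate model-state memory per GPU (GiB) under ZeRO sharding."""
    P, G, O = 2, 2, 12  # bytes/param: fp16 params, fp16 grads, fp32 Adam state
    if stage == 0:        # no sharding (plain DDP): everything replicated
        per_param = P + G + O
    elif stage == 1:      # shard optimizer states only
        per_param = P + G + O / n_gpus
    elif stage == 2:      # shard optimizer states + gradients
        per_param = P + (G + O) / n_gpus
    elif stage == 3:      # shard params too (FSDP-style full sharding)
        per_param = (P + G + O) / n_gpus
    else:
        raise ValueError("stage must be 0-3")
    return n_params * per_param / 2**30

# A 7B-parameter model on 64 GPUs, stage by stage:
for stage in range(4):
    print(f"ZeRO-{stage}: {zero_mem_per_gpu_gib(7_000_000_000, 64, stage):.1f} GiB/GPU")
```

The point of the exercise: stage 3 divides all 16 B/param across the group, which is why full sharding (FSDP/ZeRO-3) is what makes multi-billion-parameter models fit at all, at the cost of extra all-gather traffic.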
We’ll be happy to bring you on board if you have:
  • Strong PyTorch and PyTorch Distributed experience, including running multi-node jobs on tens to hundreds of GPUs.
  • Hands-on experience with Megatron-LM/Megatron-Core/NeMo, DeepSpeed, or serious FSDP/ZeRO expertise.
  • Real profiling expertise (Nsight Systems/Compute, nvprof) and experience with NVTX-instrumented workflows.
  • GPU programming skills with Triton and/or CUDA, and the ability to write, test, and debug kernels.
  • A solid understanding of NCCL collectives, as well as topology and fabric effects (IB/RoCE), and how they show up in traces.
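As a toy illustration of the NCCL point (our sketch, not part of the posting): the bandwidth-optimal ring all-reduce that NCCL commonly selects can be simulated in plain Python as n − 1 reduce-scatter steps followed by n − 1 all-gather steps, with each rank exchanging one chunk per step with its ring neighbors.

```python
def ring_allreduce(data):
    """Simulate a ring all-reduce (sum) over n ranks.

    data: list of n equal-length lists, one per rank; length must divide by n.
    Returns each rank's final buffer; all should equal the elementwise sum.
    """
    n = len(data)
    m = len(data[0]) // n                      # chunk size
    assert m * n == len(data[0]), "length must split into n equal chunks"
    buf = [list(d) for d in data]

    # Reduce-scatter: after n-1 steps, rank r owns the fully summed chunk (r+1) % n.
    for s in range(n - 1):
        payloads = []
        for r in range(n):                     # snapshot sends before applying
            c = (r - s) % n
            payloads.append((c, buf[r][c * m:(c + 1) * m]))
        for r in range(n):
            c, payload = payloads[r]
            dst = (r + 1) % n                  # send to the next rank on the ring
            for i, v in enumerate(payload):
                buf[dst][c * m + i] += v       # accumulate into receiver's chunk

    # All-gather: circulate the owned (fully reduced) chunks, overwriting.
    for s in range(n - 1):
        payloads = []
        for r in range(n):
            c = (r + 1 - s) % n
            payloads.append((c, buf[r][c * m:(c + 1) * m]))
        for r in range(n):
            c, payload = payloads[r]
            dst = (r + 1) % n
            buf[dst][c * m:(c + 1) * m] = payload
    return buf
```

Each rank sends 2·(n − 1)/n of the buffer in total, independent of n, which is why the ring algorithm is bandwidth-optimal; in real traces the per-step latency term and the IB/RoCE topology are exactly where this idealized model and reality diverge.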
Our ideal candidate would have experience with:
  • FlashAttention-2 and 3, CUTLASS and CuTe, TransformerEngine and FP8, Inductor, AOTAutograd, and torch.compile.
  • MoE at scale (expert parallel, router losses, capacity management) and long-context tricks (ALiBi/YaRN/NTK scaling).
  • Kubernetes or SLURM at scale, placement and affinity tuning, as well as AWS, GCP, and Azure GPU fleets.
  • Web-scale data plumbing (streaming datasets, Parquet and TFRecord, tokenizer perf), eval harnesses, and benchmarking.
  • Safety and post-training methods, such as DPO, ORPO, GRPO, and reward models.
  • Inference ecosystems such as vLLM and paged KV.
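For the post-training bullet, the core of DPO (one of the methods listed) fits in a few lines. This is a hedged pure-Python sketch for a single preference pair, using summed token log-probabilities from the policy and a frozen reference model; the variable names are illustrative, not from any specific codebase.

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Inputs are sequence log-probs of the chosen/rejected responses under
    the policy (pi_*) and the frozen reference (ref_*).
    loss = -log sigmoid(beta * (log-ratio_chosen - log-ratio_rejected))
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Numerically stable -log(sigmoid(margin)) = log(1 + exp(-margin)):
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# With no preference signal (margin = 0) the loss is log 2:
print(dpo_loss(0.0, 0.0, 0.0, 0.0))  # prints 0.6931471805599453
```

The gradient of this loss pushes the policy's log-ratio for the chosen response above that of the rejected one, with `beta` controlling how far the policy may drift from the reference, which is the sense in which DPO replaces an explicit reward model plus RL loop.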

#LI-KP1

We are an equal opportunity employer
We know great ideas can come from anyone, anywhere. That’s why we do our best to create an open and inclusive workplace – one that welcomes everyone regardless of their background, identity, religion, age, accessibility needs, or orientation.

We process the data provided in your job application in accordance with the Recruitment Privacy Policy.

Top Skills

Compute
CUDA
DeepSpeed
FSDP
Megatron-Core
Megatron-LM
NCCL
NeMo
Nsight Systems
PyTorch
Triton
ZeRO

The Company
HQ: Prague 4 (Praha 4)
2,209 Employees
Year Founded: 2000

What We Do

JetBrains creates intelligent software development tools consistently used and trusted by 11.4 million professionals and 88 of the Fortune Global Top 100 companies. Our lineup of more than 30 products includes IDEs for most programming languages and technologies, such as IntelliJ IDEA, PyCharm, and others, as well as products for team collaboration, like YouTrack and TeamCity. JetBrains is also known for creating the Kotlin programming language, a cross-platform language used by more than 5 million developers worldwide each year and recommended by Google as the preferred language for Android development. The company is headquartered in Prague, Czech Republic, and has offices around the world.

JetBrains IDEs:
  • IntelliJ IDEA (Java and Kotlin developers)
  • PyCharm (Python developers)
  • PhpStorm (PHP developers)
  • GoLand (Go developers)
  • Rider (.NET developers)
  • CLion (C and C++ developers)
  • RustRover (Rust developers)
  • WebStorm (JavaScript and TypeScript developers)
  • RubyMine (Ruby and Rails developers)
  • DataGrip (tool for multiple databases)
  • ReSharper (extension for Visual Studio)
  • Fleet (multilingual IDE and code editor)
  • Aqua (IDE for test automation engineers)

.NET & Visual Studio:
  • Rider (IDE for .NET developers)
  • ReSharper (extension for Visual Studio)
  • ReSharper C++ (Visual Studio extension for C++ developers)
  • dotCover (.NET unit test runner and code coverage tool)
  • dotMemory (.NET memory profiler)
  • dotTrace (.NET performance profiler)
  • dotPeek (.NET decompiler and assembly browser)

Team Tools:
  • TeamCity (powerful CI out of the box)
  • YouTrack (project management for all your teams)
  • Space (intelligent code collaboration platform)
  • Datalore (collaborative data science platform)
  • Qodana (code quality platform for teams)

Programming Languages:
  • Kotlin (programming language for the JVM and Android)
  • MPS (create your own domain-specific language)

Education:
  • JetBrains Academy (learn and teach computer science)

Profile by JetBrains s.r.o.
