LLM Pre-training & Distributed Engineer (AI Infrastructure)

Posted 3 Days Ago
8 Locations
In-Office or Remote
Senior level
Agency • Artificial Intelligence • Blockchain • Web3
The Role
Design, orchestrate, and optimize large-scale LLM pre-training across 1,000+ GPUs. Implement 3D parallelism, manage GPU clusters (SLURM/Kubernetes), optimize InfiniBand/RDMA networking and memory, and automate checkpointing and failure recovery for long training runs.

We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer. This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure. The ideal candidate will have a deep understanding of GPU clusters and extensive systems engineering experience to keep training runs efficient and reliable.

Responsibilities:

  • Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
  • Optimize InfiniBand/RDMA networking, and tune memory management to prevent out-of-memory errors.
  • Automate checkpointing and failure recovery during month-long training runs.

Required Skills:

  • Deep expertise in 3D parallelism (Data, Tensor, Pipeline).
  • Experience managing SLURM or Kubernetes-based GPU clusters.
  • Strong systems engineering background (C++, CUDA, Python).


The Company
7 Employees
Year Founded: 2024

What We Do

Hyphen Connect is a Web3 and AI talent agency and crypto-integrated software solutions provider that connects blockchain, DeFi, NFT, and AI companies with specialized technical and go-to-market talent globally and remotely. They deliver headhunting, data-driven research, and recruitment services across infrastructure, exchanges, gaming, and DeFi projects, plus industry analysis and hiring insights to help clients build engineering and product teams.

