Hyphen Connect Limited

LLM Pre-training & Distributed Engineer (AI Infrastructure)

Posted 24 Days Ago

8 Locations

In-Office or Remote

Senior level

Agency • Artificial Intelligence • Blockchain • Web3

The Role

Design, orchestrate, and optimize large-scale LLM pre-training across 1,000+ GPUs. Implement 3D parallelism, manage GPU clusters (SLURM/Kubernetes), optimize InfiniBand/RDMA networking and memory, and automate checkpointing and failure recovery for long training runs.

We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer to orchestrate large-scale machine learning training runs and optimize the distributed infrastructure behind them. The ideal candidate has a deep understanding of GPU clusters and extensive systems engineering experience, and can keep long training runs efficient and reliable.

Responsibilities:

Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.
Automate checkpointing and failure recovery during month-long training runs.

Required Skills:

Deep expertise in 3D parallelism (Data, Tensor, Pipeline).
Experience managing SLURM or Kubernetes-based GPU clusters.
Strong systems engineering background (C++, CUDA, Python).

The Company

7 Employees

Year Founded: 2024

What We Do

Hyphen Connect is a Web3 and AI talent agency and crypto-integrated software solutions provider. It connects blockchain, DeFi, NFT, and AI companies with specialized technical and go-to-market talent, globally and remotely. It delivers headhunting, data-driven research, and recruitment services across infrastructure, exchanges, gaming, and DeFi projects, along with industry analysis and hiring insights that help clients build engineering and product teams.