Senior ML Training Engineer

2 Locations
In-Office
Senior level
Artificial Intelligence • Information Technology • Software
The Role
As a Senior ML Training Engineer, you will architect and implement distributed training solutions, optimize multi-GPU setups, and support customer training workflows in AI infrastructure.
Summary Generated by Built In

AION is building a next-generation AI cloud platform, transforming high-performance computing (HPC) through its decentralized AI cloud. Purpose-built for bare-metal performance, AION democratizes access to compute power for AI training, fine-tuning, inference, data labeling, and beyond.

By leveraging underutilized resources such as idle GPUs and data centers, AION provides a scalable, cost-effective, and sustainable solution tailored for developers, researchers, and enterprises.

Led by high-pedigree founders with previous exits, AION is well funded by major VCs and has strategic global partnerships. Headquartered in the US with a global presence, the company is building its initial core team in India, London, and Seattle.

Who You Are

You're an ML systems engineer who's passionate about building high-performance inference infrastructure. You don't need to be an expert in everything - this field is evolving too rapidly for that - but you have strong fundamentals and the curiosity to dive deep into optimization challenges. You thrive in early-stage environments where you'll learn cutting-edge techniques while building production systems. You think systematically about performance bottlenecks and are excited to push the boundaries of what's possible in AI infrastructure.


Requirements

Key Responsibilities
  • Architect and implement distributed training solutions for customers running pre-training, fine-tuning, and RL workloads on AION infrastructure.
  • Guide customers through large-scale training implementations including data parallelism, model parallelism, and pipeline parallelism strategies.
  • Design and optimize multi-GPU training setups with proper gradient synchronization, communication strategies, and scaling configurations.
  • Develop and optimize POCs that accelerate customer training, including efficient data loading pipelines, gradient checkpointing, and memory optimization techniques.
  • Create comprehensive monitoring and debugging frameworks for distributed training jobs with performance tracking and bottleneck resolution.
  • Conduct technical workshops and training sessions on distributed training, reasoning techniques, and post-training optimization methodologies.
  • Support customers with advanced fine-tuning workflows including reward model training, constitutional AI, and alignment techniques.
  • Troubleshoot and resolve customer training bottlenecks including scaling inefficiencies and optimization challenges.
  • Collaborate with tech and product teams to translate customer needs into platform improvements and feature requirements.
Skills & Experience
  • High-agency individual looking to own customer success and influence training platform architecture.
  • 4+ years of ML engineering experience with focus on training large-scale models and distributed systems.
  • Expert-level PyTorch experience including distributed training, DDP implementation, and multi-GPU optimization.
  • Production experience with distributed training techniques including data parallelism, model parallelism, pipeline parallelism.
  • Strong understanding of gradient synchronization and communication strategies for multi-node training.
  • Hands-on experience with large dataset handling and efficient data loading at scale.
  • Proficiency in training infrastructure tools such as Megatron-LM, DeepSpeed, FairScale, or similar frameworks.
  • Excellent communication and teaching skills with ability to explain complex technical concepts to diverse audiences.
  • Customer-facing experience in technical consulting, solutions engineering, or developer relations roles.
  • Experience with RLHF and fine-tuning pipelines including reward model training and post-training optimization.
  • Understanding of reasoning techniques including Chain-of-Thought prompting and advanced reasoning workflows.
Nice to have
  • Large-scale pre-training experience (7B+ parameters).
  • Advanced reasoning implementation (Tree-of-Thought, self-consistency).
  • DPO and constitutional AI expertise.
  • Open-source contributions to training frameworks.
  • Conference speaking or technical evangelism experience.


Benefits
  • Join the ground floor of a mission-driven AI startup revolutionizing compute infrastructure.
  • Work with a high-caliber, globally distributed team backed by major VCs.
  • Competitive compensation and benefits.
  • Fast-paced, flexible work environment with room for ownership and impact.
  • Hybrid model: 3 days in-office, 2 days remote with flexibility to work remotely for part of the year.

If you have any questions about the role, please reach out to the hiring manager on LinkedIn or X.

Top Skills

DeepSpeed
FairScale
Megatron-LM
PyTorch

The Company
HQ: San Francisco, California
21 Employees
Year Founded: 2023

What We Do

Everyday AI Platform: aion collapses the entire AI development lifecycle into a single, unified workspace. From data to deployment, everything is at your fingertips. aion simplifies AI infrastructure the way Stripe simplified payments:

Plug-and-Play Multi-Provider Access
Customer Infrastructure Management
Deploy and optimize AI infrastructure via prompts with integrated cost tracking and performance analytics
Partner Sales & Resource Optimization
Track opportunities with confidential pricing, manage real-time inventory allocation, and monitor profitability from aion workloads
