Machine Learning Engineer — Training Optimization

Reposted 24 Days Ago
Hiring Remotely in World Golf Village, FL, USA
In-Office or Remote
Mid level
Artificial Intelligence • Information Technology • Software
The Role
The ML Engineer will optimize large-scale model training pipelines, improve distributed training strategies, build robust infrastructure, and collaborate on training techniques and performance metrics.
About the Role

We’re looking for an ML Engineer focused on training optimization to help us scale and improve large-scale model training. You’ll work at the intersection of research and production, optimizing training pipelines for speed, stability, and cost—while collaborating closely with researchers pushing model architecture and capability forward.

This is a high-impact role with real ownership: your work directly affects how fast we can iterate, how large we can scale, and how efficiently we deploy new models.

What You’ll Do
  • Optimize large-scale model training pipelines (throughput, convergence, stability, and cost)

  • Improve distributed training strategies (data, model, and pipeline parallelism)

  • Tune optimizers, schedulers, batch sizing, and precision (bf16 / fp16 / fp8)

  • Reduce training time and compute cost via profiling, bottleneck analysis, and systems-level improvements

  • Collaborate with researchers on architecture-aware training strategies

  • Build and maintain robust training infrastructure (checkpointing, fault tolerance, reproducibility)

  • Evaluate and integrate new training techniques (e.g. gradient checkpointing, ZeRO, FSDP, custom kernels)

  • Own training performance metrics and continuously push them forward

What We’re Looking For
  • Strong experience training large neural networks (LLMs or similarly large models)

  • Hands-on experience with training optimization (not just model usage)

  • Solid understanding of:

    • Backpropagation, optimization algorithms, and training dynamics

    • Distributed systems for ML training

  • Experience with PyTorch (required)

  • Comfort working close to hardware (GPUs, memory, networking constraints)

  • Ability to move fluidly between research ideas and production-ready code

Nice to Have
  • Experience with large-scale distributed training (multi-node, multi-GPU)

  • Familiarity with DeepSpeed, FSDP, Megatron, or custom training stacks

  • Experience optimizing training on AMD or NVIDIA GPUs

  • Contributions to open-source ML infrastructure or research codebases

  • Exposure to non-Transformer architectures (RNNs, hybrid models, etc.)

Why Join Us
  • Real ownership at Series-A stage — your work shapes the company’s trajectory

  • Work on cutting-edge models and training systems at scale

  • Small, highly technical team with fast feedback loops

  • Strong emphasis on engineering quality and research rigor

  • Competitive compensation + meaningful equity


The Company
HQ: San Francisco, California
20 Employees
Year Founded: 2023

What We Do

We enable serverless inference through our GPU orchestration and model load-balancing system. We unlock fine-tuning by letting organizations size their server fleet to throughput needs rather than to the number of models in their catalogue. See it in action on our public cloud, which offers inference for 10k+ open-weight models.
