Training Infrastructure Engineer

2 Locations
Hybrid
Senior level
Artificial Intelligence • Music • Software
The Role
Build and optimize the full ML training stack: profile and debug GPU operations, improve throughput, design data pipelines and distributed training, manage SLURM clusters, and set up experiment tracking and versioning to scale generative model training.

Mirelo AI is building the next generation of creative tools by generating realistic sound, speech and music from video.

We develop cutting-edge foundational generative AI models that "unmute" silent video content and create custom, hyper-realistic audio for gaming, video platforms, and creators. Our technology empowers global storytellers to transform their content.

We recently closed a $41 million Seed round co-led by Andreessen Horowitz and Index Ventures with participation from Atlantic, and are rapidly expanding across Product, Engineering, Go-to-Market, and Growth.
About the Role

In this role, you’ll focus on the full training stack: profiling GPU behavior, debugging training pipelines, improving throughput, choosing the right parallelism strategies, and designing the infrastructure that lets us train models efficiently at scale. You’ll work across cluster management, model training, efficient data pipelines for video and audio, inference, and PyTorch code optimization. Your work will shape the foundation on which all of our generative models are built and iterated.

Key Responsibilities
  • Find ideal training strategies (parallelism approaches, precision trade-offs) for a variety of model sizes and compute loads

  • Profile, debug, and optimize single- and multi-GPU operations using tools like Nsight and stack trace viewers to understand what's actually happening at the hardware level

  • Analyze and improve the whole training pipeline end to end (efficient data storage, data loading, distributed training, checkpoint/artifact saving, logging, …)

  • Set up scalable systems for experiment tracking, data/model versioning, and experiment insights

  • Design, deploy and maintain large-scale ML training clusters running SLURM for distributed workload orchestration

Ideal Candidate Profile
  • Familiarity with the latest and most effective techniques in optimizing training and inference workloads—not from reading papers, but from implementing them

  • Deep understanding of GPU memory hierarchy and computation capabilities—knowing what the hardware can do theoretically and what prevents us from achieving it

  • Experience optimizing for both memory-bound and compute-bound operations and understanding when each constraint matters

  • Expertise with efficient attention algorithms and their performance characteristics at different scales

Nice to Have
  • Experience implementing custom GPU kernels and integrating them into PyTorch

  • Experience with diffusion and autoregressive models and understanding of their specific optimization challenges

  • Familiarity with high-performance storage solutions (VAST, blob storage) and understanding of their performance characteristics for ML workloads

  • Experience managing SLURM clusters at scale

Why Join?
  • Join at a pivotal moment. We've secured fresh funding and are gaining traction; now is when your contributions can make a real difference to our success.

  • True ownership from day one. You'll have genuine autonomy and responsibility. Your ideas and work will directly shape our product and company direction.

  • Competitive compensation and equity. We offer strong packages that ensure you share in the success you help create.

  • Build for the next generation of creators. Be part of the innovation that will transform how creators work and thrive.

We welcome applications from all individuals, regardless of ethnic origin, gender, disability, religion or belief, age, or sexual orientation and identity.

Skills Required

  • Experience optimizing training and inference workloads (implementing techniques, not just reading papers).
  • Deep understanding of GPU memory hierarchy and compute capabilities.
  • Experience optimizing memory-bound and compute-bound operations.
  • Expertise with efficient attention algorithms and their performance characteristics.
  • Experience profiling and debugging GPU workloads using tools like Nsight and stack trace viewers.
  • Experience optimizing PyTorch code and integrating performance improvements into training pipelines.
  • Experience implementing custom GPU kernels and integrating them into PyTorch.
  • Experience with diffusion and autoregressive models and their optimization challenges.
  • Familiarity with high-performance storage solutions (e.g., VAST, blob storage) for ML workloads.
  • Experience designing, deploying, and maintaining large-scale ML training clusters running SLURM.
  • Experience with scalable experiment tracking, data/model versioning, and experiment insights systems.

The Company
14 Employees

What We Do

We believe every frame deserves a sound. Mirelo builds the foundational AI audio layer for video. We are making the connection between video and sound seamless, expressive, and fully generative. Not as a tool. As a new creative medium. European AI. Backed by a16z and Index Ventures.
