ML Cluster Operations Engineer

Posted 4 Days Ago
Las Vegas, NV
In-Office
Senior level
Artificial Intelligence • Cloud • Software
The Role
Manage distributed machine learning workloads using Slurm and Kubernetes, keep cluster operations healthy, and mentor engineers in best practices.

ML Cluster Operations Engineer (Slurm / K8s)

At TensorWave, we’re leading the charge in AI compute, building a versatile cloud platform that’s driving the next generation of AI innovation. We’re focused on creating a foundation that empowers cutting-edge advancements in intelligent computing, pushing the boundaries of what’s possible in the AI landscape.

About the Role:

We are seeking an exceptional Machine Learning Engineer who has made training and AI workload scheduling a specialty. This is a senior-level role for someone who has significant experience managing distributed machine learning workloads at scale using Slurm and/or Kubernetes.

As a technical visionary and hands-on expert, you will lead the evolution of our managed Slurm and Kubernetes offerings, as well as internal health checking and cluster automation.

Key Responsibilities:

  • Manage and iterate on our containerized Slurm (Slurm-in-Kubernetes) solution, including customer configuration and deployment.

  • Work closely with our engineering team to develop and maintain CI and automation for managed offerings.

  • Ensure healthy cluster operations and uptime by implementing active and passive health checks, including automated node draining and triage.

  • Help profile and debug distributed workloads, from small inference jobs to cluster-wide training.

  • Establish best practices for running jobs at scale, such as monitoring and checkpointing.

  • Mentor and upskill ML engineers in best practices.

Qualifications:

Must-Have:

  • 5+ years of experience in cloud infrastructure, HPC, or machine learning roles.

  • Significant hands-on experience with Slurm in production HPC/ML environments, including an understanding of setup/configuration, enroot/Pyxis, modules, and MPI.

  • Strong knowledge of distributed ML languages and frameworks, such as Python, PyTorch, Megatron, c10d, and MPI.

  • Understanding of node lifecycle, including health checks, prolog / epilog scripts, and draining.

  • Deep understanding of security, compliance, and resilience in containerized workloads.

Nice-to-Have:

  • 3+ years of hands-on Kubernetes experience, including deep knowledge of the Kubernetes API, internals, networking, and storage.

  • Proficiency in writing Kubernetes manifests, Helm charts, and managing releases.

  • Experience with DAGs using Kubernetes-native tools such as Argo Workflows.

  • Foundation in networking, especially as it pertains to RDMA, RoCE, and InfiniBand.

  • Experience with low-level kernel libraries, such as CUDA and Composable Kernel.

  • Contributions to open-source projects or ML/AI tooling.

What Success Looks Like

  • A production-grade integrated Slurm platform that can support thousands of GPUs, with self-healing, scaling, and strong observability.

  • Infrastructure is resilient, secure, resource-optimized, and compliant.

  • Best practices and tooling are well-documented, standardized, and continuously improved across the company.

  • Make GPUs go Brrrrrrr

What We Bring:

Stock Options

100% paid Medical, Dental, and Vision insurance

Life and Voluntary Supplemental Insurance

Short Term Disability Insurance

Flexible Spending Account

401(k)

Flexible PTO

Paid Holidays

Parental Leave

Mental Health Benefits through Spring Health


Top Skills

c10d
Kubernetes
Megatron
MPI
Python
PyTorch
Slurm

The Company
HQ: Las Vegas, Nevada
56 Employees

What We Do

TensorWave is a cutting-edge cloud platform designed specifically for AI workloads. Offering AMD MI300X accelerators and a best-in-class inference engine, TensorWave is a top choice for training, fine-tuning, and inference. Visit tensorwave.com to learn more.
Send us a message to try it for free.
