ML Solutions Engineer

Posted 6 Days Ago
Las Vegas, NV
In-Office
Senior level
Artificial Intelligence • Cloud • Software

ML Solutions Engineer (ROCm Portability)

At TensorWave, we’re leading the charge in AI compute, building a versatile cloud platform that’s driving the next generation of AI innovation. We’re focused on creating a foundation that empowers cutting-edge advancements in intelligent computing, pushing the boundaries of what’s possible in the AI landscape.

About the Role:

We are seeking an exceptional ML Solutions Engineer who specializes in GPU portability and performance optimization. This is a senior-level role for someone who has significant experience with CUDA, ROCm, and kernel development, and is passionate about enabling workloads to run efficiently on AMD hardware.

As a technical expert, you will help migrate and optimize CUDA-based workloads to ROCm, working with both internal teams and third-party developers. You will play a critical role in advancing our ROCm enablement strategy and driving adoption across the ecosystem.
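The first step of a CUDA-to-ROCm port is often mechanical: AMD's HIPIFY tools rewrite CUDA runtime calls to their HIP equivalents (cudaMalloc → hipMalloc, and so on). As a rough illustration of the kind of renaming involved, here is a toy sketch; it is not the real tool, and the mapping table is a small hand-picked subset of the API.

```python
# Toy sketch of the mechanical renaming a tool like hipify-perl performs.
# NOT the real tool: it covers only a hand-picked subset of the CUDA runtime
# API and ignores kernel-launch syntax (<<<...>>>), headers, and library
# calls (cuBLAS, cuDNN, ...) that a real port must also handle.

CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaError_t": "hipError_t",
    "cudaSuccess": "hipSuccess",
}

def toy_hipify(source: str) -> str:
    """Apply the subset mapping; longest names first so that prefixes
    (e.g. cudaMemcpy vs. cudaMemcpyHostToDevice) don't clobber each other."""
    for cuda_name in sorted(CUDA_TO_HIP, key=len, reverse=True):
        source = source.replace(cuda_name, CUDA_TO_HIP[cuda_name])
    return source

snippet = "cudaMalloc(&d_x, n); cudaMemcpy(d_x, h_x, n, cudaMemcpyHostToDevice);"
print(toy_hipify(snippet))
# hipMalloc(&d_x, n); hipMemcpy(d_x, h_x, n, hipMemcpyHostToDevice);
```

Because HIP mirrors the CUDA runtime API so closely, most ports are this mechanical; the hard work in the role is what comes after, namely, profiling and tuning the result on AMD hardware.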

Key Responsibilities:

  • Partner with customers, internal engineering, and third-party developers to migrate CUDA workloads to ROCm.

  • Profile, debug, and optimize GPU kernels for performance, scalability, and efficiency.

  • Contribute to ROCm enablement across open source ML frameworks and libraries.

  • Leverage tools such as Composable Kernel, HIP, PyTorch/XLA, and RCCL to enable and tune distributed training workloads.

  • Provide technical guidance on best practices for GPU portability, including kernel-level optimizations, mixed precision, and memory hierarchy usage.

  • Act as a technical liaison, translating customer requirements into actionable engineering work.

  • Create internal documentation, playbooks, and training material to scale knowledge across teams.

  • Represent TensorWave in the broader ROCm ecosystem through contributions, collaboration, and customer advocacy.
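Much of the profiling and tuning work above starts with one question: is a kernel memory-bound or compute-bound? A back-of-the-envelope roofline check makes that concrete. The peak figures below (~1307 TFLOPS dense FP16, ~5.3 TB/s HBM bandwidth) are vendor-published MI300X numbers used purely as illustrative assumptions.

```python
# Back-of-the-envelope roofline check: is a kernel memory- or compute-bound?
# Peak numbers are illustrative assumptions based on vendor-published
# MI300X figures (~1307 TFLOPS dense FP16, ~5.3 TB/s HBM bandwidth).

PEAK_FLOPS = 1307e12   # FP16 dense peak, FLOP/s (assumption)
PEAK_BW = 5.3e12       # HBM bandwidth, bytes/s (assumption)

def attainable_flops(flops: float, bytes_moved: float) -> float:
    """Roofline model: performance is capped by either the compute roof
    or the memory roof, whichever the kernel hits first."""
    intensity = flops / bytes_moved          # arithmetic intensity, FLOP/byte
    return min(PEAK_FLOPS, intensity * PEAK_BW)

def bound_by(flops: float, bytes_moved: float) -> str:
    return "compute" if attainable_flops(flops, bytes_moved) >= PEAK_FLOPS else "memory"

# FP16 elementwise add: 1 FLOP per 6 bytes (read a, b; write c).
print(bound_by(flops=1, bytes_moved=6))                      # memory
# Large FP16 GEMM: 2*N^3 FLOPs over ~3*N^2*2 bytes of ideal traffic.
n = 4096
print(bound_by(flops=2 * n**3, bytes_moved=3 * n * n * 2))   # compute
```

Memory-bound kernels call for fusion, better data layouts, and smarter memory-hierarchy use; compute-bound kernels call for mixed precision and matrix-core utilization, which is exactly the split of techniques the responsibilities above describe.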

Qualifications:

Must-Have:

  • 5+ years of experience in GPU programming, ML infrastructure, or HPC roles.

  • Strong hands-on experience with CUDA, HIP, and ROCm.

  • Proficiency in kernel development (e.g., CUDA, HIP, Composable Kernel, Triton).

  • Deep knowledge of GPU performance profiling tools (Nsight, rocprof, perf, etc.).

  • Understanding of distributed ML workloads (e.g., PyTorch Distributed, MPI, RCCL).

  • Proven ability to work in customer-facing technical roles, including solution design and workload migration.

  • Strong programming skills in Python, C++, and GPU kernel languages.

Nice-to-Have:

  • Contributions to ROCm-enabled open source ML frameworks (PyTorch, Megatron, vLLM, SGLang, etc.).

  • Familiarity with compiler technology (LLVM, MLIR, XLA).

  • Experience with containerized environments and Kubernetes for GPU workloads.

  • Knowledge of performance modeling for multi-GPU and multi-node workloads.

  • Familiarity with AI/ML workload benchmarking and tuning at scale.

  • Foundation in networking, especially as it pertains to RDMA, RoCE, and InfiniBand.

What Success Looks Like:

  • Customers migrate their CUDA workloads to ROCm and see measurable performance gains.

  • Strong collaboration between internal engineering and external developers leads to faster enablement of ROCm workloads.

  • Best practices, playbooks, and tooling are well-documented and continuously improved.

  • Make GPUs go Brrrrrrr.

What We Bring:

  • Stock Options

  • 100% paid Medical, Dental, and Vision insurance

  • Life and Voluntary Supplemental Insurance

  • Short Term Disability Insurance

  • Flexible Spending Account

  • 401(k)

  • Flexible PTO

  • Paid Holidays

  • Parental Leave

  • Mental Health Benefits through Spring Health

Top Skills:

  • C++

  • Composable Kernel

  • CUDA

  • GPU Programming

  • HIP

  • Kernel Development

  • Python

  • PyTorch/XLA

  • RCCL

  • ROCm

The Company
HQ: Las Vegas, Nevada
56 Employees

What We Do

TensorWave is a cutting-edge cloud platform designed specifically for AI workloads. Offering AMD MI300X accelerators and a best-in-class inference engine, TensorWave is a top choice for training, fine-tuning, and inference. Visit tensorwave.com to learn more.
Send us a message to try it for free.

