Machine Learning Engineer — Inference Optimization

Reposted 19 Days Ago
Hiring Remotely in World Golf Village, FL
In-Office or Remote
Mid level
Artificial Intelligence • Information Technology • Software
The Role
Optimize inference latency and throughput for large-scale ML models, collaborating on performance tuning, and building inference-serving systems.
Summary Generated by Built In
About the Role

We’re looking for a Machine Learning Engineer to own and push the limits of model inference performance at scale. You’ll work at the intersection of research and production—turning cutting-edge models into fast, reliable, and cost-efficient systems that serve real users.

This role is ideal for someone who enjoys deep technical work, profiling systems down to the kernel/GPU level, and translating research ideas into production-grade performance gains.

What You’ll Do
  • Optimize inference latency, throughput, and cost for large-scale ML models in production

  • Profile and bottleneck GPU/CPU inference pipelines (memory, kernels, batching, IO)

  • Implement and tune techniques such as:

    • Quantization (fp16, bf16, int8, fp8)

    • KV-cache optimization & reuse

    • Speculative decoding, batching, and streaming

    • Model pruning or architectural simplifications for inference

  • Collaborate with research engineers to productionize new model architectures

  • Build and maintain inference-serving systems (e.g. Triton, custom runtimes, or bespoke stacks)

  • Benchmark performance across hardware (NVIDIA / AMD GPUs, CPUs) and cloud setups

  • Improve system reliability, observability, and cost efficiency under real workloads

What We’re Looking For
  • Strong experience in ML inference optimization or high-performance ML systems

  • Solid understanding of deep learning internals (attention, memory layout, compute graphs)

  • Hands-on experience with PyTorch (or similar) and model deployment

  • Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations)

  • Experience scaling inference for real users (not just research benchmarks)

  • Comfortable working in fast-moving startup environments with ownership and ambiguity

Nice to Have
  • Experience with LLM or long-context model inference

  • Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton)

  • Experience optimizing across different hardware vendors

  • Open-source contributions in ML systems or inference tooling

  • Background in distributed systems or low-latency services

Why Join Us
  • Real ownership over performance-critical systems

  • Direct impact on product reliability and unit economics

  • Close collaboration with research, infra, and product

  • Competitive compensation + meaningful equity at Series A

  • A team that cares about engineering quality, not hype

Top Skills

Cuda
Ml Inference Optimization
Onnx Runtime
PyTorch
Tensorrt
Triton
Am I A Good Fit?
beta
Get Personalized Job Insights.
Our AI-powered fit analysis compares your resume with a job listing so you know if your skills & experience align.

The Company
HQ: San Francisco, California
20 Employees
Year Founded: 2023

What We Do

We enable serverless inference via our GPU orchestration and model load-balancing system. We unlock fine-tuning by enabling organizations to size their server fleet to throughput needs, not number of models in the catalogue.

See it in action on our public cloud, which offers inference for 10k+ open weight models.

Similar Jobs

SharkNinja Logo SharkNinja

Quality Engineer - Indoor Heated

Beauty • Robotics • Design • Appliances • Manufacturing
Remote
United States
4000 Employees
60K-115K Annually

SharkNinja Logo SharkNinja

Account Manager

Beauty • Robotics • Design • Appliances • Manufacturing
Remote
United States
4000 Employees
84K-155K Annually

SharkNinja Logo SharkNinja

Sr. Manager - Product Development - Outdoor

Beauty • Robotics • Design • Appliances • Manufacturing
Remote
United States
4000 Employees
105K-187K Annually

SharkNinja Logo SharkNinja

Account Manager

Beauty • Robotics • Design • Appliances • Manufacturing
Remote
United States
4000 Employees
105K-167K Annually

Similar Companies Hiring

Fairly Even Thumbnail
Software • Sales • Robotics • Other • Hospitality • Hardware
New York, NY
Bellagent Thumbnail
Artificial Intelligence • Machine Learning • Business Intelligence • Generative AI
Chicago, IL
20 Employees
Kepler  Thumbnail
Fintech • Software
New York, New York
6 Employees

Sign up now Access later

Create Free Account

Please log in or sign up to report this job.

Create Free Account