LLM Inference Engineer

Posted 12 Days Ago
Los Altos, CA, USA
In-Office
Mid level
Software
The Role
Own and optimize the end-to-end LLM serving stack on Majestic hardware: port frameworks (vLLM/SGLang); implement batching, scheduling, paged KV cache, distributed inference, multi-modal preprocessing, speculative decoding, and prefix caching; and profile and eliminate bottlenecks across runtime, compiler, and hardware.
Summary Generated by Built In
Description

The Role

In this high-impact role, you are the bridge between cutting-edge custom silicon and production-grade AI. You will own the end-to-end LLM serving stack on Majestic hardware, architecting everything from serving APIs down to KV cache management, batching, and scheduling. Your primary mission is to port leading frameworks like vLLM and SGLang to our accelerator and optimize them for peak performance. Because our architecture offers memory headroom, you won't just match traditional GPUs; you will shatter their limits on throughput, batch sizes, and context lengths. As you hunt down bottlenecks, your insights will directly steer our future kernel, compiler, and hardware development. 

What You'll Own

  • The serving stack, end to end — bring up and adapt a modern inference framework (vLLM, SGLang, or similar) to run on Majestic hardware.
  • The runtime hot path — continuous batching, the scheduler, paged KV cache, and prefill/decode disaggregation.
  • Distributed inference at scale — tensor, pipeline, and expert parallelism across accelerators, wired into our collective communication library (CCL).
  • The multi-modal pipeline — image, audio, and video preprocessing, encoder integration, and mixed-modality batching.
  • Inference-time techniques — speculative decoding, prefix caching, and structured decoding.
  • End-to-end performance — profile, benchmark, and hunt down bottlenecks across the full serving path, feeding findings back to the kernel, compiler, and hardware teams.
Requirements

What We're Looking For

  • 3+ years building or operating production LLM inference and serving systems (5+ preferred).
  • Deep, hands-on work with a modern inference framework (vLLM, SGLang, TensorRT-LLM, Fireworks, or similar), including its scheduler, paged attention / KV cache, model executor, and backend integration points.
  • Strong Python and C++, with the ability to move fluidly between the two.
  • A real grasp of transformer inference: the prefill/decode split, KV cache behavior, and how batching dynamics shape latency and throughput.
  • Distributed inference experience: tensor and pipeline parallelism across multiple devices.
  • An instinct for performance: you can profile an end-to-end stack and chase a regression from the serving API all the way down to the kernel.

Skills Required

  • 3+ years building or operating production LLM inference and serving systems (5+ preferred)
  • Deep, hands-on work with modern inference frameworks (vLLM, SGLang, TensorRT-LLM, Fireworks, or similar) including scheduler, paged attention / KV cache, model executor, and backend integration
  • Strong Python and C++ skills
  • Solid understanding of transformer inference, prefill/decode split, KV cache behavior, and batching dynamics affecting latency and throughput
  • Distributed inference experience (tensor and pipeline parallelism across multiple devices)
  • Ability to profile, benchmark, and chase regressions across the full serving path down to kernel level

The Company
55 Employees
Year Founded: 2023

What We Do

Majestic Labs is reimagining AI infrastructure for the world’s most demanding workloads. Today, organizations are forced to overprovision expensive compute just to get the memory their models need. We took a fundamentally different approach, pairing a massive amount of compute with 1000x the memory to deliver game-changing improvements in performance, power, and deployment efficiency. Our customers can replace racks of traditional AI infrastructure with a single Majestic server.
