ElastixAI

AI Software Engineer

Reposted 6 Days Ago

Seattle, WA, USA

Hybrid

Mid level

Artificial Intelligence • Hardware • Machine Learning • Generative AI

The Role

Design and optimize a low-level AI inference serving stack: customize open-source frameworks, build model partitioning/scheduling, integrate with proprietary accelerators, profile and optimize across Python orchestration to C++ kernels and drivers, and enable PyTorch-native deployment tooling.

Summary Generated by Built In

About ElastixAI

We are building the next-gen AI inference platform.

Description

Job Title: Software Engineer, AI Inference Platform

Company: ElastixAI, Inc.

Location: Seattle, WA (Hybrid - 3 days/week in office)

About ElastixAI

ElastixAI is an early-stage startup building the next-generation AI inference infrastructure — co-designed across ML software and custom accelerator hardware. Our platform dynamically optimizes inference efficiency and scalability across diverse deployments, enabling adaptive, high-performance AI serving.

Role Summary

We’re looking for a systems-minded AI Software Engineer to join our core inference platform team. You’ll design and extend the low-level serving stack — hacking open-source frameworks like vLLM, SGLang, and TensorRT-LLM, building new model sharding and scheduling logic, and integrating deeply with our proprietary AI accelerator. This role sits at the intersection of ML systems, compiler/runtime engineering, and hardware-software co-design.

Key Responsibilities

Architect, extend, and optimize core components of our AI serving platform for throughput, latency, and scalability.
Customize open-source serving frameworks (e.g., vLLM) for proprietary model ingestion and accelerator integration.
Develop efficient model partitioning, scheduling, and memory management strategies for multi-device inference.
Collaborate with ML engineers on model export and runtime optimization (quantization, graph transforms).
Work closely with hardware engineers to influence accelerator interface design and performance tuning.
Build APIs and runtime tools enabling flexible, PyTorch-native model deployment on our infrastructure.
Profile, debug, and optimize across the full stack — from Python orchestration to C++ kernels and PCIe drivers.

Required Qualifications

BS/MS/PhD in Computer Science, Electrical/Computer Engineering, or related field.
3+ years of professional experience in systems programming, ML infrastructure, or distributed inference.
Proficient in C++ and Python, with strong debugging and performance analysis skills.
Deep familiarity with one or more LLM serving frameworks (vLLM, SGLang, TensorRT-LLM, DeepSpeed-Inference, etc.).
Understanding of model deployment internals — token scheduling, KV caching, batching, and pipelined inference.
Comfortable working close to the hardware abstraction layer — CUDA, PCIe, memory management, or runtime scheduling.
Strong collaboration and communication skills; ability to work cross-functionally in a fast-paced startup environment.

Preferred / Bonus

Experience with hardware-aware ML optimization, compiler/runtime integration, or accelerator SDKs.
Hands-on experience profiling GPU/accelerator workloads.
Familiarity with containerized deployments (Docker/Kubernetes).
Exposure to distributed systems or large-scale inference clusters.
Contributions to open-source ML or serving frameworks.

What We Offer

A chance to be a foundational engineer in an innovative AI startup
A dynamic and collaborative work environment and the chance to have a significant impact on new technology
The opportunity to work on challenging problems at the intersection of ML, software, and systems
Competitive compensation and startup equity package
Comprehensive medical, dental, and vision coverage (100% paid by employer)
Life insurance and AD&D
Flexible Time Off (FTO)
12 paid holidays
Paid parental leave
Gym or fitness benefit
Commuter benefit
Weekly catered lunches in the office
Investment in employee learning & development

The Company

Year Founded: 2007

What We Do

ElastixAI delivers elastic, cost-efficient AI inference by co-designing machine learning models, system software, and reconfigurable hardware as a unified architecture. Their mission is to enable adaptable and cost-efficient GenAI inference infrastructure, driving breakthroughs and making Artificial Super Intelligence accessible to everyone. By removing inefficiencies at every layer of the stack, they achieve lower total cost of ownership and reduced power consumption per token.