We are on a mission to reinvent how designers work in the AI era. We’re backed by top investors including First Round, Chemistry, Homebrew, Scribble and senior leaders from OpenAI, Meta, Google, Ramp, Stripe and more. We’re building the next-generation AI design tool for product teams.
About the Role
We’re hiring an AI Platform Engineer to own how our models run in production. You’ll build the inference stack that delivers sub-second responses to designers at scale, optimize latency and cost, and own the reliability of every AI capability in the product. This is the role for someone who lives in serving infrastructure and treats GPU utilization like a craft.
You’ll own the platform layer end-to-end: serving, autoscaling, observability, deployment, and the cost-and-latency economics of running models at scale.
What You’ll Do
Architect and operate the inference platform: serving stack, autoscaling, multi-tenancy, observability
Optimize end-to-end latency (TTFT, TPOT, p95) with quantization, batching, KV-cache management, and speculative decoding
Design multi-GPU parallelism strategies (DP / TP / PP) and own GPU utilization and cost economics
Build a hybrid local + cloud serving architecture — small models on the user’s Mac for fast paths, larger models in the cloud for slow paths
Own canary deployment, rollback automation, and SLO/SLA-driven reliability for all AI features
Build production observability: latency, drift, quality, and cost dashboards
Evaluate and integrate inference engines (vLLM, Triton, TGI, TensorRT, MLX) for cloud and on-device paths
Take fine-tuned models from research artifacts to production traffic
8+ years of software engineering experience
2+ years deploying ML or LLM systems at production scale
Deep, demonstrable experience with one or more inference serving systems (vLLM, Triton, TGI, TensorRT, ONNX Runtime)
Concrete production wins on latency and throughput engineering (p50/p95/p99, GPU utilization, cost-per-token)
Reliability engineering depth: canary deployment, rollback, SLO-driven ops, on-call readiness
Cloud and Kubernetes-based ML deployment experience
Multi-GPU parallelism experience (FSDP, DDP, TP, PP) a strong plus
On-device inference experience (MLX, Core ML, ONNX Runtime on consumer hardware)
Production experience with quantization, distillation, and mixed-precision inference
Experience with streaming inference and real-time AI UX
Background running inference at startup scale — comfortable with cost-per-user economics, not just raw throughput
The inference platform powering every AI feature in the product
Sub-second response paths for high-frequency design actions
A hybrid local + cloud serving architecture, with intelligent routing between fast and slow paths
Observability infrastructure: latency, drift, quality, and cost
Multi-model orchestration with on-device fast paths and cloud slow paths
Reliable, measurable, real-time streaming AI experiences
Base salary: $300,000-$400,000
Equity: Meaningful stock options
Health Insurance: Best-in-class coverage for the employee and their entire family
Location: San Francisco HQ
What We Do
Noon is an AI-native product design platform that provides a dual-canvas tool for product designers. By integrating design and production-ready code, it eliminates the gap between the two, allowing designers to create, iterate, build, test, and ship products directly from a single canvas. Founded in 2024, the company aims to redefine product design workflows through AI-driven, code-centric solutions that work in seconds rather than minutes.








