Head of AI Inference & MLOps


Location: Austin, Texas area / On-site preferred
Project: 7MW Phase I AI Datacenter → 50MW Campus Expansion
Reports to: Founders / Executive Team

About the Project

We are building a high-density AI datacenter campus outside Austin, Texas, beginning with approximately 7MW of NVIDIA GB300 NVL72 infrastructure and scaling to 50MW+. The initial deployment is designed around real-time inference, reasoning, and high-value AI serving workloads, with a focus on monetizing capacity in live markets rather than simply leasing powered space.

This is not a traditional datacenter operations role.

We are hiring the person who will make the racks make money.

This leader will own the strategy and execution required to turn rack-scale GPU infrastructure into a profitable inference business: selecting the right models, runtimes, orchestration stack, routing layer, pricing strategy, customer segments, and marketplace relationships to maximize revenue, uptime, and utilization.

The right candidate understands that raw compute is not the business. Monetized tokens, latency-adjusted utilization, and gross margin are the business.

The Role

We need a senior operator-builder who can sit at the intersection of:

  • AI infrastructure

  • inference performance engineering

  • model serving and routing

  • marketplace monetization

  • customer / partner integration

  • revenue optimization

You will design and run the inference platform that determines how our GB300 NVL72 racks are monetized in the real-time market. That may include direct enterprise workloads, marketplace distribution, API-based reselling, model hosting, fine-tuned/private deployments, and emerging inference channels.

You should know what makes money on modern inference hardware, what does not, and why.

You should be able to answer questions like:

  • Which open-weight and commercial-compatible models should run on this hardware first?

  • How should workloads be split between premium low-latency serving, bulk throughput, reserved capacity, and experimental capacity?

  • Should we route through third-party marketplaces, sell directly, or do both?

  • What software stack gives us the best performance per watt, per GPU, and per dollar of capex?

  • How do we maximize realized revenue rather than theoretical benchmark performance?

  • How do we scale from a 7MW launch to a repeatable 50MW AI factory operating model?
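The realized-revenue question above can be made concrete with back-of-the-envelope math. The sketch below is illustrative only; the throughput, price, and utilization figures are assumed, not project numbers:

```python
def revenue_per_gpu_hour(tokens_per_sec: float, price_per_m_tokens: float,
                         utilization: float) -> float:
    """Realized revenue for one GPU-hour: sustained token throughput,
    times the market price per million tokens, discounted by the
    fraction of the hour actually serving paid traffic."""
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return tokens_per_hour / 1e6 * price_per_m_tokens

# Illustrative: 2,500 tok/s sustained, $0.80 per 1M tokens, 60% paid utilization
print(round(revenue_per_gpu_hour(2500, 0.80, 0.60), 2))  # 4.32
```

The gap between theoretical benchmark throughput and the `utilization`-discounted figure is exactly the gap this role is hired to close.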

What You’ll Own

  • Build and lead the inference monetization strategy for our first 7MW deployment and expansion to 50MW

  • Define the technical and commercial operating model for turning GB300 NVL72 racks into revenue-producing assets

  • Evaluate and implement the model serving stack, scheduling layer, inference engine, observability stack, and API platform

  • Select and optimize the mix of workloads across:

    • real-time inference

    • reasoning workloads

    • premium low-latency API traffic

    • batch / overflow workloads

    • dedicated enterprise deployments

    • private/fine-tuned model hosting

  • Identify the best go-to-market channels for capacity monetization, including direct sales and marketplace/API distribution partners

  • Develop strategy for integration with platforms such as OpenRouter-style aggregation layers, OpenAI-compatible endpoints, and other inference distribution channels where appropriate. OpenRouter provides a unified API and provider-aggregation layer, and Inference.net offers an OpenAI-compatible API for model access and deployment; both are relevant examples of the ecosystem this role would evaluate.

  • Own benchmarking methodology based on actual profit and production metrics, not vanity metrics

  • Drive workload placement decisions based on revenue per rack, revenue per GPU-hour, revenue per MW, latency targets, and customer value

  • Partner with datacenter engineering, networking, and facilities teams to ensure the physical plant supports the intended software monetization strategy

  • Build pricing, SLAs, utilization strategy, and customer segmentation framework

  • Create dashboards and control systems for:

    • utilization

    • queue health

    • latency

    • token throughput

    • margin by workload

    • failure rate

    • realized revenue by cluster / rack / model / customer

  • Lead decisions around multi-tenant vs single-tenant deployments, reserved vs on-demand capacity, and when to prioritize direct contracts over marketplace traffic

  • Build and manage the team required to scale this function over time
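The placement calculus described in the bullets above (revenue per GPU-hour vs. margin by workload) can be sketched as a simple ranking. All prices, power costs, and workload names below are hypothetical, for illustration only:

```python
# Rank candidate workloads by gross margin per GPU-hour, not raw throughput.
# All figures are hypothetical, for illustration only.
WORKLOADS = {
    # name: (revenue_per_gpu_hour, power_kw_per_gpu, other_cost_per_hour)
    "premium_low_latency": (6.00, 1.2, 0.50),
    "bulk_throughput":     (3.50, 1.4, 0.10),
    "experimental":        (0.00, 1.0, 0.05),
}
POWER_PRICE_PER_KWH = 0.06  # hypothetical all-in energy cost, $/kWh

def margin_per_gpu_hour(revenue: float, power_kw: float, other_cost: float) -> float:
    """Gross margin per GPU-hour after energy and other direct costs."""
    return revenue - power_kw * POWER_PRICE_PER_KWH - other_cost

ranked = sorted(WORKLOADS.items(),
                key=lambda kv: margin_per_gpu_hour(*kv[1]), reverse=True)
for name, params in ranked:
    print(f"{name:22s} ${margin_per_gpu_hour(*params):6.3f}/GPU-hr")
```

Note that the experimental tier scores high on utilization but carries negative margin; this is the "great for utilization, terrible for margin" trap called out later in this posting.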

What Success Looks Like

In the first 3–6 months, you will:

  • Stand up a production inference platform for our initial GB300 NVL72 deployment

  • Recommend the highest-value initial workloads and monetization channels

  • Launch a repeatable commercialization strategy for rack capacity

  • Establish a clear performance and revenue measurement framework

  • Identify where we should sell capacity: direct, through marketplaces, via strategic partners, or through a hybrid approach

  • Turn the first cluster into a measurable cash-generating operation

In the first 12 months, you will:

  • Build the operating playbook for scaling from 7MW to 50MW

  • Increase utilization without destroying margins or SLA quality

  • Improve realized revenue per rack through model, routing, pricing, and customer mix optimization

  • Establish the company as a serious real-time inference operator, not just a GPU owner

Required Experience

  • Significant experience in production AI/LLM inference, MLOps, model serving, or AI infrastructure monetization

  • Proven experience running or scaling GPU-backed inference systems in production

  • Strong understanding of modern inference runtimes, serving frameworks, and optimization techniques

  • Experience with one or more of:

    • vLLM

    • TensorRT-LLM

    • SGLang

    • Ray Serve

    • Triton Inference Server

    • Kubernetes-based GPU orchestration

    • custom routing / scheduler layers

  • Experience optimizing for real-world production metrics such as throughput, latency, GPU utilization, availability, and cost efficiency

  • Strong understanding of LLM inference economics, including tradeoffs among model size, quantization, latency, throughput, memory footprint, and customer willingness to pay

  • Experience building or managing API-based AI platforms or inference products

  • Ability to translate infrastructure capability into a pricing and product strategy

  • Experience working with enterprise customers, developer platforms, or AI marketplaces

  • Strong technical judgment on model selection, infrastructure topology, and commercialization strategy
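One concrete instance of the inference economics named above: quantizing weights shrinks the model's memory footprint, which frees HBM for KV cache and therefore allows larger batches and higher throughput per GPU. A rough sizing sketch, where the model size, HBM capacity, and per-request KV-cache figures are illustrative assumptions:

```python
def max_batch(hbm_gb: float, params_b: float, bytes_per_weight: float,
              kv_gb_per_request: float) -> int:
    """How many concurrent requests fit once the weights are resident."""
    weights_gb = params_b * bytes_per_weight  # 1e9 params * N bytes ~= N GB per billion params
    free_gb = hbm_gb - weights_gb             # HBM left over for KV cache
    return max(0, int(free_gb // kv_gb_per_request))

# Illustrative 70B-parameter model on a 192 GB GPU, ~2 GB of KV cache per request:
print(max_batch(192, 70, 2.0, 2.0))  # FP16 weights (140 GB resident) -> batch 26
print(max_batch(192, 70, 1.0, 2.0))  # INT8 weights (70 GB resident)  -> batch 61
```

Whether the quantized variant actually earns more depends on whether customers will pay the same price for its output quality, which is the willingness-to-pay half of the tradeoff.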

Preferred Experience

  • Experience monetizing large-scale NVIDIA GPU infrastructure

  • Experience with rack-scale or cluster-scale inference environments

  • Background in both technical operations and business strategy

  • Familiarity with AI inference aggregators, routing platforms, and model marketplaces

  • Experience designing multi-tenant GPU systems with strong isolation and predictable performance

  • Experience with advanced observability, token-level metering, cost accounting, and SLA enforcement

  • Familiarity with reasoning-model workloads, agentic inference, multimodal inference, and future high-density AI factory architectures

  • Experience supporting OpenAI-compatible APIs and enterprise private deployments
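"OpenAI-compatible" in the bullet above means third-party clients can target the platform by changing only the base URL; the request itself follows the standard chat-completions schema, so off-the-shelf SDKs work unmodified. A minimal sketch of the request shape (the endpoint URL and model id below are hypothetical):

```python
import json

# Hypothetical self-hosted endpoint; the payload follows the OpenAI
# chat-completions schema, so existing client libraries need only a
# different base URL to route traffic here.
BASE_URL = "https://inference.example.com/v1"

payload = {
    "model": "llama-3.1-70b-instruct",  # hypothetical hosted model id
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 8,
    "stream": False,
}
print(f"POST {BASE_URL}/chat/completions")
print(json.dumps(payload, indent=2))
```

Compatibility at this layer is what makes marketplace and aggregator distribution cheap to turn on or off.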

What Makes Someone Great in This Role

  • You know the difference between “high benchmark performance” and “high realized revenue”

  • You understand that some workloads are great for utilization but terrible for margin

  • You can spot when a shiny model is commercially useless

  • You know how to tune systems for the workloads customers will actually pay for

  • You are opinionated about the stack, but flexible about the business model

  • You can go deep technically and still think like an owner

Compensation

Competitive salary, bonus, and equity participation tied to the scale, importance, and revenue impact of the role.

Top Skills

Kubernetes
NVIDIA GB300 NVL72
Ray Serve
SGLang
TensorRT-LLM
Triton Inference Server
vLLM
The Company
HQ: Austin, TX
20 Employees
Year Founded: 2025

What We Do

Turning market data into decisive action

Similar Jobs

Motorola Solutions Logo Motorola Solutions

Sales Engineer

Artificial Intelligence • Hardware • Information Technology • Security • Software • Cybersecurity • Big Data Analytics
Remote or Hybrid
Brazil
23000 Employees

CrowdStrike Logo CrowdStrike

Senior Customer Success Manager

Cloud • Computer Vision • Information Technology • Sales • Security • Cybersecurity
Remote or Hybrid
2 Locations
10000 Employees

Motorola Solutions Logo Motorola Solutions

Senior Software Engineer

Artificial Intelligence • Hardware • Information Technology • Security • Software • Cybersecurity • Big Data Analytics
Remote or Hybrid
Brazil
23000 Employees

Motorola Solutions Logo Motorola Solutions

Senior Full-stack Engineer

Artificial Intelligence • Hardware • Information Technology • Security • Software • Cybersecurity • Big Data Analytics
Remote or Hybrid
Brazil
23000 Employees

Similar Companies Hiring

Amplify Platform Thumbnail
Fintech • Financial Services • Consulting • Cloud • Business Intelligence • Big Data Analytics
Scottsdale, AZ
62 Employees
Rain Thumbnail
Blockchain • Fintech • Payments • Financial Services • Cryptocurrency • Web3 • Infrastructure as a Service (IaaS)
New York, NY
100 Employees
Granted Thumbnail
Mobile • Insurance • Healthtech • Financial Services • Artificial Intelligence
New York, New York
23 Employees

Sign up now Access later

Create Free Account

Please log in or sign up to report this job.

Create Free Account