Director, Infrastructure

Posted 3 Days Ago
3 Locations
In-Office
250K-350K Annually
Senior level
Artificial Intelligence • Software
The Role
Lead the Infrastructure Engineering team at Fluidstack, overseeing the design and deployment of GPU clusters while ensuring operational reliability. Collaborate with various departments and maintain a hands-on approach to managing hardware and software integration for AI workloads.
Summary Generated by Built In
About Fluidstack

At Fluidstack, we’re building the infrastructure for abundant intelligence. We partner with top AI labs, governments, and enterprises - including Mistral, Poolside, Black Forest Labs, Meta, and more - to unlock compute at the speed of light.

We’re working with urgency to make AGI a reality. As such, our team is highly motivated and committed to delivering world-class infrastructure. We treat our customers’ outcomes as our own, taking pride in the systems we build and the trust we earn. If you’re motivated by purpose, obsessed with excellence, and ready to work very hard to accelerate the future of intelligence, join us in building what's next.

About the Role

Fluidstack is hiring a Director of Infrastructure to own the hardware that powers some of the largest AI clusters in the world. You will lead a team of Networking Engineers, Compute Systems Engineers, Storage Engineers, and ICT specialists, and coordinate tightly with Procurement, DC Operations, Software Engineering, SRE, Finance, Security, and Sales to ensure Fluidstack can deliver clusters faster and operate them more reliably than anyone else in the world. You are expected to be exceptional at both ends of the communication spectrum: technically precise with engineering stakeholders, and credible with customers, partners, and executive stakeholders.

You have personally shipped a 10,000+ GPU cluster using current-generation hardware. You know what it takes to bring one up in weeks rather than months, and you have built the tooling, runbooks, and team culture to do it repeatedly.

You Will
  • Own the technical design, deployment, and operational reliability of Fluidstack's bare-metal clusters across all production sites, covering compute, storage, and networking infrastructure.

  • Lead the Infrastructure Engineering organization, comprising Networking Engineers, Compute Systems Engineers, and Storage Engineers, with high standards for technical depth, deployment velocity, and on-call reliability.

  • Drive cluster architecture decisions for current-generation GPU systems (NVIDIA, AMD, and other XPUs), including server configuration, frontend and backend fabric design, storage topology, and rack power and cooling envelope.

  • Coordinate with Supply Chain on OEM relationships, hardware specifications, and delivery timelines to ensure the physical infrastructure roadmap stays one step ahead of customer commitments.

  • Partner with Data Center Operations on new site bring-ups, ensuring smooth handoff from civil and MEP completion through ICT work like rack placement and network cabling, and then to hardware racking, burn-in, and customer acceptance testing.

  • Work with Software Engineering and SRE to define infrastructure requirements for managed Kubernetes, SLURM, and inference serving, ensuring the physical layer meets the demands of the software stack.

  • Build and maintain deployment tooling, burn-in automation, and hardware lifecycle management systems that enable your team to operate at a pace and reliability level that sets Fluidstack apart.

  • Stay hands-on: participate in design reviews, be present for critical cluster bring-ups, and engage directly with complex infrastructure failures to maintain technical credibility with your team and across the organization.

  • Travel as needed to data centers, OEM facilities, customer sites, and industry events to stay close to the hardware, the partners, and the market.

  • Coordinate with Finance on infrastructure CapEx planning and cost modeling, with Security on hardening and compliance requirements, and with Sales on pre-sales technical diligence and capacity commitments to customers.
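The deployment-tooling and burn-in responsibilities above center on bare-metal automation. As a rough illustration only (not Fluidstack's actual tooling), a burn-in gate might aggregate Redfish-style node health payloads and report failures; the JSON shape follows the DMTF Redfish ComputerSystem schema, but the acceptance criteria here are hypothetical:

```python
# Hypothetical burn-in gate over Redfish-style ComputerSystem payloads.
# The payload shape (Status.Health, Status.State, PowerState) follows the
# DMTF Redfish schema; the pass/fail criteria are illustrative.

def node_passes_burn_in(system_payload: dict) -> bool:
    """Pass only if overall health is OK, the node is enabled, and power is on."""
    status = system_payload.get("Status", {})
    return (
        status.get("Health") == "OK"
        and status.get("State") == "Enabled"
        and system_payload.get("PowerState") == "On"
    )

def cluster_burn_in_report(payloads: dict) -> list:
    """Return node names that failed burn-in, sorted for stable runbook output."""
    return sorted(name for name, p in payloads.items()
                  if not node_passes_burn_in(p))
```

In practice a gate like this would sit behind per-node Redfish queries and feed acceptance-testing dashboards; the point is that pass/fail logic stays declarative and auditable.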

Basic Qualifications
  • 10+ years of infrastructure engineering experience, with at least 3 years in a technical leadership role managing a team of systems, networking, or storage engineers.

  • Demonstrated ownership of the design, deployment, and operation of a 10,000+ GPU cluster using a recent-generation accelerator (Blackwell, Hopper, or equivalent XPU), from physical hardware bring-up through production steady-state.

  • On-site, hands-on experience physically deploying hardware in data centers, with a clear sense of what it takes to execute a fast, reliable cluster bring-up.

  • Deep expertise in high-performance networking for AI workloads: InfiniBand (XDR/NDR) or RoCEv2 fabric design, large-scale BGP and ECMP architectures, and switch and cable plant management.

  • Strong working knowledge of GPU server hardware internals: NVLink and PCIe topology, NVMe configurations, BMC and firmware management.

  • Experience with high-performance parallel and distributed storage systems for AI training workloads, such as DDN/Lustre, WekaFS, VAST, and open source solutions.

  • Exceptional written and verbal communication skills, with the ability to translate between deep technical detail and high-level summaries for engineering, executive, and customer audiences.
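The fabric-design qualification above references large-scale ECMP architectures. As a toy sketch of the underlying mechanism (illustrative only, not any vendor's implementation), per-flow ECMP hashes a flow's 5-tuple to pick one of N equal-cost next hops, keeping packets of one flow on one path:

```python
# Toy per-flow ECMP next-hop selection: hash the 5-tuple and index into the
# equal-cost next-hop list, as leaf-spine fabrics do to avoid reordering
# within a flow. Purely illustrative; real ASICs use hardware hash functions.
import hashlib

def ecmp_next_hop(src: str, dst: str, sport: int, dport: int,
                  proto: str, next_hops: list) -> str:
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return next_hops[digest % len(next_hops)]
```

The property that matters for AI training fabrics is determinism: the same flow always lands on the same spine, while distinct flows spread across all paths.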

Preferred Qualifications
  • Prior experience at a hyperscaler, neocloud, or GPU OEM in a senior infrastructure or systems engineering role.

  • Experience building and operating bare-metal management tools such as MAAS, NetBox, and Redfish, including automation of imaging, firmware updates, and hardware lifecycle workflows.

  • Hands-on experience with GPU NPI processes: hardware qualification, acceptance testing, burn-in procedures, and vendor escalation for platform-level defects at cluster scale.

  • Familiarity with current-generation networking products (InfiniBand, RoCE) and the systems-level tradeoffs between them for large-scale AI training and inference.

  • Experience with data center physical infrastructure tradeoffs relevant to GPU-dense deployments: direct liquid cooling, rear-door heat exchangers, high-density PDU and busway configurations, and their impact on cluster layout and availability.

  • An understanding of the software running on these clusters, including Kubernetes, SLURM, PyTorch, and JAX, sufficient to reason about how infrastructure decisions affect workload performance and reliability.

  • Experience representing infrastructure capabilities in customer-facing or commercial contexts, including pre-sales technical diligence with enterprise or government customers.
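The rack power and cooling tradeoffs mentioned above reduce to budget arithmetic at the planning stage. A back-of-the-envelope sketch (all figures hypothetical, not drawn from the posting):

```python
# Hypothetical rack power budgeting: how many GPU servers fit in a rack
# given per-server draw and the rack's power budget, after reserving fixed
# overhead for switches and cooling. All numbers are illustrative.

def servers_per_rack(server_kw: float, rack_budget_kw: float,
                     overhead_kw: float = 2.0) -> int:
    """Whole servers that fit after reserving fixed overhead."""
    usable = rack_budget_kw - overhead_kw
    if usable <= 0:
        return 0
    return int(usable // server_kw)
```

Calculations like this are why GPU-dense deployments push toward direct liquid cooling and high-density busways: a higher per-rack budget directly shortens the fabric's cable plant.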

Salary and Benefits

The base salary range for this role is $250,000 to $350,000. Starting salary will be determined based on relevant experience, skills, and market location. In addition to base salary, this role includes a meaningful equity package, performance bonus, and the following benefits:

  • Competitive total compensation package (salary + equity).

  • Retirement or pension plan, in line with local norms.

  • Health, dental, and vision insurance.

  • Generous PTO policy, in line with local norms.

We are committed to pay equity and transparency.

Fluidstack is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans’ status, or any other characteristic protected by law. Fluidstack will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.

You will receive a confirmation email once your application has successfully been accepted. If there is an error with your submission and you did not receive a confirmation email, please email [email protected] with your resume/CV, the role you've applied for, and the date you submitted your application, and someone from our recruiting team will be in touch.

Top Skills

AMD
DDN
InfiniBand
Kubernetes
Lustre
NVIDIA
NVLink
NVMe
PCIe
RoCEv2
SLURM
WekaFS

The Company
HQ: London
30 Employees
Year Founded: 2017

What We Do

Instantly reserve dedicated clusters of NVIDIA H200s and GB200s for any scale to supercharge your training and inference workflows.
