Director, Infrastructure

Posted 3 Days Ago
3 Locations
In-Office
250K-350K Annually
Senior level
Artificial Intelligence • Software
The Role
Lead the Infrastructure Engineering team at Fluidstack, overseeing the design and deployment of GPU clusters while ensuring operational reliability. Collaborate with various departments and maintain a hands-on approach to managing hardware and software integration for AI workloads.
Summary Generated by Built In
About Fluidstack

At Fluidstack, we’re building the infrastructure for abundant intelligence. We partner with top AI labs, governments, and enterprises - including Mistral, Poolside, Black Forest Labs, Meta, and more - to unlock compute at the speed of light.

We’re working with urgency to make AGI a reality. As such, our team is highly motivated and committed to delivering world-class infrastructure. We treat our customers’ outcomes as our own, taking pride in the systems we build and the trust we earn. If you’re motivated by purpose, obsessed with excellence, and ready to work very hard to accelerate the future of intelligence, join us in building what's next.

About the Role

Fluidstack is hiring a Director of Infrastructure to own the hardware that powers some of the largest AI clusters in the world. You will lead a team of Networking Engineers, Compute Systems Engineers, Storage Engineers, and ICT specialists, and coordinate tightly with Procurement, DC Operations, Software Engineering, SRE, Finance, Security, and Sales to ensure Fluidstack can deliver clusters faster and operate them more reliably than anyone else in the world. You are expected to be exceptional at both ends of the communication spectrum: technically precise with engineering stakeholders, and credible with customers, partners, and executive stakeholders.

You have personally shipped a 10,000+ GPU cluster using current-generation hardware. You know what it takes to bring one up in weeks rather than months, and you have built the tooling, runbooks, and team culture to do it repeatedly.

You Will
  • Own the technical design, deployment, and operational reliability of Fluidstack's bare-metal clusters across all production sites, covering compute, storage, and networking infrastructure.

  • Lead the Infrastructure Engineering organization, comprising Networking Engineers, Compute Systems Engineers, and Storage Engineers, with high standards for technical depth, deployment velocity, and on-call reliability.

  • Drive cluster architecture decisions for current-generation GPU systems (NVIDIA, AMD, and other XPUs), including server configuration, frontend and backend fabric design, storage topology, and rack power and cooling envelope.

  • Coordinate with Supply Chain on OEM relationships, hardware specifications, and delivery timelines to ensure the physical infrastructure roadmap stays one step ahead of customer commitments.

  • Partner with Data Center Operations on new site bring-ups, ensuring smooth handoff from civil and MEP completion through ICT work like rack placement and network cabling, and then to hardware racking, burn-in, and customer acceptance testing.

  • Work with Software Engineering and SRE to define infrastructure requirements for managed Kubernetes, SLURM, and inference serving, ensuring the physical layer meets the demands of the software stack.

  • Build and maintain deployment tooling, burn-in automation, and hardware lifecycle management systems that enable your team to operate at a pace and reliability level that sets Fluidstack apart.

  • Stay hands-on: participate in design reviews, be present for critical cluster bring-ups, and engage directly with complex infrastructure failures to maintain technical credibility with your team and across the organization.

  • Travel as needed to data centers, OEM facilities, customer sites, and industry events to stay close to the hardware, the partners, and the market.

  • Coordinate with Finance on infrastructure CapEx planning and cost modeling, with Security on hardening and compliance requirements, and with Sales on pre-sales technical diligence and capacity commitments to customers.
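The deployment-tooling and burn-in responsibilities above center on bare-metal automation. As a rough illustration only (not Fluidstack's actual tooling), a burn-in gate might aggregate Redfish-style node health payloads and report failures; the JSON shape follows the DMTF Redfish ComputerSystem schema, but the acceptance criteria here are hypothetical:

```python
# Hypothetical burn-in gate over Redfish-style ComputerSystem payloads.
# The payload shape (Status.Health, Status.State, PowerState) follows the
# DMTF Redfish schema; the pass/fail criteria are illustrative.

def node_passes_burn_in(system_payload: dict) -> bool:
    """Pass only if overall health is OK, the node is enabled, and power is on."""
    status = system_payload.get("Status", {})
    return (
        status.get("Health") == "OK"
        and status.get("State") == "Enabled"
        and system_payload.get("PowerState") == "On"
    )

def cluster_burn_in_report(payloads: dict) -> list:
    """Return node names that failed burn-in, sorted for stable runbook output."""
    return sorted(name for name, p in payloads.items()
                  if not node_passes_burn_in(p))
```

In practice a gate like this would sit behind per-node Redfish queries and feed acceptance-testing dashboards; the point is that pass/fail logic stays declarative and auditable.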

Basic Qualifications
  • 10+ years of infrastructure engineering experience, with at least 3 years in a technical leadership role managing a team of systems, networking, or storage engineers.

  • Demonstrated ownership of the design, deployment, and operation of a 10,000+ GPU cluster using a recent-generation accelerator (Blackwell, Hopper, or equivalent XPU), from physical hardware bring-up through production steady-state.

  • On-site, hands-on experience physically deploying hardware in data centers, with a clear sense of what it takes to execute a fast, reliable cluster bring-up.

  • Deep expertise in high-performance networking for AI workloads: InfiniBand (XDR/NDR) or RoCEv2 fabric design, large-scale BGP and ECMP architectures, and switch and cable plant management.

  • Strong working knowledge of GPU server hardware internals: NVLink and PCIe topology, NVMe configurations, BMC and firmware management.

  • Experience with high-performance parallel and distributed storage systems for AI training workloads, such as DDN/Lustre, WekaFS, VAST, and open source solutions.

  • Exceptional written and verbal communication skills, with the ability to translate between deep technical detail and high-level summaries for engineering, executive, and customer audiences.
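The fabric-design qualification above references large-scale ECMP architectures. As a toy sketch of the underlying mechanism (illustrative only, not any vendor's implementation), per-flow ECMP hashes a flow's 5-tuple to pick one of N equal-cost next hops, keeping packets of one flow on one path:

```python
# Toy per-flow ECMP next-hop selection: hash the 5-tuple and index into the
# equal-cost next-hop list, as leaf-spine fabrics do to avoid reordering
# within a flow. Purely illustrative; real ASICs use hardware hash functions.
import hashlib

def ecmp_next_hop(src: str, dst: str, sport: int, dport: int,
                  proto: str, next_hops: list) -> str:
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return next_hops[digest % len(next_hops)]
```

The property that matters for AI training fabrics is determinism: the same flow always lands on the same spine, while distinct flows spread across all paths.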

Preferred Qualifications
  • Prior experience at a hyperscaler, neocloud, or GPU OEM in a senior infrastructure or systems engineering role.

  • Experience building and operating bare-metal management tools such as MAAS, NetBox, and Redfish, including automation of imaging, firmware updates, and hardware lifecycle workflows.

  • Hands-on experience with GPU NPI processes: hardware qualification, acceptance testing, burn-in procedures, and vendor escalation for platform-level defects at cluster scale.

  • Familiarity with current-generation networking products (InfiniBand, RoCE) and the systems-level tradeoffs between them for large-scale AI training and inference.

  • Experience with data center physical infrastructure tradeoffs relevant to GPU-dense deployments: direct liquid cooling, rear-door heat exchangers, high-density PDU and busway configurations, and their impact on cluster layout and availability.

  • An understanding of the software running on these clusters, including Kubernetes, SLURM, PyTorch, and JAX, sufficient to reason about how infrastructure decisions affect workload performance and reliability.

  • Experience representing infrastructure capabilities in customer-facing or commercial contexts, including pre-sales technical diligence with enterprise or government customers.
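The rack power and cooling tradeoffs mentioned above reduce to budget arithmetic at the planning stage. A back-of-the-envelope sketch (all figures hypothetical, not drawn from the posting):

```python
# Hypothetical rack power budgeting: how many GPU servers fit in a rack
# given per-server draw and the rack's power budget, after reserving fixed
# overhead for switches and cooling. All numbers are illustrative.

def servers_per_rack(server_kw: float, rack_budget_kw: float,
                     overhead_kw: float = 2.0) -> int:
    """Whole servers that fit after reserving fixed overhead."""
    usable = rack_budget_kw - overhead_kw
    if usable <= 0:
        return 0
    return int(usable // server_kw)
```

Calculations like this are why GPU-dense deployments push toward direct liquid cooling and high-density busways: a higher per-rack budget directly shortens the fabric's cable plant.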

Salary and Benefits

The base salary range for this role is $250,000 to $350,000. Starting salary will be determined based on relevant experience, skills, and market location. In addition to base salary, this role includes a meaningful equity package, performance bonus, and the following benefits:

  • Competitive total compensation package (salary + equity).

  • Retirement or pension plan, in line with local norms.

  • Health, dental, and vision insurance.

  • Generous PTO policy, in line with local norms.

We are committed to pay equity and transparency.

Fluidstack is an Equal Employment Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans’ status, or any other characteristic protected by law. Fluidstack will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.

You will receive a confirmation email once your application has successfully been accepted. If there is an error with your submission and you did not receive a confirmation email, please email [email protected] with your resume/CV, the role you've applied for, and the date you submitted your application, and someone from our recruiting team will be in touch.

Top Skills

AMD
DDN
InfiniBand
Kubernetes
Lustre
NVIDIA
NVLink
NVMe
PCIe
RoCEv2
SLURM
WekaFS

The Company
HQ: London
30 Employees
Year Founded: 2017

What We Do

Instantly reserve dedicated clusters of NVIDIA H200s and GB200s for any scale to supercharge your training and inference workflows.
