Senior/Staff Infrastructure Engineer

San Francisco, CA, USA
In-Office
180K-250K Annually
Senior level
Cloud • Digital Media • Information Technology
Generative media platform for developers.
The Role
Build and maintain Python-based fleet management and server tooling for thousands of GPU servers: automate provisioning, health monitoring, diagnostics, and recovery; create metrics and dashboards; enforce OS security; manage storage; tune Linux for AI workloads; and drive resolutions with partners.
Summary Generated by Built In

You are a hands-on engineer who builds the software and processes that keep a large fleet of GPU servers healthy and productive. You write systems and tooling for managing thousands of servers, including provisioning, health monitoring, error detection, and recovery. When something breaks that automation can’t fix, you drive resolution with partners.

Key responsibilities
  • Build and maintain a Python fleet-tracking system that manages the full lifecycle of servers, including contracting and procurement, target use, pricing, availability, health, and RMAs
  • Build server management tooling that automates provisioning, health checks, GPU diagnostics, recovery, and alerting
  • Create and maintain metrics, dashboards, and alerting for hardware health across the fleet (GPU errors, disk failures, network issues, thermals)
  • Use AI extensively to build tools and automate alerting and recovery
  • Implement and enforce OS-level security: hardening baselines, SELinux/AppArmor policies, SSH key management, vulnerability scanning, and compliance automation
  • Manage and optimize distributed and local storage systems supporting model weights, checkpoints, and ephemeral scratch: NVMe arrays, NFS, parallel file systems, and object storage
  • Tune Linux systems for AI workloads: kernel parameters, NUMA topology, CPU pinning, hugepages, I/O schedulers, and GPU driver stack optimization (NVIDIA drivers, CUDA, container runtimes)
  • Develop a suite of automated error detection and recovery processes
  • Work with partners to solve technical issues
Requirements
  • 5+ years of experience managing bare-metal and VM server fleets at scale (100+ nodes)
  • Strong software engineering skills in Python; you write production tooling, not scripts
  • Deep Linux systems knowledge: boot process, kernel tuning, networking, storage, systemd, cgroups, namespaces, performance profiling
  • Strong experience with configuration management and infrastructure-as-code: Ansible, Terraform, cloud-init
  • Solid understanding of storage technologies: LVM, RAID, NVMe, NFS, Lustre or GPFS, and Linux I/O stack tuning
  • Familiarity with hardware diagnostics and failure modes (GPUs, NVMe, NICs, memory)
  • Experience building internal tools or dashboards for infrastructure visibility
  • Excellent communication and ability to drive technical decisions across teams
  • Self-starter who executes quickly, takes ownership, and constantly seeks improvement
Nice to have
  • Familiarity with network configuration and diagnostics (VLAN, VXLAN, ECMP, BGP, tcpdump)
  • Experience with NVIDIA GPU infrastructure: driver management, health monitoring, DCGM, NVLink/NVSwitch diagnostics, RDMA, InfiniBand/RoCEv2
  • Experience with AMD GPUs
  • Experience with bare-metal and VM provisioning (PXE/iPXE, Kickstart, libvirt, QEMU/KVM)
  • Experience with compliance frameworks relevant to cloud providers (SOC 2, ISO 27001)
Compensation
  • $180,000-$250,000 plus equity and benefits
Location
  • San Francisco, CA

What we offer at fal
  • Interesting and challenging work
  • A lot of learning and growth opportunities
  • We are currently hiring in downtown San Francisco.
  • We offer visa sponsorship and will help you relocate to San Francisco.
  • Health, dental, and vision insurance (US)
  • Regular team events and offsites

Top Skills

Ansible
AppArmor
Cgroups
Cloud-Init
Container Runtimes
CUDA
GPFS
Hugepages
I/O Schedulers
Linux
Lustre
LVM
Namespaces
NFS
NUMA
NVIDIA Drivers
NVMe
Python
RAID
SELinux
SSH
Systemd
Terraform
The Company
73 Employees

What We Do

Generative Media Cloud
