Operations Engineer, HPC Networking

Posted 8 Days Ago
Hiring Remotely in USA
Remote
Mid level
Cloud • Digital Media • Information Technology
Generative media platform for developers.
The Role
As an Operations Engineer, you will manage HPC networking, monitor and debug InfiniBand and Ethernet fabrics, and support fabric bring-up while enhancing operational tools and runbooks.
Summary Generated by Built In

fal is the generative media ecosystem powering the next generation of AI products. We build the infrastructure, tools, and model access that teams need to move from idea to production, and do it at scale without compromise. For developers and enterprises, fal is the foundation that makes generative media not just possible, but practical: a unified platform where high-performance inference, orchestration, and observability come together to unlock new categories of AI-native products.

As generative media reshapes industries, in a market projected to grow by hundreds of billions of dollars over the next decade, fal is becoming the ecosystem that ambitious teams build on.

About the role

We're hiring an Operations Engineer for HPC Networking to keep our InfiniBand and Ethernet fabrics healthy as we scale.

This is a hands-on role. You'll bring up new fabrics alongside DC ops, monitor the ones in production, and chase down the weird stuff: link flaps, congestion, NCCL stalls, firmware bugs that only show up at scale. 

You're a fit if you've:
  • Operated InfiniBand fabrics in production: subnet manager, routing, partitioning, monitoring.
  • Debugged the full stack: cables, transceivers, switch firmware, HCAs, drivers, NCCL.
  • Brought up new fabrics from cable pull through validation.
  • Scripted your way through repetitive operational work (Bash, Python, Go, whatever).
  • Nice to have: worked with RoCE, Spectrum-X, or large-scale GPU cluster networking.
Who you are:
  • Detail-oriented. Cable plant hygiene is a personality trait.
  • Calm under fire. A fabric incident during a customer training run doesn't rattle you.
  • You read vendor release notes for fun, or at least out of self-defense.
  • You'd rather find the root cause than reboot the switch.
Responsibilities:
  • Monitor health and performance of InfiniBand and Ethernet fabrics: switches, HCAs, transceivers, links.
  • Investigate and resolve fabric issues: connectivity, congestion, performance regressions.
  • Support fabric bring-up alongside DC ops and customer-facing teams.
  • Run maintenance and upgrades on switches and control plane components.
  • Partner with cluster ops on cross-domain incidents where the line between compute and network is blurry.
  • Improve the tooling and runbooks so the next incident resolves faster than the last.

The Company
73 Employees

What We Do

Generative Media Cloud
