Spellbrush

HPC/ML Infrastructure Engineer

Reposted 15 Days Ago

2 Locations

In-Office

Mid level

Computer Vision • Gaming • Sports • Esports

The Role

Lead bring-up, administration, and operations of a large GPU/AI training cluster. Serve as bridge between researchers and hardware, ensuring SLURM jobs, parallel filesystems, networking, and monitoring operate reliably. Work across provisioning, storage, VPN/access, and traditional Linux sysadmin tasks; assist with physical racking and on-site datacenter needs. Collaborate closely with a small research team in Tokyo or San Francisco.

Summary Generated by Built In

We’re looking for an experienced HPC infrastructure engineer to lead bring-up, administration, and operations on what is probably the largest anime AI training cluster in the world. You’ll serve as the bridge between our researchers and the bare GPU machines, helping to make sure that SLURM jobs are running, parallel filesystems are serving, the network is transmitting, and the anime models are training.

You may be a good fit if:

You love anime and the anime aesthetic.

This is probably one of the only jobs in the world where you will get to combine your love of anime and large-scale GPU systems.

You’re familiar with the modern HPC software landscape

Once upon a time, our team could install SLURM on a few bare-metal nodes and get away with it. Now the landscape has become unbelievably complex, with SLURM deployed through Slinky on K8s, provisioning through Warewulf/MAAS/Ansible, filesystems through WEKA/VAST/Ceph, VPN and access through Tailscale, and monitoring via the Grafana/Prometheus stack. We’re looking for someone with relevant experience up and down the stack (and maybe a papercut or two to show for it!).

As well as the traditional sysadmin landscape

Bringing up and managing a cluster still requires good old Linux sysadmin skills, including wrangling LDAP, triaging dmesg, and setting sticky bits on directories for misbehaving users and tools.

You're not afraid of physical computers

We’re building out edge datacenters and our CEO is still personally racking, stacking, and provisioning HGX-based nodes in our living room. Also his VLAN design sucks and he’s bad at fiber routing. Please send help.

And you're comfortable working on small, fast-paced teams.

We currently have a very tiny research team, and you’ll be directly helping our AI researchers train the best anime image model in the world.

We also believe in the unmatched speed of in-person teams, and prefer on-site collaboration in either our primary research office in Tokyo (downtown Akihabara) or San Francisco (Dogpatch!). The Bay Area is strongly preferred, as that is where we have physical hardware. Visa sponsorships are available.

Skills Required

Experience as an HPC infrastructure engineer
Experience with SLURM
Experience deploying SLURM via Slinky on Kubernetes (K8s)
Experience with provisioning tools: Warewulf, MAAS, Ansible
Experience with parallel filesystems: WEKA, VAST, Ceph
Experience with VPN/access tooling such as Tailscale
Experience with monitoring stack: Grafana and Prometheus
Strong Linux sysadmin skills (LDAP, system logs/dmesg, permissions)
Experience operating large-scale GPU training infrastructure (HGX/GPU clusters)
Comfort with physical server handling, racking, cabling, and datacenter ops
Comfort working on small, fast-paced teams and directly supporting researchers
Willingness/ability to collaborate on-site in Tokyo (Akihabara) or San Francisco (Dogpatch); Bay Area strongly preferred
Affinity for anime and anime aesthetic

The Company

HQ: San Francisco, California

23 Employees

Year Founded: 2018

What We Do

Here at Spellbrush, we're passionate about making a good anime game. We also happen to be the world's leading generative AI studio: we're the team behind niji・journey. We are currently investigating how AI can be used to help human artists create masterpieces in the most complex medium of our times: video games. Our games are characterized by a no-compromise approach to well-balanced gameplay married to a genuine love of visual arts. If you love turn-based tactics games, please consider joining!