Senior HPC Engineer, Research Infrastructure

Palo Alto, CA
In-Office
180K-220K Annually
Senior level
Digital Media
The Role
As a Senior HPC Engineer, you'll design and manage AI supercomputing clusters, optimize hardware and software performance, and support researchers in large-scale projects.
Summary Generated by Built In
Help Luma build some of the biggest and fastest AI supercomputing clusters in the world! As a High-Performance Computing engineer, you’ll work at the intersection of hardware and software, designing systems that deliver the maximum possible performance for running large-scale AI models. We work at the very cutting edge of speed and scale, combining the traditions of High-Performance Computing (HPC) with a modern cloud environment.

For this role, it’s important that you understand how to combine CPUs, GPUs, and network devices into systems that are deployed at large scale and run at peak efficiency. You understand the lowest levels of the software platforms that sit on top of this hardware, including how best to optimize the Linux kernel and user-space code. You are capable of writing code to automate the monitoring and healing of these systems, so that a small team can operate a large number of servers.

Responsibilities

  • In this role, you will work closely with machine learning researchers and directly accelerate their work, but you don't need to be a machine learning expert yourself.
  • We value people who can quickly obtain a deep technical understanding of new domains and enjoy being self-directed and identifying the most important problems to solve. 
  • You’ll manage Luma’s HPC training clusters, from provisioning to performance tuning.
  • Areas of work will include observability, distributed job tracing, GPU diagnostics, software environment management, and additional tooling, as well as work on the code itself to enable necessary features.
  • We believe that increasing compute is a huge lever for AI progress. You will have a direct impact on our ability to grow to an unprecedented scale and, in turn, produce unprecedented results.

Experience

  • 8+ years of experience as an infrastructure or DevOps engineer working on large, complex distributed systems.
  • Deep understanding of networking; bonus points for experience in HPC networking.
  • Experience developing high-quality software in a general-purpose programming language, preferably including Python.
  • Excellent problem-solving skills and attention to detail.
  • Experience with GPUs in large-scale clusters is strongly preferred.
  • Strong knowledge of observability and monitoring in distributed systems.
  • Tenacious at troubleshooting hardware and network topology failures in distributed systems.
  • Independently driven and able to own problems and build solutions end to end.
  • Experience with large-scale data center operations and proficiency with cloud orchestration and system tools.

Compensation

  • In addition to cash base pay, you'll also receive a sizable grant of Luma's equity.
  • The pay range for this position is $180,000-$220,000/yr for the Bay Area. Base pay offered will vary depending on job-related knowledge, skills, candidate location, and experience.

Your application is reviewed by real people.

Top Skills

Cloud Orchestration
Distributed Systems
GPUs
HPC
Linux
Networking
Python
The Company
Minneapolis, MN
0 Employees

What We Do

Luma is a multimedia platform that delivers personalized movie and TV program selections from a range of sources to its viewers.
