Senior High Performance Computing Cluster Administrator

Posted 2 Days Ago
Santa Clara, CA
Senior level
Artificial Intelligence • Hardware • Robotics • Software • Metaverse
The Role
The role involves leading the administration of GPU-accelerated HPC clusters, providing architectural guidance, automating system management, and enhancing resource utilization. Responsibilities include coordinating storage solutions, planning for system upgrades, and collaborating with management on equipment issues.
Summary Generated by Built In

NVIDIA's Deep Learning Optimized Frameworks Group is looking for a deeply technical HPC cluster administrator to lead a diverse cluster of GPU-accelerated systems and provide architectural mentorship to product teams in the deep learning and scientific computing domains. As a member of the DLFW Infrastructure team, you will provide leadership in the design and implementation of a groundbreaking GPU compute cluster that runs demanding deep learning, high-performance computing, and other computationally intensive workloads. We are looking for an expert who can identify architectural changes or entirely new approaches for our GPU compute cluster. In this role, you will help us with the strategic challenges we encounter, including compute, networking, and storage design for large-scale, high-performance workloads and effective resource utilization in a heterogeneous compute environment.

What you'll be doing:

  • Administer Linux systems ranging from powerful DGX servers to embedded systems, and from bring-up hardware to publicly available systems.

  • Coordinate storage solutions and plan for growth.

  • Automate configuration management, software updates, maintenance, and monitoring of system availability using modern DevOps tools (Ansible, GitLab, etc.).

  • Proactively communicate with management about any equipment problems and propose resolutions.

  • Plan, build, and install or upgrade systems that support NVIDIA DL software.

What we need to see:

  • You have a BA, BS, or MS in CS, EE, CE or equivalent experience

  • 4+ years of experience deploying and administering HPC clusters

  • Familiarity with resource scheduling managers (Slurm preferred, LSF, etc.)

  • Proven ability to script in Bash, Perl, or Python

  • Experience with containers (Docker, Singularity, LXC)

  • Deep understanding of operating systems, computer networks, and high-performance applications

  • Ability to work well with developers & test engineers

  • Dedication to providing quality support for your users

Ways to stand out from the crowd:

  • Familiarity and prior work experience with technologies such as Ansible, Git, Slurm, Zabbix, Prometheus, Grafana, and Docker

  • Familiarity with GPU usage in compute clusters and with CUDA

  • Experience with mobile and embedded systems

  • Basic knowledge of Deep Learning.

  • Experience coding/scripting in Perl, Python, or Bash

The base salary range is 148,000 USD - 230,000 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Top Skills

Bash
Perl
Python
The Company
HQ: Santa Clara, CA
21,960 Employees
On-site Workplace
Year Founded: 1993

What We Do

NVIDIA’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing — with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, NVIDIA is increasingly known as “the AI computing company.”
