Senior Systems Software Engineer, Kubernetes Node Lifecycle - DGX Cloud

2 Locations
In-Office
184K-357K Annually
Senior level
Artificial Intelligence • Computer Vision • Hardware • Robotics • Metaverse
The Role
Lead development and maintenance of node provisioning, OS image build/packaging, and nodepool lifecycle for NVIDIA Kubernetes Engine. Build CAPI providers, BYON onboarding, hardened image pipelines with automated CVE remediation and compliance gating, and automated test suites. Troubleshoot node-layer failures at scale, collaborate with upstream Kubernetes/CAPI communities, and ensure reliable GPU-optimized node operations across large clusters.
Summary Generated by Built In

At NVIDIA, the DGX Cloud division merges fresh hardware and software innovations to offer leading accelerated computing solutions for the most challenging AI workloads worldwide. Our team of skilled engineers is committed to addressing major global issues, consistently advancing technology, and making a difference in millions of lives around the world!

We are looking for a Senior Systems Software Engineer with strong experience in Kubernetes node engineering, OS image packaging, and cloud infrastructure. The ideal candidate will possess deep hyperscaler-level knowledge across the entire node lifecycle. This covers CAPI providers, bring-your-own-node onboarding, OS image build pipelines, packaging, and nodepool management. They must have the technical depth needed to maintain cluster reliability at frontier AI scale. In this vital role, you will manage the node layer within NVIDIA Kubernetes Engine (NKE). Your work will ensure it scales to fulfill DGX Cloud's two main goals: supporting internal researchers and enabling NCPs. Are you prepared to innovate?

What you'll be doing:

  • Direct the building and refinement of CAPI providers for NVIDIA Kubernetes Engine, maintaining steady, consistent, and scalable node provisioning across DGX Cloud and NCP environments.

  • Develop and maintain bring-your-own-node workflows that allow customers to integrate different NVIDIA hardware into NKE clusters while ensuring high operational consistency.

  • Coordinate OS image generation, packaging, deployment, and update processes for NKE nodes. Ensure images are fine-tuned for NVIDIA GPU workloads and satisfy enterprise- and cloud-grade security and compliance criteria.

  • Develop and sustain node image hardening pipelines, incorporating CIS benchmarks, automated CVE remediation, and promotion gates connected to security posture.

  • Develop and maintain automated test suites for node images that verify correctness across Kubernetes versions and NVIDIA hardware configurations before production deployment, and that support continuous validation through modern CI/CD pipelines.

  • Handle nodepool lifecycle at scale, including provisioning, upgrades, drain and cordon workflows, and seamless node replacement across very large clusters with diverse NVIDIA hardware.

  • Investigate and resolve node-layer faults in production NKE clusters and determine their root causes, including faults involving image configuration, driver packaging, kubelet operation, and hardware activation. Review and optimize the node layer under real-world, high-scale conditions.

  • Partner with upstream communities including Cluster API, Kubernetes, and CNCF projects to establish node provisioning and lifecycle standards in accordance with NKE requirements. Communicate your progress and findings at internal and external gatherings such as KubeCon and GTC.

What we need to see:

  • 8 years of experience with a background in systems software, cloud infrastructure, or Kubernetes node engineering.

  • Bachelor’s or Master’s degree in Engineering (Electrical, Computer Engineering, Computer Science) or equivalent experience.

  • Deep expertise in Cluster API (CAPI), including provider development and full machine lifecycle from provisioning to deletion.

  • Extensive experience with OS image build pipelines, node image packaging, and delivery systems for Kubernetes nodes (for example, image-builder, containerd, cloud-init, Packer).

  • Practical experience with bring-your-own-node models and integrating diverse hardware into live Kubernetes environments, including large-scale nodepool lifecycle management and upgrades.

  • Strong understanding of kubelet configuration, node bootstrap, and the Kubernetes node registration lifecycle.

  • Experience with node image security, including vulnerability scanning, patch automation, and compliance gating as part of image build pipelines.

  • Proficiency in Golang and/or Python, and hands-on experience with at least one major public cloud provider (GCP, AWS, Azure, OCI or equivalent).

Ways to stand out from the crowd:

  • Direct experience building or maintaining node image pipelines for a hyperscaler Kubernetes distribution (GKE, EKS, AKS, OKE, or equivalent).

  • Experience with supply chain security and hardening for node images, including image signing, provenance attestation, SBOM generation, CIS benchmark consistency, and automated CVE remediation.

  • Experience with automated node provisioning and optimal sizing at scale (for example, Karpenter, GKE NAP, or similar), and an understanding of how these interact with GPU workload scheduling.

  • Strong operational experience working with immutable OS image distributions (such as Flatcar, Bottlerocket, Azure Linux) and debugging node-layer failures in large Kubernetes clusters.

  • Proven track record of upstream contributions to Cluster API, Kubernetes, or related CNCF projects, combined with excellent communication and interpersonal skills.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 14, 2026.

This posting is for an existing vacancy. 

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Skills Required

  • 8 years of experience in systems software, cloud infrastructure, or Kubernetes node engineering
  • Bachelor's or Master's degree in Engineering (Electrical, Computer Engineering, Computer Science) or equivalent experience
  • Deep expertise in Cluster API (CAPI), including provider development and full machine lifecycle
  • Extensive experience with OS image build pipelines, node image packaging, and delivery systems (e.g., image-builder, containerd, cloud-init, Packer)
  • Practical experience with bring-your-own-node models and integrating diverse hardware into live Kubernetes environments, including nodepool lifecycle management and upgrades
  • Strong understanding of kubelet configuration, node bootstrap, and Kubernetes node registration lifecycle
  • Experience with node image security: vulnerability scanning, patch automation, and compliance gating in image build pipelines
  • Proficiency in Golang and/or Python and hands-on experience with at least one major public cloud provider (GCP, AWS, Azure, OCI)
  • Direct experience building or maintaining node image pipelines for a hyperscaler Kubernetes distribution (GKE, EKS, AKS, OKE) or equivalent
  • Experience with supply chain security for node images: image signing, provenance attestation, SBOM generation, CIS benchmarks, automated CVE remediation
  • Experience with automated node provisioning and autoscaling solutions (e.g., Karpenter, GKE NAP) and GPU workload scheduling
  • Operational experience with immutable OS image distributions (Flatcar, Bottlerocket, Azure Linux) and debugging node-layer failures in large clusters
  • Upstream contributions to Cluster API, Kubernetes, or related CNCF projects and strong communication skills

NVIDIA Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about NVIDIA and has not been reviewed or approved by NVIDIA.

  • Equity Value & Accessibility: Equity awards and a discounted ESPP are highlighted as core parts of total compensation, enabling employees to share in the company’s success. Stock-based compensation and the two-year lookback ESPP are consistently described as especially valuable.
  • Healthcare Strength: Health coverage is portrayed as robust, with comprehensive medical, dental, and vision options alongside mental health support and on-site care resources. Employer HSA contributions and wellness perks reinforce the depth of the offering.
  • Retirement Support: Retirement programs are depicted as strong, featuring a meaningful 401(k) match with Roth options and support for Mega Backdoor Roth contributions. These elements position long-term savings as a notable advantage of the total rewards package.

The Company
HQ: Santa Clara, CA
21,960 Employees
Year Founded: 1993

What We Do

NVIDIA’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing — with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, NVIDIA is increasingly known as “the AI computing company.”
