As a Senior Technical Program Manager with a passion for data-driven operations, you will lead the DGX Cloud Fleet Health reporting program — delivering real-time, actionable insights on the availability and reliability of our GPU fleet. A core focus of this role is advancing Mean-Time-Between-Interruption (MTBI): understanding the root causes of fleet interruptions, surfacing patterns in the data, and driving cross-functional programs to measurably extend fleet uptime. You will partner closely with Capacity Operations, Infrastructure, SRE, and Engineering teams to translate complex fleet signals into decisions that directly improve customer experience. Join us in making a significant impact on the world's most powerful AI infrastructure.
What You’ll Be Doing:
Define and own the metrics framework for measuring fleet health, reliability, and MTBI across a diverse and rapidly scaling GPU fleet.
Lead hands-on data investigations — querying telemetry, correlating failure signals, and building statistical models — to identify the root causes of interruptions and quantify their impact.
Own and drive execution of cross-functional MTBI improvement programs end-to-end — from translating analytical findings into a prioritized roadmap, to holding teams accountable to milestones and delivering measurable reliability gains.
Build and maintain dashboards, automated anomaly detection, and alerting frameworks that surface gaps in fleet health reporting in real time.
Anticipate and close reporting gaps with new cloud providers and hardware platforms by working closely with Infrastructure bring-up teams.
Communicate complex data findings and program status clearly to senior leadership, turning raw signals into crisp narratives and recommendations.
What We Need to See:
8+ years of Technical Program Management experience, with at least 3 years in infrastructure, platform, or reliability-focused domains.
Strong hands-on data analytics skills — comfortable writing SQL, working with large telemetry datasets, and building dashboards (Grafana, Superset, Databricks, or equivalent).
Demonstrated ability to define and operationalize reliability metrics (MTBI, MTTR, availability SLAs) and drive engineering teams toward measurable improvements.
Proven ability to lead deep-dive investigations across ambiguous, multi-system problems and translate findings into long-term solutions.
Excellent executive communication skills — able to distill complex technical findings into clear, decision-ready narratives for senior leadership.
MS in EE, CS, or equivalent experience.
Ways to stand out from the crowd:
Familiarity with NVIDIA GPU architectures and DGX/HGX infrastructure.
Experience with Databricks, Apache Spark, or other large-scale data processing platforms.
Hands-on experience with Grafana, Superset, or similar observability/BI tooling.
Background in cloud-native infrastructure, Kubernetes, or large-scale distributed systems.
You will also be eligible for equity and benefits.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
Skills Required
- 8+ years of Technical Program Management experience with at least 3 years in infrastructure, platform, or reliability domains.
- Strong hands-on data analytics skills, including writing SQL and working with large telemetry datasets.
- Experience building dashboards and observability tooling (Grafana, Superset, Databricks, or equivalent).
- Demonstrated ability to define and operationalize reliability metrics (MTBI, MTTR, availability SLAs).
- Proven ability to lead deep-dive investigations across ambiguous, multi-system problems and drive long-term solutions.
- Excellent executive communication skills for presenting complex technical findings to senior leadership.
- MS in EE, CS, or equivalent experience.
- Familiarity with NVIDIA GPU architectures and DGX/HGX infrastructure.
- Experience with Apache Spark or other large-scale data processing platforms.
- Background in cloud-native infrastructure, Kubernetes, or large-scale distributed systems.
NVIDIA Compensation & Benefits Highlights
The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about NVIDIA and has not been reviewed or approved by NVIDIA.
- Equity Value & Accessibility: Equity awards and a discounted ESPP are highlighted as core parts of total compensation, enabling employees to share in the company’s success. Stock-based compensation and the two-year lookback ESPP are consistently described as especially valuable.
- Healthcare Strength: Health coverage is portrayed as robust, with comprehensive medical, dental, and vision options alongside mental health support and on-site care resources. Employer HSA contributions and wellness perks reinforce the depth of the offering.
- Retirement Support: Retirement programs are depicted as strong, featuring a meaningful 401(k) match with Roth options and support for Mega Backdoor Roth contributions. These elements position long-term savings as a notable advantage of the total rewards package.
NVIDIA Insights
What We Do
NVIDIA’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing — with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, NVIDIA is increasingly known as “the AI computing company.”