NVIDIA is looking for an outstanding engineering lead to join its Software Infrastructure and Operations team. The position will be part of a fast-paced crew that develops and maintains sophisticated Kubernetes-based development, compute, and test environments for a multitude of platforms, including Windows and Linux, using OSS CI/CD tools such as GitHub, GitLab, and Jenkins. You will be working with a team of passionate and skilled engineers who are continuously working to provide better tools to build and manage this infrastructure. With your help, we will forge the next generation of compute infrastructure, multiplying the power of the CPU, GPU, and DPU for the age of AI. We need a motivated, hardworking, and focused individual who has a real passion for operational excellence, infrastructure services, and automation.
What you’ll be doing:
Architect the scaling operation in our data centers. Deploy and support end-to-end container management solutions with Kubernetes, Docker, and containerd. Design solutions with service discovery, networking, monitoring, logging, and scheduling in Kubernetes.
Manage end-to-end OSS CI/CD tools (GitLab, GitHub, Jenkins) in an on-prem Kubernetes environment. Design and develop the tools needed to automate CI/CD and developer workflows.
Design and build sophisticated automations and AI-powered applications.
Apply your depth in algorithms and your system software background!
Work in teams to deploy new data center infrastructure.
Plan and implement tracking of critical metrics using data analytics and data mining methods and dashboards.
Reuse AI techniques to extract useful signals about machines and jobs from the data generated!
Take part in prototyping, crafting, and developing cloud infrastructure for NVIDIA.
What we need to see:
Strong Kubernetes understanding and background, especially with on-premises setups, and extensive experience with Kubernetes components and subsystems.
Experience maintaining large-scale on-prem infrastructure applications and OSS CI/CD tools using Kubernetes.
Proven programming background in Python, Golang, Java, and/or relevant scripting languages.
Excellent debugging and analytical skills, and experience with both SQL (MySQL) and NoSQL (Elasticsearch, MongoDB) databases.
Proficient with configuration management tools like Ansible, Chef, Puppet and strong experience with Jenkins and/or other CI systems.
Hands-on experience with VMs, Docker, and Kubernetes clusters.
Experience with analytics/visualization tools like Kibana, Grafana, and Splunk; experience with monitoring systems such as Zabbix and/or Nagios is nice to have.
5+ years of proven experience
Bachelor's or Master's degree in CS, Software Engineering, or a related field, or equivalent experience.
Ways to stand out from the crowd:
Previous experience with DevOps/SRE teams
Thrives in a multitasking environment with constantly evolving priorities, and documents work well
Outstanding collaboration skills across organizational boundaries, experience using and improving data centers, and the ability to choose the best possible algorithms to meet the scaling challenge
Ability to divide complex problems into simple subproblems and then reuse available solutions to implement most of them
Experience with designing simple systems that can work reliably without needing much support
Skills Required
- Strong Kubernetes understanding and on-premises setup experience with Kubernetes components and subsystems.
- Experience maintaining large-scale on-prem infrastructure applications and OSS CI/CD tools using Kubernetes.
- Proven programming background in Python, Golang, Java or relevant scripting languages.
- Debugging and analytical skills and experience with SQL (MySQL) and NoSQL (Elasticsearch, MongoDB).
- Proficient with configuration management tools (Ansible, Chef, Puppet) and strong experience with Jenkins or other CI systems.
- Hands-on experience with VMs, Docker, and Kubernetes clusters.
- Experience with analytics/visualization tools (Kibana, Grafana, Splunk) and monitoring systems (Zabbix, Nagios).
- 5+ years of proven experience.
- Bachelor's or Master's degree in CS, Software Engineering, or equivalent experience.
- Previous experience with DevOps/SRE teams.
NVIDIA Compensation & Benefits Highlights
The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about NVIDIA and has not been reviewed or approved by NVIDIA.
- Equity Value & Accessibility: Equity awards and a discounted ESPP are highlighted as core parts of total compensation, enabling employees to share in the company’s success. Stock-based compensation and the two-year lookback ESPP are consistently described as especially valuable.
- Healthcare Strength: Health coverage is portrayed as robust, with comprehensive medical, dental, and vision options alongside mental health support and on-site care resources. Employer HSA contributions and wellness perks reinforce the depth of the offering.
- Retirement Support: Retirement programs are depicted as strong, featuring a meaningful 401(k) match with Roth options and support for Mega Backdoor Roth contributions. These elements position long-term savings as a notable advantage of the total rewards package.
NVIDIA Insights
What We Do
NVIDIA’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing — with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, NVIDIA is increasingly known as “the AI computing company.”