Senior Site Reliability Engineer, IaaS and PaaS

Sorry, this job was removed at 02:09 a.m. (CST) on Tuesday, Oct 21, 2025
4 Locations
In-Office or Remote
Artificial Intelligence • Computer Vision • Hardware • Robotics • Metaverse
The Role

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

NVIDIA is looking for a passionate member to join our DGX Cloud Engineering Team as a Sr. Site Reliability Engineer. In this role, you will play a significant part in helping to craft and guide the future of AI & GPUs in the Cloud. NVIDIA DGX Cloud is a cloud platform tailored for AI tasks, enabling organizations to transition AI projects from development to deployment in the age of intelligent AI. Are you passionate about cloud software development and strive for quality? Do you pride yourself in building cloud-scale software systems? If so, join our team at NVIDIA, where we are dedicated to delivering GPU-powered services around the world!

What you'll be doing:

You will play a crucial role in the success of the Omniverse on DGX Cloud platform by helping to build our deployment infrastructure processes, creating world-class SRE measurements and automation tools that improve operational efficiency, and maintaining a high standard of service operability and reliability.

  • Design, build, and implement scalable cloud-based systems for PaaS/IaaS.

  • Work closely with other teams on new products or features/improvements of existing products.

  • Develop, maintain and improve cloud deployment of our software.

  • Participate in the triage and resolution of complex infrastructure-related issues.

  • Collaborate with developers, QA, and Product teams to establish, refine, and streamline our software release process and software observability, ensuring service operability, reliability, and availability.

  • Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces.

  • Develop, maintain, and improve automation tools that increase the efficiency of SRE operations.

  • Practice balanced incident response and blameless postmortems.

  • Be part of an on-call rotation to support production systems.

What we need to see:

  • BS or MS in Computer Science or equivalent program (or equivalent experience).

  • 8+ years of hands-on software engineering or equivalent experience.

  • Experience programming with Go, Python, and React.

  • Demonstrate understanding of cloud design in the areas of virtualization and global infrastructure, distributed systems, and security.

  • Expertise in Kubernetes (K8s) & KubeVirt and building RESTful web services.

  • Understanding of building agentic AI solutions, preferably with NVIDIA open-source AI solutions.

  • Demonstrated working experience with SRE principles such as metrics emission for observability, monitoring, and alerting using logs, traces, and metrics.

  • Hands-on experience with Docker, containers, and Infrastructure as Code tools such as Terraform, as well as deployment CI/CD.

  • Knowledge of working with CSPs, for example AWS (Fargate, EC2, IAM, ECR, EKS, Route53, etc.) and Azure.

Ways to stand out from the crowd:

  • Expertise in technologies such as StackStorm, OpenStack, Red Hat OpenShift, and AI DBs like Milvus.

  • A track record of solving complex problems with elegant solutions.

  • Demonstrate delivery of complex projects in previous roles.

  • Demonstrated ability to develop frontend applications using concepts such as SSA and RBAC.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 168,000 USD - 270,250 USD for Level 4, and 208,000 USD - 333,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until October 18, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

The Company
HQ: Santa Clara, CA
21,960 Employees
Year Founded: 1993

What We Do

NVIDIA’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing — with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, NVIDIA is increasingly known as “the AI computing company.”
