Principal Engineer, AIOps

Posted 2 Days Ago
Santa Clara, CA
Expert/Leader
Artificial Intelligence • Hardware • Robotics • Software • Metaverse
The Role
The Principal Engineer in AIOps will design, develop, and deploy AI-powered solutions for IT operations, working closely with engineers and data scientists. Responsibilities include applying AI techniques like machine learning and natural language processing to enhance IT infrastructure, leading product design and development, integrating AIOps tools, and ensuring customer satisfaction through effective collaboration.
Summary Generated by Built In

We are looking for an AIOps Principal Engineer who can design, develop, and deploy AI-powered solutions for IT operations. You will work with a team of engineers, data scientists, and domain experts to create and implement innovative applications that leverage NVIDIA's Observability, Infrastructure and Gen AI platforms. You will also collaborate with internal and external customers to understand their needs, define requirements, and deliver high-quality products.

What you'll be doing:

  • Lead the design, development, testing, and deployment of the AIOps platform.

  • Apply machine learning, deep learning, natural language processing, and other AI techniques to solve IT operations challenges such as anomaly detection, root cause analysis, incident management, and automation.

  • Improve IT Infrastructure and Operations Management by defining and measuring AIOps metrics such as accuracy, reliability, scalability, performance, and efficiency.

  • Implement observability principles and practices such as monitoring, logging, tracing, and alerting.

  • Apply deep knowledge of data science engineering, including data collection, data cleaning, data analysis, data modeling, and data visualization.

  • Integrate AIOps tools with IT operations management (ITOM) and IT service management (ITSM) systems, including service desk, change management, and configuration management.

  • Demonstrate solid leadership skills and the ability to lead and empower engineers and data scientists.

  • Design and communicate the AIOps roadmap, vision, and strategy to the team and partners.

  • Collaborate effectively with customers, such as IT managers, business users, vendors, and partners, to ensure alignment and satisfaction.

  • Play a pivotal role in harnessing AI, generative AI, and machine learning for NVIDIA IT teams.

What we need to see:

  • Bachelor's degree or higher in computer science, engineering, or a related field (or equivalent experience).

  • 15+ years of industry experience in extensive engineering projects, with a particular emphasis on infrastructure automation, distributed systems, and tool development for managing large-scale private or public cloud systems.

  • 5+ years of experience working with AIOps technologies and platforms, and a solid understanding of them.

  • Proficiency in Python and in AI frameworks and libraries such as TensorFlow or PyTorch.

  • Proficiency in Python and Go programming; your coding and debugging expertise is pivotal to your success in this role.

  • Demonstrated commitment to sound software engineering principles and a strong willingness to acquire new skills.

  • Experience in working with IT systems, tools, and processes such as ITSM, ITOM, monitoring, logging, and alerting.

  • Ability to work independently and collaboratively in a fast-paced and dynamic environment.

  • Hands-on experience designing and implementing the end-to-end architecture and large-scale rollout of an AIOps product.

  • Experience developing Gen AI applications that use LLMs and RAG for incident diagnosis, root cause identification, and incident resolution.

Ways to stand out from the crowd:

  • Proficiency in developing and deploying generative AI solutions such as language models, chatbots, and conversational assistants.

  • Hands-on experience integrating workflow automation tools with AIOps for incident resolution and self-healing.

  • Deep background and understanding of Machine Learning: developing, training, and applying machine learning models across large operational datasets.

  • Experience with pre-training and fine-tuning LLMs and working with ML frameworks such as scikit-learn, XGBoost, PyTorch, and TensorFlow.

  • Hands-on experience with AIOps platforms such as BigPanda, Datadog, Moogsoft, ITOM Health, Splunk, Elastic Stack, Dynatrace, and New Relic.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're a creative individual who thrives on achieving goals and enjoys a dynamic learning environment, then why not seize this opportunity? Apply today!

The base salary range is 248,000 USD - 385,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Top Skills

Go
Python
The Company
HQ: Santa Clara, CA
21,960 Employees
On-site Workplace
Year Founded: 1993

What We Do

NVIDIA’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing — with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, NVIDIA is increasingly known as “the AI computing company.”
