AI Senior Staff Systems Engineer

San Jose, CA, USA
In-Office
137K-254K Annually
Senior level
Artificial Intelligence • Cloud • Hardware • Software • Semiconductor
The Role
Lead the development and management of AI infrastructure, ensuring optimal performance of systems and providing mentorship to engineers while deploying advanced AI models.
Summary Generated by Built In
At Cadence, we hire and develop leaders and innovators who want to make an impact on the world of technology.

We are seeking a highly skilled and experienced AI Systems Engineer to join our team. This is a hands-on, senior individual contributor role that will be pivotal in leading the development, operations, and support of our entire AI infrastructure. You will be responsible for the entire lifecycle of our AI systems, from architecting and building high-performance GPU clusters to deploying and optimizing our most advanced AI models and agentic services.

Responsibilities 

  • AI Infrastructure Architecture & Strategy: Lead the design and implementation of our next-generation AI infrastructure to support our Agentic AI initiatives. You will define the technical strategy for our on-premise GPU clusters, storage solutions, and networking to ensure optimal performance, scalability, and reliability for all our AI workloads.

  • Cloud AI Service Integration: Support and secure the use of public cloud AI services, including Azure OpenAI services and Google Cloud Platform (GCP) services like Gemini. This includes managing secure access, monitoring usage, and tracking billing to ensure cost-effectiveness. You will also provide hands-on support for compute, GPUs, and AI services on both GCP and Azure.

  • Hands-on GPU Cluster Management: Take a leadership role in the configuration, installation, and optimization of GPU server clusters. This includes advanced troubleshooting of hardware and software, performance tuning, and implementing best practices for cluster utilization and resource management. You will be an expert in administering job schedulers like LSF in a production environment, including integration with Docker for containerized job submission.

  • Full-Stack AI Tech Stack Development & Operations: Architect and deploy a robust and scalable AI tech stack. You will be responsible for the end-to-end operational lifecycle, including setting up and managing deep learning frameworks (PyTorch, TensorFlow), containerization with Docker and Kubernetes, and implementing CI/CD pipelines for AI model development.

  • Advanced LLM Deployment & Optimization: Lead the deployment, serving, and optimization of Large Language Models (LLMs). You will be an expert in techniques such as model quantization, distillation, and using high-performance serving frameworks (e.g., vLLM, TGI, TensorRT-LLM) to maximize inference throughput and minimize latency.

  • Agentic AI Workflow & Service Engineering: Architect and build production-grade Agentic AI workflows and services. You will be responsible for the technical design and implementation of systems that integrate LLMs with external tools, APIs, and databases, and will mentor other engineers on building robust and scalable AI agent applications.

  • Automation & Monitoring: Develop and maintain automation scripts using languages like Python, Bash, or Perl to streamline system maintenance, deployment, and reporting. Implement and manage monitoring solutions for system health, job statuses, GPU utilization, and container performance to proactively identify and resolve issues.

  • AI Systems Support & Mentorship: Act as the final escalation point for the most complex technical issues related to our AI infrastructure. You will also serve as a technical leader and mentor to other engineers, providing guidance on best practices in AI systems engineering, performance tuning, and operational excellence.

  • Security and Compliance: Develop and implement security best practices for our AI systems and data, ensuring compliance with relevant regulations and protecting our intellectual property.

Required Skills and Qualifications 

  • 10+ years of experience in a senior technical role, with at least 5 years focused on building and operating high-performance computing or AI infrastructure. Proven track record as a Principal or Senior Staff Engineer.

  • Expert-level knowledge of NVIDIA GPU architecture and technologies like CUDA and cuDNN. Extensive experience with multi-GPU and multi-node training and inference.

  • Proven experience with public cloud AI services, specifically managing access, usage, and billing for Azure OpenAI and Google Cloud Platform (GCP) services.

  • Extensive hands-on experience with Docker: image management, container orchestration, and troubleshooting.

  • Proficiency in scripting languages such as Python, Bash, or Perl.

  • Deep expertise in Linux system administration (RHEL preferred), including networking, storage, and performance tuning. 

  • Familiarity with user authentication and integration using systems like LDAP or Active Directory.

  • Strong problem-solving and communication skills with the ability to work in a multi-platform, cross-functional, and geographically distributed team.

Preferred/Bonus Skills 

  • Understanding of AI job profiling and tuning (memory, GPU, I/O).

  • Experience administering LSF clusters in a production or research environment. Familiarity with other job schedulers like Slurm is a plus.

  • Experience with LSF Docker integration and job submission using container images.

  • Experience with macOS/Apple Silicon system administration tasks and troubleshooting.

The annual salary range for California is $136,500 to $253,500. You may also be eligible to receive incentive compensation: bonus, equity, and benefits. Sales positions generally offer a competitive On Target Earnings (OTE) incentive compensation structure. Please note that the salary range is a guideline and compensation may vary based on factors such as qualifications, skill level, competencies and work location. Our benefits programs include: paid vacation and paid holidays, 401(k) plan with employer match, employee stock purchase plan, a variety of medical, dental and vision plan options, and more.

We’re doing work that matters. Help us solve what others can’t.

Skills Required

  • 10+ years of experience in a senior technical role
  • 5+ years focused on building and operating high-performance computing or AI infrastructure
  • Expert-level knowledge of NVIDIA GPU architecture
  • Extensive experience with public cloud AI services, specifically Azure OpenAI and GCP
  • Hands-on experience with Docker
  • Proficiency in scripting languages like Python, Bash, or Perl
  • Deep expertise in Linux system administration (RHEL preferred)

Cadence Design Systems Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about Cadence Design Systems and has not been reviewed or approved by Cadence Design Systems.

  • Equity Value & Accessibility: A discounted ESPP with a lookback feature and equity included in total compensation make ownership broadly accessible and potentially meaningful. Structured compensation at an industry leader adds predictability to equity participation.
  • Healthcare Strength: Medical, dental, and vision coverage are described as solid, with mental-health/EAP and fertility support enhancing the offering. The breadth across core care and family-building needs strengthens the healthcare package.
  • Leave & Time Off Breadth: Global Recharge Days, volunteer time off, and companywide breaks indicate a comprehensive time-off framework. In addition, many salaried roles are described as having flexible or generous PTO policies.

The Company
HQ: San Jose, CA
8,216 Employees
Year Founded: 1988

What We Do

Cadence enables electronic systems and semiconductor companies to create the innovative end products that are transforming the way people live, work and play. Cadence® software, hardware and IP are used by customers to deliver products to market faster. The company's Intelligent System Design strategy helps customers develop differentiated products—from chips to boards to intelligent systems—in mobile, consumer, cloud, data center, automotive, aerospace, IoT, industrial and other market segments. Cadence is listed as one of Fortune Magazine's 100 Best Companies to Work For.
