Cisco

Platform Engineer – OpenShift+ AI-ML SRE | 4+ years

Bangalore, Bengaluru Urban, Karnataka, IND

In-Office

Mid level

Cloud • Information Technology • Internet of Things • Professional Services • Software

The Role

Design, deploy, and operate highly available Red Hat OpenShift-based ML platforms supporting LLMs and GPU workloads. Implement SRE practices, automation in Golang/Python, IaC and CI/CD, observability, incident response and RCA, cluster lifecycle management, and 16x5 on-call support while collaborating with global teams.

Summary Generated by Built In

Meet the Team

You will be pivotal in contributing to the team responsible for designing and developing the next generation of scalable Kubernetes infrastructure with machine learning platforms that support both traditional ML and state-of-the-art Large Language Models (LLMs). This is a position for expert engineers where you will lead the technical direction, ensuring the performance, reliability, and scalability of AI systems while collaborating closely with data scientists, researchers, and other engineering teams.

Your Impact

The ideal candidate will have strong hands-on expertise in Red Hat OpenShift, proficiency in Golang and/or Python, and a passion for delivering highly reliable, scalable, and secure infrastructure. Hands-on experience with AI technologies such as Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and GPU frameworks is also expected.

Core Responsibilities

Design, deploy, administer, and optimize highly available Red Hat OpenShift platforms.
Implement and drive Site Reliability Engineering (SRE) practices to ensure platform reliability, scalability, and operational excellence.
Develop automation tools, operators, and platform services using Golang and/or Python.
Manage cluster lifecycle activities including upgrades, patching, capacity planning, and performance tuning.
Build and maintain CI/CD pipelines and Infrastructure as Code (IaC) solutions.
Implement and maintain observability solutions including logging, metrics, tracing, and alerting.
Monitor platform health and proactively identify and resolve reliability and performance issues.
Resolve production incidents, perform root cause analysis (RCA), and drive preventive actions.
Collaborate closely with application and DevOps teams to improve deployment processes and platform adoption.
Ensure platform security, compliance, and adherence to organizational standards and procedures.
Participate in 16×5 on-call support rotation, providing timely response and resolution for production incidents and ensuring service availability.
Continuously evaluate and adopt emerging technologies to enhance platform capabilities and operational efficiency.
Collaborate with global cross-functional teams across regions to support platform initiatives, drive operational excellence, and ensure seamless delivery of services and solutions.
Contribute to the GPU-as-a-Service platform offering and provide client support for hosting GPU-powered AI/ML workloads.

Minimum Qualifications / Requirements

4+ years of experience in Site Reliability Engineering, Platform Engineering, DevOps, or related roles.
Strong hands-on experience with Red Hat OpenShift administration, operations, and troubleshooting.
Proficiency in Golang and/or Python for automation and platform engineering.
Experience with container technologies such as Docker and container runtimes.
Strong understanding of Linux systems, networking, and distributed systems concepts.
Experience with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or equivalent.
Experience with CI/CD tools such as Jenkins, GitLab CI, ArgoCD, Tekton, or similar.
Proven experience with observability tools such as Prometheus, Grafana, ELK, Loki, Jaeger, and OpenTelemetry.
Strong troubleshooting, debugging, and incident management capabilities.
Hands-on experience with AI/ML platforms, Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and GPU architectures.
Experience with AI frameworks such as LangChain, LlamaIndex, or vector databases.
Ability to support and participate in 16×5 on-call rotations for critical production environments.

Preferred Qualifications / Requirements

Familiarity with public cloud platforms (AWS, Azure, or GCP).
Familiarity with GitOps methodologies and tools.
Experience with service mesh technologies such as Istio.
Knowledge of container and platform security standards.
Reliability-first and automation-driven attitude.
Strong analytical and problem-solving skills.
Ability to work effectively in a fast-paced production environment.
Excellent communication and partnership skills.
Ownership, accountability, and a customer-focused approach.

Why Cisco?

At Cisco, we’re revolutionizing how data and infrastructure connect and protect organizations in the AI era – and beyond. We’ve been innovating fearlessly for 40 years to create solutions that power how humans and technology work together across the physical and digital worlds. These solutions provide customers with unparalleled security, visibility, and insights across the entire digital footprint.

Fueled by the depth and breadth of our technology, we experiment and create meaningful solutions. Add to that our worldwide network of doers and experts, and you’ll see that the opportunities to grow and build are limitless. We work as a team, collaborating with empathy to make really big things happen on a global scale. Because our solutions are everywhere, our impact is everywhere.

We are Cisco, and our power starts with you.

Cisco Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about Cisco and has not been reviewed or approved by Cisco.

Healthcare Strength — Comprehensive medical, dental, and vision coverage, mental health support via an EAP, and access to on-site or virtual health centers indicate robust healthcare offerings. Wellness programs, fitness resources, and specialized services further reinforce coverage depth.
Leave & Time Off Breadth — Generous PTO, a global minimum for paid parental leave, and unique programs like company-wide recharge days and paid volunteer time expand time-away options. Additional offerings such as Critical Time Off and adoption assistance add flexibility for life events.
Equity Value & Accessibility — Restricted stock units and a discounted employee stock purchase plan are meaningful elements of total compensation. The prominence of equity can materially augment overall pay packages alongside salary and bonuses.

The Company

HQ: San Jose, CA

77,500 Employees

Year Founded: 1984

What We Do

Cisco (NASDAQ: CSCO) enables people to make powerful connections--whether in business, education, philanthropy, or creativity. Cisco hardware, software, and service offerings are used to create the Internet solutions that make networks possible--providing easy access to information anywhere, at any time. Cisco was founded in 1984 by a small group of computer scientists from Stanford University. Since the company's inception, Cisco engineers have been leaders in the development of Internet Protocol (IP)-based networking technologies. Today, with more than 71,000 employees worldwide, this tradition of innovation continues with industry-leading products and solutions in the company's core development areas of routing and switching, as well as in advanced technologies such as home networking, IP telephony, optical networking, security, storage area networking, and wireless technology. In addition to its products, Cisco provides a broad range of service offerings, including technical support and advanced services. Cisco sells its products and services, both directly through its own sales force as well as through its channel partners, to large enterprises, commercial businesses, service providers, and consumers.