JPMorganChase

Site Reliability Engineer III

Sorry, this job was removed at 04:57 p.m. (CST) on Monday, Jan 26, 2026

Bengaluru, Bengaluru Urban, Karnataka

Hybrid

Financial Services

We’re one of the world’s biggest technology-driven companies

The Role

Job Description
There's nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems.
As a Site Reliability Engineer III at JPMorgan Chase within Asset & Wealth Management, you will solve complex and broad business problems with simple and straightforward solutions. Through code and cloud infrastructure, you will configure, maintain, monitor, and optimize applications and their associated infrastructure to independently decompose and iteratively improve on existing solutions. You are a significant contributor to your team by sharing your knowledge of end-to-end operations, availability, reliability, and scalability of your application or platform.
Job responsibilities

Develops and refines Service Level Objectives (including metrics like accuracy, fairness, latency, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token)) for large language model serving and training systems, balancing availability/latency with development velocity
Designs, implements, and continuously improves monitoring systems covering availability, latency, and other salient metrics
Collaborates in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of high-traffic internal workloads
Champions site reliability culture and practices, providing technical leadership and influence across teams to foster a culture of reliability and resilience
Develops and manages automated failover and recovery systems for model serving deployments across multiple regions and cloud providers
Develops AI Incident Response playbooks for AI-specific failures like sudden drift or bias spikes, including automated rollbacks and AI circuit breakers.
Leads incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident
Builds and maintains cost optimization systems for large-scale AI infrastructure, ensuring efficient resource utilization without compromising performance.
Engineers for Scale and Security, leveraging techniques like load balancing, caching, optimized GPU scheduling, and AI Gateways for managing traffic and security.
Collaborates with ML engineers to ensure seamless integration and operation of AI infrastructure, bridging the gap between development and operations.
Implements Continuous Evaluation, including pre-deployment, pre-release, and continuous post-deployment monitoring for drift and degradation.

Required qualifications, capabilities, and skills

Formal training or certification on software engineering concepts and 3+ years applied experience
Demonstrated proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices
Proficiency in observability, including white-box and black-box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others
Proficiency with continuous integration and continuous delivery tools such as Jenkins, GitLab, or Terraform
Proficiency with containers and container orchestration (ECS, Kubernetes, Docker)
Experience with troubleshooting common networking technologies and issues
Understanding of the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines
Proven experience implementing and maintaining SLO/SLA frameworks for business-critical services
Comfort working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence)
Ability to effectively bridge the gap between ML engineers and infrastructure teams

Preferred qualifications, capabilities, and skills

Experience with AI-specific observability tools and platforms, such as OpenTelemetry and OpenInference.
Familiarity with AI incident response strategies, including automated rollbacks and AI circuit breakers.
Knowledge of AI-centric SLOs/SLAs, including metrics like accuracy, fairness, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token).
Expertise in engineering for scale and security, including load balancing, caching, optimized GPU scheduling, and AI Gateways.
Experience with continuous evaluation processes, including pre-deployment, pre-release, and post-deployment monitoring for drift and degradation.

The Company

HQ: New York, NY

289,097 Employees

Year Founded: 1799

What We Do

JPMorgan Chase & Co. (NYSE: JPM) is a leading global financial services firm with assets of $3.7 trillion and operations worldwide. The firm is a leader in investment banking, financial services for consumers and small businesses, commercial banking, financial transaction processing, and asset management. A component of the Dow Jones Industrial Average, JPMorgan Chase & Co. serves millions of consumers in the United States and many of the world’s most prominent corporate, institutional and government clients under its J.P. Morgan and Chase brands.

Technology fuels every aspect of our company and is at the heart of everything we do. With over 50,000 technologists globally and an annual tech spend of $12 billion, we are dedicated to improving the design, analytics, development, coding, testing and application programming that goes into creating high quality software and new products.

Learn more about technology at our firm, explore resources from our Distinguished Engineers, AI & ML researchers, and other experts, and access the latest episode of our TechTrends podcast and more at www.jpmorgan.com/technology. Information about JPMorgan Chase & Co. is available at www.jpmorganchase.com.

©2023 JPMorgan Chase & Co. All rights reserved. JPMorgan Chase is an Equal Opportunity Employer, including Disability/Veterans.

Why Work With Us

Our technologists work on a diverse range of solutions that include strategic technology initiatives, big data, mobile, electronic payments, machine learning, cybersecurity, enterprise cloud development, and other state-of-the-art technologies.