Site Reliability Engineer AIML - Associate

Bengaluru, Karnataka
Hybrid
Mid level
Financial Services
We’re one of the world’s biggest technology-driven companies
The Role
Join a team focusing on enhancing AI system reliability. Responsibilities include developing monitoring systems, collaborating on infrastructure design, and leading incident responses for AI services to ensure performance and availability.
Summary Generated by Built In
Job Description
Are you looking for an exciting opportunity to join a dynamic and growing team in a fast-paced and challenging area? This is a unique opportunity to work in our team and partner with the Business to provide a comprehensive view.
As a Senior AI Reliability Engineer at JPMorgan Chase within the Technology and Operations division, you will join our dynamic team of innovators and technologists. Your mission will be to enhance the reliability and resilience of AI systems that revolutionize how the Bank services and advises clients. You will focus on ensuring the robustness and availability of AI models, deepening client engagements, and promoting process transformation. We seek team members passionate about leveraging advanced reliability engineering practices, AI observability, and incident response strategies to solve complex business challenges through high-quality, cloud-centric software delivery.
Job Responsibilities:
  • Develop and refine Service Level Objectives (including metrics such as accuracy, fairness, latency, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token)) for large language model serving and training systems, balancing availability and latency against development velocity
  • Design, implement and continuously improve monitoring systems including availability, latency and other salient metrics
  • Collaborate in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of high-traffic internal workloads
  • Champion site reliability culture and practices, providing technical leadership and influence within your team and across teams to foster a culture of reliability and resilience
  • Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers
  • Develop AI Incident Response playbooks for AI-specific failures like sudden drift or bias spikes, including automated rollbacks and AI circuit breakers.
  • Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident
  • Build and maintain cost optimization systems for large-scale AI infrastructure, ensuring efficient resource utilization without compromising performance.
  • Engineer for Scale and Security, leveraging techniques like load balancing, caching, optimized GPU scheduling, and AI Gateways for managing traffic and security.
  • Collaborate with ML engineers to ensure seamless integration and operation of AI infrastructure, bridging the gap between development and operations.
  • Implement Continuous Evaluation, including pre-deployment, pre-release, and continuous post-deployment monitoring for drift and degradation.
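
For readers unfamiliar with the latency metrics named in the responsibilities above, the following is a minimal sketch of how TTFT and TPOT might be computed from per-token timestamps and checked against a target. The `StreamTiming` record, the sample timestamps, and the 500 ms threshold are illustrative assumptions, not part of the role description or any JPMorgan Chase system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTiming:
    """Timestamps in seconds for one streamed response (hypothetical record shape)."""
    request_sent: float
    token_times: List[float]  # arrival time of each output token, in order


def ttft(t: StreamTiming) -> float:
    """Time To First Token: delay between sending the request and the first token."""
    return t.token_times[0] - t.request_sent


def tpot(t: StreamTiming) -> float:
    """Time Per Output Token: mean gap between tokens after the first one."""
    n = len(t.token_times)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (t.token_times[-1] - t.token_times[0]) / (n - 1)


def slo_compliance(values: List[float], threshold: float) -> float:
    """Fraction of requests whose metric is at or under the threshold."""
    return sum(v <= threshold for v in values) / len(values)


# Illustrative sample data, not real measurements.
timings = [
    StreamTiming(0.0, [0.25, 0.30, 0.35, 0.40]),
    StreamTiming(0.0, [0.75, 0.85, 0.95]),
]
ttft_values = [ttft(t) for t in timings]
compliance = slo_compliance(ttft_values, threshold=0.5)  # assumed 500 ms TTFT target
```

An SLO would then state a required compliance level over a time window (for example, a given percentage of requests meeting the TTFT target), with the shortfall tracked as an error budget.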

Required qualifications, capabilities, and skills:
  • Demonstrated proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices
  • Proficient knowledge and experience in observability such as white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others
  • Proficient with continuous integration and continuous delivery tools like Jenkins, GitLab, or Terraform
  • Proficient with containers and container orchestration (ECS, Kubernetes, Docker)
  • Experience with troubleshooting common networking technologies and issues
  • Understand the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines
  • Have proven experience implementing and maintaining SLO/SLA frameworks for business-critical services
  • Comfortable working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence)
  • Can effectively bridge the gap between ML engineers and infrastructure teams
  • Have excellent communication skills

Preferred qualifications, capabilities, and skills:
  • Experience with AI-specific observability tools and platforms, such as OpenTelemetry and OpenInference.
  • Familiarity with AI incident response strategies, including automated rollbacks and AI circuit breakers.
  • Knowledge of AI-centric SLOs/SLAs, including metrics like accuracy, fairness, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token).
  • Expertise in engineering for scale and security, including load balancing, caching, optimized GPU scheduling, and AI Gateways.
  • Experience with continuous evaluation processes, including pre-deployment, pre-release, and post-deployment monitoring for drift and degradation.
  • Understand ML model deployment strategies and their reliability implications
  • Have contributed to open-source infrastructure or ML tooling
  • Have experience with chaos engineering and systematic resilience testing
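
As an illustration of the "AI circuit breaker" idea mentioned in the qualifications above, here is a minimal sketch under stated assumptions: the breaker opens after a run of consecutive drift scores above a threshold and routes traffic to a fallback until it is reset. The class name, the threshold values, and the "candidate"/"fallback" labels are hypothetical, and the drift score is assumed to come from some external monitor.

```python
class DriftCircuitBreaker:
    """Opens after consecutive drift breaches; stays open until reset() is called."""

    def __init__(self, drift_threshold: float, max_breaches: int):
        self.drift_threshold = drift_threshold
        self.max_breaches = max_breaches
        self.breaches = 0   # current run of consecutive breaches
        self.open = False   # True means traffic is diverted to the fallback

    def record(self, drift_score: float) -> None:
        """Feed one drift measurement; a healthy reading clears the breach run."""
        if drift_score > self.drift_threshold:
            self.breaches += 1
        else:
            self.breaches = 0
        if self.breaches >= self.max_breaches:
            self.open = True

    def reset(self) -> None:
        """Close the breaker, for example after a rollback has been verified."""
        self.breaches = 0
        self.open = False

    def route(self) -> str:
        """Return which model should serve the next request."""
        return "fallback" if self.open else "candidate"


breaker = DriftCircuitBreaker(drift_threshold=0.2, max_breaches=3)
for score in [0.3, 0.1, 0.3, 0.3, 0.3]:  # illustrative drift scores
    breaker.record(score)
```

Requiring consecutive breaches rather than a single one is a design choice that trades slower reaction for fewer false trips on noisy drift signals.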

Top Skills

Datadog
Docker
Dynatrace
ECS
GitLab
Grafana
Jenkins
Kubernetes
Prometheus
Splunk
Terraform

The Company
HQ: New York, NY
289,097 Employees
Year Founded: 1799

What We Do

JPMorgan Chase & Co. (NYSE: JPM) is a leading global financial services firm with assets of $3.7 trillion and operations worldwide. The firm is a leader in investment banking, financial services for consumers and small businesses, commercial banking, financial transaction processing, and asset management. A component of the Dow Jones Industrial Average, JPMorgan Chase & Co. serves millions of consumers in the United States and many of the world’s most prominent corporate, institutional and government clients under its J.P. Morgan and Chase brands.

Technology fuels every aspect of our company and is at the heart of everything we do. With over 50,000 technologists globally and an annual tech spend of $12 billion, we are dedicated to improving the design, analytics, development, coding, testing and application programming that goes into creating high quality software and new products.

Learn more about technology at our firm, explore resources from our Distinguished Engineers, AI & ML researchers, and other experts; access the latest episode of our TechTrends podcast, and more at www.jpmorgan.com/technology. Information about JPMorgan Chase & Co. is available at www.jpmorganchase.com.

©2023 JPMorgan Chase & Co. All rights reserved. JPMorgan Chase is an Equal Opportunity Employer, including Disability/Veterans.

Why Work With Us

Our technologists work on a diverse range of solutions that include strategic technology initiatives, big data, mobile, electronic payments, machine learning, cybersecurity, enterprise cloud development, and other state-of-the-art technologies.

JPMorganChase Offices

Hybrid Workspace

Employees engage in a combination of remote and on-site work.

Typical time on-site: Flexible
HQ: New York, NY
SG
Bengaluru, Karnataka
Bournemouth, GB
Buenos Aires, AR
Chicago, IL
Dallas, TX
Dublin, IE
Glasgow, GB
Houston, TX
Hyderabad, Telangana
London, GB
Mumbai, Maharashtra
New York, NY
Philadelphia, PA
San Francisco, CA
Tampa, FL
Westerville, OH
Wilmington, DE