Engineer II, Site Reliability

Posted 4 Days Ago
2 Locations
In-Office or Remote
Junior
Information Technology
The Role
The Site Reliability Engineer will enhance platform reliability and scalability, manage service health, and automate operational workflows, ensuring exceptional system performance and customer experience in a regulated environment.
Summary Generated by Built In

Job Description: Site Reliability Engineer (SRE)
Role Overview
The Software Engineer / Site Reliability Engineer (SRE) will play a critical role in driving reliability, scalability, and performance for the Banking Solutions, Payments, and Capital Markets platforms. This role blends core SRE principles, performance engineering, and service health management to support large-scale, mission-critical systems.
The ideal candidate will help modernize platforms through automation-first practices, data-driven reliability metrics, and proactive performance optimization, ensuring exceptional customer experience and business continuity in a highly regulated environment.

What You Will Be Doing
Core SRE & Reliability Engineering

Design, implement, and operate highly available, resilient, and scalable systems aligned with SRE best practices.
Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets to balance reliability and delivery velocity.
Build and maintain service health dashboards to provide real-time visibility into platform stability and customer experience.
Reduce toil through extensive automation of operational workflows, alerts, and remediation activities.

Monitoring, Observability & Service Health

Design and maintain end-to-end monitoring and observability solutions covering infrastructure, applications, APIs, and user journeys.
Implement advanced alerting strategies to reduce noise and improve mean time to detect (MTTD) and mean time to resolution (MTTR).
Leverage metrics, logs, and traces to drive root cause analysis and proactive incident prevention.
Enable reliability reporting for stakeholders using SLO compliance and service health metrics.

Performance Engineering & Testing

Lead performance engineering initiatives, including load testing, stress testing, endurance testing, and capacity validation.
Identify performance bottlenecks across application, middleware, database, and infrastructure layers.
Conduct capacity planning and performance tuning to support business growth and peak traffic scenarios.
Partner with development and QA teams to embed performance testing into CI/CD pipelines.

Incident Management & Operations

Lead and participate in incident response activities, including triage, mitigation, recovery, and post-incident reviews.
Drive blameless post-mortems and ensure corrective actions are tracked to completion.
Participate in on-call rotations, providing 24x7 support for critical production systems.
Continuously improve operational readiness and resilience.

Automation, CI/CD & Cloud Operations

Design and manage deployment pipelines, configuration management, and environment consistency across lower and production environments.
Implement Infrastructure as Code (IaC) practices for repeatable and secure cloud provisioning.
Collaborate with DevOps teams to improve deployment reliability, rollback mechanisms, and release safety.
Develop and test disaster recovery plans, backup strategies, and failover mechanisms.

Collaboration & Governance

Work closely with Development, QA, DevOps, Security, and Product teams to align on reliability and performance goals.
Ensure platforms meet security, compliance, and regulatory requirements common in financial services.
Act as a reliability and performance advocate throughout the SDLC.

What You Bring
Required Skills & Experience

Strong experience in Core SRE practices, including reliability engineering, incident management, and automation.
Proven hands-on experience in Performance Engineering / Performance Testing for large-scale distributed systems.
Deep understanding and implementation experience with SLI / SLO / Error Budget frameworks.
Proficiency in cloud platforms (AWS, Azure, or Google Cloud).
Hands-on experience with containerization and orchestration (Docker, Kubernetes).
Strong background in monitoring, observability, and logging, with tools such as Prometheus, Grafana, Datadog, Splunk, and the ELK Stack.
Experience with CI/CD pipelines (Jenkins, GitLab CI/CD, Azure DevOps).
Proficiency in scripting and automation using Python, Bash, Terraform, Ansible.
Strong troubleshooting skills across application, infrastructure, and network layers.
Experience designing and running incident response and post-mortem reviews.
Ownership mindset with accountability for service reliability and customer outcomes.
Excellent communication, collaboration, and stakeholder management skills.

Nice to Have (SRE+ Skills)

Experience with Keptn or similar tools for automated SLO-based quality gates and continuous delivery.
Programming experience in Java, especially for debugging, performance profiling, or building automation tools.
Familiarity with chaos engineering practices and tools.
Experience working in banking, payments, or capital markets domains.
Knowledge of security best practices and regulatory compliance in enterprise environments.

About Us
At Zensar, we’re “experience-led everything”. We are committed to conceptualizing, designing, engineering, marketing, and managing digital solutions and experiences for over 130 leading enterprises. We are a company driven by a bold purpose: Together, we shape experiences for better futures. Whether for our clients, our people, or the world around us, this belief powers everything we do. At the heart of our culture is ONE with Client - a set of four core values that reflect who we are and how we work: One Zensar, Nurturing, Empowering, and Client Focus.
Part of the $4.8 billion RPG Group, we’re a community of 10,000+ innovators across 30+ global locations, including Milpitas, Seattle, Princeton, Cape Town, London, Zurich, Singapore, and Mexico City. Explore Life at Zensar and join us to Grow. Own. Achieve. Learn. to be the best version of yourself.
We believe the best work happens when individuality is celebrated, growth is encouraged, and well-being is prioritized. We are an equal employment opportunity (EEO) and affirmative action employer, committed to creating an inclusive workplace. All qualified applicants will be considered without regard to race, creed, color, ancestry, religion, sex, national origin, citizenship, age, sexual orientation, gender identity, disability, marital status, family medical leave status, or protected veteran status.

Top Skills

Ansible
AWS
Azure
Azure DevOps
Bash
Datadog
Docker
ELK Stack
GitLab CI/CD
GCP
Grafana
Java
Jenkins
Kubernetes
Prometheus
Python
Splunk
Terraform

The Company
HQ: Pune, Maharashtra
10,000 Employees
Year Founded: 2001

What We Do

Zensar is a leading experience, engineering, and technology solutions company. We conceptualize, build, and manage digital products for Forbes Global 2000 clients across the hi-tech engineering, banking and financial services, insurance, manufacturing and consumer services verticals. With proven excellence across five core areas, including experience services, advanced engineering services, data engineering and analytics, foundation services, and application services, our solutions leverage industry-leading platforms to help our clients be competitive, agile, and disruptive while moving with velocity through change and opportunity. Zensar’s expansive ecosystem of 60+ technology partners, including Oracle, Salesforce, SAP, Guidewire, Automation Anywhere, Adobe, and UiPath, enables us to deliver comprehensive solutions to clients, facilitating seamless integration and allowing them to leverage cutting-edge technologies and tools for enhanced business outcomes. Zensar is part of the USD 4.4 billion RPG Group. With headquarters in Pune, India, our 10,500+ employees, representing over 50 nationalities, work from 30+ locations across North America, UK/Europe, and South Africa. Visit us at www.zensar.com
