The Role
- The Associate Manager - Reliability Operations leads a team to rigorously uphold service level objectives (SLOs) through expert alert management, SOP-compliant ticket escalations, and coordinated support for SRE-signed deployments across multiple sites.
- This role drives operational accountability, fosters seamless SRE partnerships, and ensures production stability in a high-stakes 24x7 SaaS environment.
Responsibilities
- Drives SLO adherence by implementing advanced metric monitoring, enforcing error budgets, and spearheading proactive initiatives to prevent breaches and elevate system reliability.
- Ensures all alerts are acknowledged immediately and that any issue lacking a defined SOP is escalated by ticket to the SRE teams, while systematically reducing escalations, downtime, and MTTR over time.
- Coordinates standard deployments across sites following SRE sign-off, overseeing logistics, real-time rollout health monitoring, and rigorous post-deployment SLO validation.
- Collaborates strategically with SRE teams on deployment planning, comprehensive risk assessments, troubleshooting, and post-release optimizations for flawless execution and rapid recovery.
- Oversees and refines team processes for alert triage, SOP documentation/updates, and knowledge sharing, integrating automation to minimize manual toil and enhance operational resilience.
- Mentors staff on SLO-driven decision-making, conducts in-depth audits of alert/ticket workflows, analyses trends in operational data, and delivers actionable reliability KPI reports to stakeholders.
Skills
- Proven track record in 24x7 SaaS/cloud support operations, handling high-pressure incidents and customer-impacting events.
- Strong proficiency in monitoring/incident tools (Prometheus, Grafana, Splunk, PagerDuty) and ticketing systems.
- Effective leadership and people management, with excellent communication for technical/non-technical collaboration.
- Analytical skills to interpret operational data, identify trends, and drive process recommendations.
Experience and Qualifications
- Familiarity with ITIL frameworks, SRE principles (e.g., error budgets, toil reduction), and cloud platforms (AWS, Azure, GCP).
- Experience with process improvement methodologies and shift handoff protocols.
- Knowledge of basic reliability concepts and observability stacks.
- Education: Bachelor's degree in Information Technology, Business, or related field; relevant IT certifications (e.g., ITIL Foundation) are a plus.
- Experience: 6-8 years in operations support, reliability operations, or IT service management, including 2+ years in supervisory roles managing 24x7 teams.
Shift Information
- 24x7 Operational Oversight: This role carries on-call and shift responsibilities for escalations and provides oversight of 24x7 team operations, including shift scheduling and off-hour incident coordination.
Skills Required
- 6-8 years in operations support or IT service management
- 2+ years in supervisory roles managing 24x7 teams
- Bachelor's degree in Information Technology, Business, or related field
- Familiarity with ITIL frameworks and SRE principles
- Strong proficiency in monitoring and incident tools
What We Do
Founded in 2015, Zeta is a provider of a next-gen credit card processing platform. Zeta's cloud-native and fully API-enabled stack offers a comprehensive range of capabilities, including processing, issuing, lending, core banking, fraud detection, and loyalty programs. With a strong focus on technology, Zeta has more than 1,700 employees and contractors, over 70% of whom work in technology roles. Operating across the US, UK, Middle East, and Asia, Zeta has served a global customer base of 35+ clients who have issued over 15 million cards on Zeta's platform to date. Backed by prominent investors such as Softbank Vision Fund 2 and Mastercard, Zeta has raised $280 million at a valuation of $1.5 billion.






