About the Role
- The Senior Associate, Reliability Operations role is critical to ensuring the continuous, reliable, and secure operation of our SaaS products in a 24x7 support capacity. This role involves proactive monitoring, incident response, and collaboration with teams across the organization to maintain optimal service levels. The Senior Associate will participate in a rotating shift schedule to ensure high availability, rapid issue resolution, and support for key reliability initiatives. The Senior Associate will also serve as a key escalation point, mentor junior team members, and lead critical efforts to optimize operational workflows and systems.
Responsibilities
- 24x7 Monitoring and Support: Oversee the health, performance, and availability of cloud-based SaaS infrastructure and applications using monitoring tools like Prometheus and Grafana, and respond to alerts during assigned shifts. Align with and adhere to organizational processes to maintain SLAs.
- Incident Management: Act as the first responder in a 24x7 rotation, managing and mitigating service disruptions, following standard incident procedures, and escalating issues to SMEs as needed.
- Deployments and Change Management: Manage the deployment lifecycle of applications. Proactively engage with SMEs to resolve issues or challenges in the deployment process.
- Troubleshooting and Resolution: Use diagnostic tools and scripts to resolve common issues in real time, and collaborate with cross-functional teams to analyze and address root causes.
- Service Health and Reliability: Assist in defining and refining SLAs, SLOs, and SLIs; perform routine checks and follow established runbooks to maintain consistent service reliability.
- Analysis and Reporting: Regularly review incident data to identify patterns, improve service resilience, and produce shift reports summarizing system health and resolved incidents.
- Documentation and Knowledge Base: Document incident resolutions, update runbooks, and contribute to an internal knowledge base to improve team response and efficiency.
- Continuous Improvement Initiatives: Participate in reliability enhancement projects, including automation, configuration management, and tooling improvements.
- Collaboration: Communicate effectively with SMEs to relay critical incident information, insights, and preventive recommendations.
- Mentorship: Work closely with team members to provide guidance during shifts and share insights on improving incident response.
Experience and Qualifications
- Education: B.Sc IT, B.Sc Computers, BCA or equivalent.
- Experience: 2-4 years of experience in reliability operations or a related 24x7 support role within SaaS or cloud environments.
Skills
- Proficiency in monitoring and alerting tools, such as Prometheus, Grafana, Datadog, or Splunk.
- Ability to remain composed in high-stakes situations and resolve incidents promptly.
- Strong verbal and written communication skills to document and relay incident information effectively.
Shift Information
- 24x7 Rotational Shifts: This role requires availability to work rotating shifts, including nights, weekends, and holidays, to ensure 24x7 support coverage.
What We Do
Founded in 2015, Zeta is a provider of a next-gen credit card processing platform. Zeta’s cloud-native and fully API-enabled stack offers a comprehensive range of capabilities, including processing, issuing, lending, core banking, fraud detection, and loyalty programs. With a strong focus on technology, Zeta has over 1,700 employees and contractors, with more than 70% dedicated to technology roles. Operating across the US, UK, Middle East, and Asia, Zeta has served a global customer base of 35+ clients who have issued over 15 million cards on Zeta's platform to date. Backed by prominent investors such as Softbank Vision Fund 2 and Mastercard, Zeta has raised $280 million at a valuation of $1.5 billion.







