Site Reliability Manager

Sorry, this job was removed at 02:59 p.m. (CST) on Thursday, May 08, 2025
Block Noida Authority Office, Sector 6, Gautam Buddha Nagar, Uttar Pradesh
In-Office
Fintech • Insurance • Software • Analytics
The Role

Clearwater Analytics is seeking a passionate and detail-oriented SRE Manager to lead our dedicated team in maintaining the reliability and performance of our cloud-based platforms. The SRE Manager will oversee a talented team focused on achieving excellence in cloud operations and observability, fostering a culture of collaboration, innovation, and accountability while upholding the highest standards in service reliability.

Key Responsibilities:

  • Lead the SRE team to ensure the reliability and performance of our cloud-based monitoring services.
  • Design, implement, and maintain scalable, secure, and resilient cloud infrastructure solutions in AWS, Azure, and GCP.
  • Collaborate with cross-functional teams to define cloud architecture strategies that align with business objectives and drive innovation.
  • Drive automation efforts for cloud deployment, configuration management, and monitoring to enhance operational efficiency.
  • Develop and enforce best practices for Infrastructure as Code (IaC) using tools such as Terraform, Ansible, or CloudFormation.
  • Manage cloud costs and optimize infrastructure utilization for maximum efficiency.
  • Ensure compliance with security standards and best practices in cloud service deployment and configuration.
  • Conduct regular audits and assessments of cloud resources and services to ensure optimal performance and security.
  • Lead initiatives related to SLIs, SLOs, and error budgets in collaboration with the R&D team to proactively manage platform stability.
  • Enhance system observability through effective monitoring, alerting, and metrics reporting.
  • Implement observability solutions (logs, metrics, traces) for cloud foundational platforms and promote best practices in reliability engineering.
  • Mentor and build a high-performing team to achieve both personal and organizational goals.

Requirements:

  • Bachelor’s or Master’s degree in Computer Science or a related field.
  • Over 12 years of experience managing services in large-scale environments, with at least 3 years in a leadership role.
  • 5+ years of SRE experience focusing on telemetry, observability, self-healing solutions, and platform automation.
  • 5+ years of hands-on programming experience in several languages, including Java, Python, and JavaScript.
  • Hands-on experience with build and release tools such as Jenkins, Sonar, Artifactory, JIRA, and GitLab, along with a strong CI/CD understanding.
  • 5+ years of experience with public cloud environments such as AWS, Azure, and GCP.
  • Experience with observability tools and frameworks, such as Dynatrace, Prometheus, Grafana, and AWS CloudWatch.
  • Strong incident response and management skills, demonstrating a proactive and strategic approach to system reliability.
  • Hands-on experience with Infrastructure as Code (IaC) and configuration management tools (e.g., Terraform, Puppet).
  • Demonstrated integrity, strong ownership, and excellent communication and collaboration skills.

The Company
1,150 Employees
