The Role
Clearwater Analytics is seeking a passionate and detail-oriented SRE Manager to lead our dedicated team in maintaining the reliability and performance of our cloud-based platforms. The SRE Manager will oversee a talented team focused on achieving excellence in cloud operations and observability, fostering a culture of collaboration, innovation, and accountability while upholding the highest standards in service reliability.
Key Responsibilities:
- Lead the SRE team to ensure the reliability and performance of our cloud-based monitoring services.
- Design, implement, and maintain scalable, secure, and resilient cloud infrastructure solutions in AWS, Azure, and GCP.
- Collaborate with cross-functional teams to define cloud architecture strategies that align with business objectives and drive innovation.
- Drive automation efforts for cloud deployment, configuration management, and monitoring to enhance operational efficiency.
- Develop and enforce best practices for Infrastructure as Code (IaC) using tools such as Terraform, Ansible, or CloudFormation.
- Manage cloud costs and optimize infrastructure utilization for maximum efficiency.
- Ensure compliance with security standards and best practices in cloud service deployment and configuration.
- Conduct regular audits and assessments of cloud resources and services to ensure optimal performance and security.
- Lead initiatives related to SLIs, SLOs, and error budgets in collaboration with the R&D team to proactively manage platform stability.
- Enhance system observability through effective monitoring, alerting, and metrics reporting.
- Implement observability solutions (logs, metrics, traces) for cloud foundational platforms and promote best practices in reliability engineering.
- Build and mentor a high-performing team, helping its members achieve both personal and organizational goals.
Requirements:
- Bachelor’s or Master’s degree in Computer Science or a related field.
- Over 12 years of experience managing services in large-scale environments, with at least 3 years in a leadership role.
- 5+ years of SRE experience focusing on telemetry, observability, self-healing solutions, and platform automation.
- Proficiency in several programming languages, including Java, Python, and JavaScript (5+ years of experience).
- Hands-on experience with build and release tools such as Jenkins, Sonar, Artifactory, JIRA, and GitLab, along with a strong understanding of CI/CD.
- 5+ years of experience with public cloud environments such as AWS, Azure, and GCP.
- Experience with observability tools and frameworks, such as Dynatrace, Prometheus, Grafana, and AWS CloudWatch.
- Strong incident response and management skills, demonstrating a proactive and strategic approach to system reliability.
- Hands-on experience with Infrastructure as Code (IaC) and configuration management tools (e.g., Terraform, Puppet).
- Demonstrated integrity, strong ownership, and excellent communication and collaboration skills.
The Company