MontyCloud

Staff Engineer - Site Reliability Engineering

Bengaluru, Karnataka, IND

In-Office

Senior level

Cloud

The Role

Lead reliability and operational excellence for a cloud-native SaaS platform, focusing on automation, AI-driven operations, and system scalability.

Role Overview

MontyCloud is seeking a highly experienced Staff Site Reliability Engineer (SRE) to lead reliability, scalability, and operational excellence for our cloud-native, AI-driven SaaS platform. This role calls for strategic, organization-wide impact, combining deep expertise in distributed systems with modern practices in automation, observability, and AI-driven operations (AIOps). You will define reliability standards, influence system architecture, and build intelligent systems that enable engineering teams to operate efficiently and proactively.

As a Staff SRE, you will champion automation-first and AI-augmented reliability engineering, reducing operational toil, improving system resilience, and driving a culture of ownership and continuous improvement across teams.

Key Responsibilities

Define and drive organization-wide reliability strategy, including SLIs, SLOs, SLAs, and error budgets.
Influence system architecture to ensure high availability, scalability, fault tolerance, and operability.
Design and build scalable automation frameworks and internal platforms to reduce operational toil and enable self-service capabilities.
Leverage AI/ML-driven approaches to enhance observability, anomaly detection, and predictive incident prevention.
Implement and optimize AI-assisted incident management, including alert triage, root cause analysis, and automated remediation workflows.
Lead implementation of centralized observability (metrics, logs, traces) and define effective alerting and monitoring strategies.
Drive proactive performance optimization, capacity planning, and system efficiency improvements using data-driven insights.
Lead incident management, including critical incident response, resolution, and blameless postmortems with a focus on systemic fixes.
Design and improve incident and change management workflows, integrating observability with ITSM tools (e.g., ServiceNow, Jira Service Management, PagerDuty).
Automate incident detection, triage, escalation, and remediation workflows to minimize manual intervention.
Champion resilience practices such as disaster recovery, chaos engineering, and failure testing.
Partner with engineering teams to improve CI/CD reliability, release safety, and deployment strategies (e.g., canary, blue-green).
Continuously reduce MTTR, change failure rate, and operational overhead through automation and engineering improvements.
Drive cloud cost optimization and resource efficiency, including optimization of AI/ML workloads and inference costs.
Collaborate with data and ML teams to ensure reliability, scalability, and observability of AI/ML systems, including monitoring for drift and performance degradation.
Mentor engineers and act as a technical leader, influencing best practices and elevating reliability standards across teams.
Foster a culture of ownership, an automation-first mindset, and AI-augmented operational excellence.

Desired Skills and Requirements

Must Have

Problem-solving skills
Cloud: AWS
Programming/Scripting: Python, Go
Containerization: Kubernetes, containers, microservices architectures
Infrastructure as Code (IaC): Terraform, CloudFormation
Automation/Configuration Management: Ansible, Puppet, Chef
Monitoring/Observability: Datadog, Prometheus, Grafana, Splunk, AWS CloudWatch, AWS X-Ray
Reliability Engineering: SLIs, SLOs, SLAs, error budgets
Incident Management & Reliability Frameworks
CI/CD and Release engineering: experience with Jenkins, GitLab CI, etc.
ITSM & Incident Tools: ServiceNow, Jira Service Management, PagerDuty, Opsgenie
AI/ML & AIOps for observability, alerting, incident analysis, and automation
System Design, Scalability, Performance Engineering, and Reliability Trade-offs
Distributed Systems expertise

Good-to-Have

General Dev Experience: Internal Developer Platforms (IDP) & Platform Engineering
Chaos Engineering Tools: Gremlin, Chaos Monkey, etc.
Resilience Testing
Security, Compliance, and Governance in Cloud Environments
Application Development
Agile Methodology
FinOps & Cloud Cost Optimization

Experience

8+ years of experience in Site Reliability Engineering / DevOps / Platform Engineering in SaaS platform environments.
3 years of experience specifically in managing and optimizing SaaS platforms.
3 years of expert knowledge and hands-on experience with AWS.
4 years of experience using automation tools like Ansible, Puppet, or Chef.
4 years of experience with scripting in Python or similar languages.
3 years of experience using tools like Splunk, New Relic, Datadog, AWS CloudWatch, or AWS X-Ray.
3 years of experience leading disaster recovery efforts in current and previous roles.
3 years of experience implementing chaos engineering practices in live environments.
4 years of active involvement in on-call rotations and incident management.
4+ years of end-to-end application development experience, showcasing familiarity with the complete software development lifecycle and a strong ability to design, implement, and deploy functional, scalable applications.
3 years of experience leading post-mortem analysis sessions following major incidents.

Education

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
Equivalent practical experience in large-scale SaaS or cloud-native environments is highly valued.

The Company

HQ: Redmond, WA

200 Employees

Year Founded: 2018

What We Do

MontyCloud is a Seattle, WA-based intelligent Cloud Management Platform company. Our customers use MontyCloud DAY2™ to instantly close the cloud skills gap, simplify CloudOps, and reduce the total cost of cloud operations by up to 70%, all in just a few clicks. By leveraging the AWS public cloud, AI, and ML, DAY2™ simplifies provisioning, security, compliance, cost optimization, and routine operations. DAY2™'s automation-first, no-code approach helps customers immediately derive deep insights and deliver intelligent Cloud Operations in just a few minutes. You can try the platform for free at https://MontyCloud.com