Senior Engineer - Site Reliability Engineering

Bengaluru, Karnataka, IND
In-Office
Senior level
Cloud
The Role
The Senior Site Reliability Engineer will design automation solutions for cloud infrastructure, monitor performance, lead incident management, and promote reliability engineering best practices.
Summary Generated by Built In
Role Overview
MontyCloud is actively seeking a Senior Site Reliability Engineer (Senior SRE) to elevate the reliability, scalability, and performance of our cloud management platform. You will be at the forefront of developing and maintaining sophisticated tools and systems to automate and optimize the management and monitoring of our cloud infrastructure and applications. Your expertise will be instrumental in guiding our operational practices towards excellence and innovation.

Key Responsibilities
  • Design and implement sophisticated automation solutions for the management and monitoring of our cloud infrastructure and applications, with a focus on enhancing the reliability and performance of our SaaS platform.
  • Proactively collaborate with other SREs and cross-functional teams to preemptively address potential issues, maintaining the highest standards of reliability for our platform.
  • Monitor the health and performance of our cloud infrastructure and applications, implementing advanced strategies for troubleshooting and optimization.
  • Participate in the on-call rotation, offering expert response and resolution to incidents and issues.
  • Champion best practices in reliability engineering, including disaster recovery, chaos engineering, and incident response, to foster a culture of continuous improvement.
  • Lead by example in conducting thorough post-mortem analyses, ensuring lessons learned are integrated into future operations.
  • Communicate clearly and work both independently and collaboratively within a dynamic team environment.

Desired Skills and Requirements

Must Have
  • Problem-solving skills
  • Cloud: AWS
  • Scripting: Python
  • Automation/Configuration Management: Ansible, Puppet, Chef
  • Monitoring/Observability: Splunk, New Relic, Datadog, AWS CloudWatch, AWS X-Ray
  • Disaster recovery and incident management

Good-to-Have
  • General Dev Experience: Application development (any stack)
  • CI/CD: Any experience with Jenkins, GitLab CI, or similar tools is a plus
  • Chaos Engineering Tools: e.g., Gremlin, Chaos Monkey

Experience
  • 5 years of experience as an SRE, specifically within a SaaS platform environment, demonstrating a clear understanding of the unique challenges and opportunities in SaaS.
  • 3 years of experience specifically in managing and optimizing SaaS platforms.
  • 3 years of expert knowledge and hands-on experience with AWS.
  • 4 years of experience using automation tools like Ansible, Puppet, or Chef.
  • 4 years of experience with scripting in Python or similar languages.
  • 3 years of experience using tools like Splunk, New Relic, Datadog, AWS CloudWatch, or AWS X-Ray.
  • 3 years of experience leading disaster recovery efforts in current and previous roles.
  • 3 years of experience implementing chaos engineering practices in live environments.
  • 4 years of active involvement in on-call rotations and incident management.
  • 4+ years of end-to-end application development experience, showcasing familiarity with the complete software development lifecycle and a strong ability to design, implement, and deploy functional, scalable applications.
  • 3 years of experience leading post-mortem analysis sessions following major incidents.

Education
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field, or equivalent hands-on experience in building and operating large-scale cloud/SaaS platforms.

MontyCloud Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about MontyCloud and has not been reviewed or approved by MontyCloud.

  • Fair & Transparent Compensation: Feedback suggests compensation and benefits are viewed favorably overall, indicating competitive pay positioning for many roles.
  • Healthcare Strength: Job postings indicate medical, dental, and vision coverage as part of a comprehensive package in the U.S.
  • Equity Value & Accessibility: Listings highlight equity participation as a standard component, signaling accessible ownership opportunities for employees.

The Company
HQ: Redmond, WA
200 Employees
Year Founded: 2018

What We Do

MontyCloud is a Seattle, WA-based intelligent Cloud Management Platform company. Our customers use MontyCloud DAY2™ to instantly close the cloud skills gap, simplify CloudOps, and reduce the total cost of cloud operations by up to 70%, all in just a few clicks. By leveraging the AWS public cloud, AI, and ML, DAY2™ simplifies provisioning, security, compliance, cost optimization, and routine operations. DAY2™'s automation-first, no-code approach helps customers immediately derive deep insights and deliver intelligent Cloud Operations in just a few minutes. You can try the platform for free at https://MontyCloud.com
