Senior Site Reliability Engineer (SRE)

Sorry, this job was removed at 06:20 p.m. (CST) on Sunday, Mar 23, 2025
Lahore, Punjab
In-Office
HR Tech • Software
The Role

Description
Job Overview:

We are looking for a highly skilled Senior Site Reliability Engineer (SRE) with expertise in monitoring, performance optimization, and ensuring high availability for SaaS web applications. The ideal candidate will be responsible for building, scaling, and maintaining reliable systems that can handle large traffic loads while ensuring minimal downtime. This role will focus on monitoring application performance, uptime, and reliability, working closely with engineering and DevOps teams to maintain seamless customer experiences. If you have a passion for automating reliability and scalability while maintaining the uptime of critical services, we’d love to have you on our team.

Key Responsibilities:

  • Monitoring and Observability:
    • Design and implement monitoring solutions to ensure the health, performance, and availability of SaaS web applications and infrastructure.
    • Develop and maintain dashboards, alerts, and reporting systems for proactive monitoring of application performance, user experience, and system health.
    • Ensure end-to-end observability by integrating log aggregation, metrics, and tracing tools to identify and resolve issues before they impact customers.
  • Incident Management & Root Cause Analysis:
    • Lead the response to production incidents, working with cross-functional teams to identify the root cause and implement effective remediation strategies.
    • Drive post-incident reviews and document incidents, identifying areas for improvement in systems, processes, and response strategies.
    • Create and enforce procedures for incident management, on-call rotations, and escalations.
  • Reliability & Availability:
    • Collaborate with engineering and DevOps teams to implement strategies for ensuring high availability, scalability, and disaster recovery for critical services.
    • Ensure systems are designed to handle high traffic loads and remain resilient to failures by building and deploying robust monitoring frameworks and automation tools.
    • Focus on reducing mean time to recovery (MTTR) and increasing mean time between failures (MTBF) across the SaaS platform.
  • Automation & Efficiency:
    • Drive automation efforts to eliminate manual intervention and improve system reliability through automated testing, deployment, and monitoring pipelines.
    • Collaborate with the development team to implement changes that improve system reliability and efficiency.
  • Capacity Planning & Performance Tuning:
    • Monitor system resource usage and identify potential capacity issues, driving proactive scaling and performance tuning initiatives.
    • Use performance metrics to predict scaling needs and ensure the infrastructure can meet the growing demands of the platform.
  • Collaboration & Cross-Functional Engagement:
    • Work closely with developers, product managers, and DevOps engineers to improve application performance and reliability through better code, infrastructure, and operational practices.
    • Act as a mentor to junior SREs, sharing knowledge about best practices for monitoring, scaling, and troubleshooting complex web applications.
  • Continuous Improvement & Best Practices:
    • Establish and promote best practices for reliability engineering, monitoring standards, incident management, and performance optimization.
    • Stay current with industry trends and evaluate new tools and technologies to improve service reliability and monitoring practices.
Requirements
Required Skills and Qualifications:
  • Experience:
    • 5+ years of experience as a Site Reliability Engineer (SRE), Systems Engineer, or DevOps Engineer with a focus on monitoring, reliability, and performance for SaaS-based web applications.
    • Proven track record in designing and maintaining monitoring systems for large-scale, high-availability applications.
  • Technical Skills:
    • Strong experience with monitoring, logging, and alerting tools such as Prometheus, Grafana, Datadog, ELK Stack (Elasticsearch, Logstash, Kibana), New Relic, or similar.
    • Expertise in setting up and managing cloud-based infrastructure monitoring (AWS CloudWatch, Google Cloud Operations, etc.).
    • Experience with containerized applications (Docker, Kubernetes) and orchestrating infrastructure at scale.
  • Scripting & Automation:
    • Proficiency in automation tools (e.g., Terraform, Ansible, Chef, Puppet) and programming/scripting languages (e.g., Python, Go, Shell).
    • Experience building and managing automated pipelines for CI/CD, deployment, and monitoring.
  • Incident Response & Troubleshooting:
    • Expertise in incident response, troubleshooting production issues, root cause analysis, and leading post-mortems to improve system reliability.
    • Familiarity with on-call responsibilities, managing high-pressure situations, and minimizing downtime for customers.
  • Cloud & Infrastructure Experience:
    • Experience with cloud platforms (AWS, GCP, Azure) and managing infrastructure at scale.
    • Understanding of distributed systems, microservices architecture, and how to monitor and manage them effectively.
  • Performance Tuning & Optimization:
    • Strong understanding of application performance tuning, database performance, and infrastructure optimizations.
    • Experience with system performance monitoring, profiling, and resource management.

Preferred Qualifications:

  • Education:
    • Bachelor’s degree in Computer Science, Information Technology, or a related field.
    • Relevant certifications (e.g., AWS Certified Solutions Architect, Google Cloud Professional Cloud Architect, Kubernetes certifications) are a plus.
  • Soft Skills:
    • Excellent problem-solving and troubleshooting skills, with a systematic approach to diagnosing issues.
    • Strong collaboration and communication skills, capable of working in a cross-functional team environment.
    • Ability to work independently and take ownership of projects, driving them to completion with minimal supervision.
  • Domain Expertise:
    • Experience in the SaaS or fintech domain, with an understanding of the unique reliability, security, and compliance needs of these environments, is a plus.

The Company
26 Employees
Year Founded: 2022

What We Do

HR Force was built with the vision of connecting the right people with the right business entity. We are firm believers in the utilization of talent for the advancement & betterment of humankind, and live by a policy of “no talent goes unnoticed”.

Our services include:
Talent Acquisition Management
Culture & Employee Branding
Compensation & Benefits Management
Performance Management & Training Development
Internal Policy Creation and Implementation
