Lead Site Reliability Engineer, DevOps

Posted 9 Days Ago
Pune, Mahārāshtra
In-Office
Senior level
Information Technology • Security • Cybersecurity
The Role
The Senior Site Reliability Engineer will enhance observability and reliability in large distributed systems through monitoring, incident response, and automation.
Summary Generated by Built In

Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!

Job Title

Senior Site Reliability Engineer (SRE) – Observability & DevOps

Role Summary

We are looking for a Senior SRE who will own and evolve our observability and reliability platform. The ideal candidate has strong Linux fundamentals, hands-on experience with modern monitoring stacks, and the ability to design scalable alerting and metrics pipelines for large, distributed systems.

This role requires both deep technical expertise and a production ownership mindset.

Primary Responsibilities

Observability & Monitoring
  • Design, implement, and maintain end-to-end observability using:
    • Prometheus for metrics collection
    • Alertmanager for alert routing, deduplication, and escalation
    • Grafana for visualization and dashboards
    • AppDynamics for APM, transaction tracing, and application health
  • Build actionable dashboards for:
    • SLIs, SLOs, and error budgets
    • Application, infrastructure, and platform health
  • Reduce alert fatigue by implementing signal-based alerting and proper severity models
Data & Metrics Platform
  • Manage and optimize ClickHouse for:
    • High-volume metrics, logs, or traces
    • Long-term retention and fast analytical queries
  • Work on schema design, performance tuning, and cost optimization
Reliability & Operations
  • Define and track SLIs, SLOs, and SLAs in line with SRE best practices
  • Participate in incident response, postmortems, and root cause analysis
  • Drive reliability improvements through automation and capacity planning
Automation & Engineering
  • Develop tooling and automation using at least one scripting/programming language
  • Automate monitoring onboarding, alert generation, and dashboard creation
  • Improve operational efficiencies across DevOps tooling
Required Technical Skills (Must-Have)

Core Skills
  • Strong Linux fundamentals
    • Troubleshooting, performance tuning, networking, system internals
  • Scripting / Programming (Any one or more):
    • Python (preferred), Bash, Go, or similar
  • Observability Tools (Hands-on):
    • Prometheus
    • Alertmanager
    • Grafana
    • AppDynamics
  • Data Platform:
    • Hands-on experience with ClickHouse
Monitoring & Alerting Concepts
  • Metrics vs logs vs traces
  • Golden signals (latency, traffic, errors, saturation)
  • Alert thresholds, routing policies, escalation strategies
Preferred / Nice-to-Have Skills
  • Kubernetes monitoring (Prometheus Operator, kube-state-metrics)
  • Infrastructure as Code (Terraform, Helm)
  • CI/CD observability
  • Cloud platforms (AWS / Azure / GCP)
  • Experience managing observability at scale (100+ services / platforms)
Senior-Level Expectations
  • Ability to architect observability solutions, not just operate them
  • Strong production troubleshooting and incident ownership
  • Mentoring junior engineers
  • Influencing DevOps and SRE best practices across teams
  • Communicating clearly with developers and leadership
Experience & Qualification
  • 5-7 years of experience in SRE / DevOps / Production Engineering
  • Experience operating high-availability, large-scale systems
  • Proven background in observability-driven reliability improvements

Top Skills

Alertmanager
AppDynamics
Bash
ClickHouse
Go
Grafana
Prometheus
Python

The Company
2,736 Employees
Year Founded: 1999

What We Do

Qualys, Inc. (NASDAQ: QLYS) is a pioneer and leading provider of disruptive cloud-based security, compliance and IT solutions with more than 10,000 subscription customers worldwide, including a majority of the Forbes Global 100 and Fortune 100. Qualys helps organizations streamline and automate their security and compliance solutions onto a single platform for greater agility, better business outcomes, and substantial cost savings.
The Qualys Cloud Platform leverages a single agent to continuously deliver critical security intelligence while enabling enterprises to automate the full spectrum of vulnerability detection, compliance, and protection for IT systems, workloads and web applications across on premises, endpoints, servers, public and private clouds, containers, and mobile devices. Founded in 1999 as one of the first SaaS security companies, Qualys has strategic partnerships and seamlessly integrates its vulnerability management capabilities into security offerings from cloud service providers, including Amazon Web Services, the Google Cloud Platform and Microsoft Azure, along with a number of leading managed service providers and global consulting organizations. For more information, please visit http://www.qualys.com
