SRE Availability Engineer

Posted 2 Days Ago
Bangalore, Bengaluru Urban, Karnataka, IND
In-Office
Mid level
Security • Software • Cybersecurity
The Role
Lead observability and incident response for production systems: manage monitoring, alerting, tracing, logging, on-call rotations, runbooks, SLO/SLA enforcement, automated remediation, synthetic monitoring, and post-incident RCA to reduce MTTD/MTTR.
Summary Generated by Built In

Who we are

DigiCert is a global leader in intelligent trust. We protect the digital world by ensuring the security, privacy, and authenticity of every interaction. Our AI-powered DigiCert ONE platform unifies PKI, DNS, and certificate lifecycle management to secure infrastructure, software, devices, messages, AI content and agents. Learn why more than 100,000 organizations, including 90% of the Fortune 500, choose DigiCert to stop today’s threats and prepare for a quantum-safe future at www.digicert.com.

Job summary

We are seeking a highly skilled Observability & Incident Response Site Reliability Engineer (SRE) to own incident management practices across all production systems. In this role, you will be the subject matter expert for monitoring, alerting, tracing, and logging, and you will lead incident response efforts. You will work at the intersection of product engineering, platform, and security teams to ensure our systems are observable, resilient, and compliant with SLA/SLO commitments.

 

What you will do

  • Excellent knowledge of Kubernetes clusters and container workloads for production reliability.
  • Administer and optimize CI/CD pipelines (Harness, GitHub Actions, etc.) to support safe, fast, and frequent deployments and to automate repetitive manual tasks.
  • Act as the primary Incident Manager for high priority production incidents — coordinating swift resolution across engineering, infrastructure, and business teams.
  • Own and continuously improve incident response runbooks, escalation matrices, and on-call schedules.
  • Drive root cause analysis (RCA) for all major incidents, with action item tracking and long-term resolution.
  • Reduce Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR) through proactive alerting and automated remediation.
  • Establish and enforce SLA/SLO/SLI frameworks across all production services.
  • Build automated runbooks and self-healing mechanisms to reduce manual intervention during incidents.
  • Implement synthetic monitoring to proactively detect customer-facing issues.
  • Hands-on experience with Splunk queries to investigate incidents, build dashboards, and drive observability across production systems.
  • Exceptional communication skills — able to lead high-pressure incident bridges calmly and clearly.
  • Detail-oriented with a strong sense of ownership and accountability.
  • Ability to manage multiple concurrent incidents and priorities without losing composure.

 

What you will have

  • 3+ years of experience in SRE, DevOps, Platform Engineering, or Observability Engineering roles.
  • Hands-on experience leading incident response for high-severity production incidents.
  • Strong background in Linux systems administration and distributed systems troubleshooting.
  • Experience defining and managing SLOs, SLIs, and Error Budgets in production.

 

Nice to have

  • Monitoring & alerting: New Relic, Nagios, or equivalent.
  • Log management: Splunk.
  • Incident management: PagerDuty, OpsGenie, VictorOps, or equivalent.
  • Container orchestration: Kubernetes, Helm, Docker — with deep observability integration experience.
  • Scripting & automation: Python, Bash or similar for building tooling and automations.
  • Infrastructure as Code: Terraform, Salt.
  • CI/CD pipelines: GitHub Actions, Harness.

 

Benefits

  • Generous time off policies.
  • Top shelf benefits.
  • Education, wellness and lifestyle support.

 

To protect candidate information and maintain a secure hiring process, all applications must be submitted through our careers portal. Resumes or CVs sent directly via email will not be reviewed or considered.

 


Skills Required

  • 3+ years of experience in SRE, DevOps, Platform Engineering, or Observability Engineering roles
  • Hands-on experience leading incident response for high-severity production incidents
  • Excellent knowledge of Kubernetes clusters and container workloads for production reliability
  • Administer and optimize CI/CD pipelines to support safe, fast, and frequent deployments (e.g., Harness, GitHub Actions)
  • Strong background in Linux systems administration and distributed systems troubleshooting
  • Experience defining and managing SLOs, SLIs, and Error Budgets in production
  • Hands-on experience with Splunk queries to investigate incidents, build dashboards, and drive observability
  • Act as primary Incident Manager, coordinate cross-team resolution, and own incident runbooks and escalations
  • Exceptional communication skills and ability to manage multiple concurrent incidents under pressure
  • Monitoring & alerting tools (New Relic, Nagios) and incident management tools (PagerDuty, OpsGenie, VictorOps)
  • Container tooling and observability integration: Helm, Docker
  • Scripting & automation experience (Python, Bash) for tooling and automations
  • Infrastructure as Code experience (Terraform, Salt)

DigiCert Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about DigiCert and has not been reviewed or approved by DigiCert.

  • Leave & Time Off Breadth: Vacation/PTO and sick leave are characterized as strong, and some accounts mention a sabbatical program.
  • Retirement Support: The package includes a 401(k) with company matching, with recent confirmations of this benefit.
  • Flexible Benefits: Hybrid and work-from-home options are referenced consistently, indicating practical flexibility in how and where work is done.


The Company
HQ: Lehi, Utah
1,372 Employees
Year Founded: 2003

What We Do

DigiCert is the digital trust provider of choice for leading companies around the globe, enabling individuals, businesses, governments, and consortia to engage online with confidence, knowing their digital footprint is secure.
