DigiCert

SRE Availability Engineer

Posted 2 Days Ago

Bangalore, Bengaluru Urban, Karnataka, IND

In-Office

Mid level

Security • Software • Cybersecurity

The Role

Lead observability and incident response for production systems: manage monitoring, alerting, tracing, logging, on-call rotations, runbooks, SLO/SLA enforcement, automated remediation, synthetic monitoring, and post-incident RCA to reduce MTTD/MTTR.

Summary Generated by Built In

Who we are

DigiCert is a global leader in intelligent trust. We protect the digital world by ensuring the security, privacy, and authenticity of every interaction. Our AI-powered DigiCert ONE platform unifies PKI, DNS, and certificate lifecycle management to secure infrastructure, software, devices, messages, AI content, and agents. Learn why more than 100,000 organizations, including 90% of the Fortune 500, choose DigiCert to stop today’s threats and prepare for a quantum-safe future at www.digicert.com.

Job summary

We are seeking a highly skilled Observability & Incident Response Site Reliability Engineer (SRE) to own incident management practices across all production systems. In this role, you will be the subject matter expert for monitoring, alerting, tracing, and logging, and you will lead incident response efforts. You will work at the intersection of product engineering, platform, and security teams to ensure our systems are observable, resilient, and compliant with SLA/SLO commitments.

What you will do

Excellent knowledge of Kubernetes clusters and container workloads for production reliability.
Administer and optimize CI/CD pipelines (Harness, GitHub Actions, etc.) to support safe, fast, and frequent deployments and to automate repeated manual tasks.
Act as the primary Incident Manager for high-priority production incidents, coordinating swift resolution across engineering, infrastructure, and business teams.
Own and continuously improve incident response runbooks, escalation matrices, and on-call schedules.
Drive root cause analysis for all major incidents, ensuring action item tracking and long-term resolution.
Reduce Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR) through proactive alerting and automated remediation.
Establish and enforce SLA/SLO/SLI frameworks across all production services.
Build automated runbooks and self-healing mechanisms to reduce manual intervention during incidents.
Implement synthetic monitoring to proactively detect customer-facing issues.
Hands-on experience with Splunk queries to investigate incidents, build dashboards, and drive observability across production systems.
Exceptional communication skills — able to lead high-pressure incident bridges calmly and clearly.
Detail-oriented with a strong sense of ownership and accountability.
Ability to manage multiple concurrent incidents and priorities without losing composure.

What you will have

3+ years of experience in SRE, DevOps, Platform Engineering, or Observability Engineering roles.
Hands-on experience leading incident response for high-severity production incidents.
Strong background in Linux systems administration and distributed systems troubleshooting.
Experience defining and managing SLOs, SLIs, and Error Budgets in production.

Nice to have

Monitoring & alerting: New Relic, Nagios, or equivalent.
Log management: Splunk.
Incident management: PagerDuty, OpsGenie, VictorOps, or equivalent.
Container orchestration: Kubernetes, Helm, Docker — with deep observability integration experience.
Scripting & automation: Python, Bash, or similar for building tooling and automations.
Infrastructure as Code: Terraform, Salt.
CI/CD pipelines: GitHub Actions, Harness.

Benefits

Generous time off policies.
Top-shelf benefits.
Education, wellness and lifestyle support.

To protect candidate information and maintain a secure hiring process, all applications must be submitted through our careers portal. Resumes or CVs sent directly via email will not be reviewed or considered.

Skills Required

3+ years of experience in SRE, DevOps, Platform Engineering, or Observability Engineering roles
Hands-on experience leading incident response for high-severity production incidents
Excellent knowledge of Kubernetes clusters and container workloads for production reliability
Administer and optimize CI/CD pipelines to support safe, fast, and frequent deployments (e.g., Harness, GitHub Actions)
Strong background in Linux systems administration and distributed systems troubleshooting
Experience defining and managing SLOs, SLIs, and Error Budgets in production
Hands-on experience with Splunk queries to investigate incidents, build dashboards, and drive observability
Act as primary Incident Manager, coordinate cross-team resolution, and own incident runbooks and escalations
Exceptional communication skills and ability to manage multiple concurrent incidents under pressure
Monitoring & alerting tools (New Relic, Nagios) and incident management tools (PagerDuty, OpsGenie, VictorOps)
Container tooling and observability integration: Helm, Docker
Scripting & automation experience (Python, Bash) for tooling and automations
Infrastructure as Code experience (Terraform, Salt)

DigiCert Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about DigiCert and has not been reviewed or approved by DigiCert.

Leave & Time Off Breadth — Vacation/PTO and sick leave are characterized as strong, and some accounts mention a sabbatical program.
Retirement Support — The package includes a 401(k) with company matching, with recent confirmations of this benefit.
Flexible Benefits — Hybrid and work-from-home options are referenced consistently, indicating practical flexibility in how and where work is done.

The Company

HQ: Lehi, Utah

1,372 Employees

Year Founded: 2003

What We Do

DigiCert is the digital trust provider of choice for leading companies around the globe, enabling individuals, businesses, governments, and consortia to engage online with confidence, knowing their digital footprint is secure.