Who we are
DigiCert is a global leader in intelligent trust. We protect the digital world by ensuring the security, privacy, and authenticity of every interaction. Our AI-powered DigiCert ONE platform unifies PKI, DNS, and certificate lifecycle management to secure infrastructure, software, devices, messages, AI content, and agents. Learn why more than 100,000 organizations, including 90% of the Fortune 500, choose DigiCert to stop today’s threats and prepare for a quantum-safe future at www.digicert.com.
Job summary
We are seeking a highly skilled Observability & Incident Response Site Reliability Engineer (SRE) to own incident management practices across all production systems. In this role, you will be the subject matter expert for monitoring, alerting, tracing, and logging, and you will lead incident response efforts. You will work at the intersection of product engineering, platform, and security teams to ensure our systems are observable, resilient, and compliant with SLA/SLO commitments.
What you will do
- Excellent knowledge of Kubernetes clusters and container workloads for production reliability.
- Administer and optimize CI/CD pipelines to support safe, fast, and frequent deployments and to automate repeated manual tasks (Harness, GitHub Actions, etc.).
- Act as the primary Incident Manager for high priority production incidents — coordinating swift resolution across engineering, infrastructure, and business teams.
- Own and continuously improve incident response runbooks, escalation matrices, and on-call schedules.
- Drive root cause analysis for all major incidents, ensuring action item tracking and long-term resolution.
- Reduce Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR) through proactive alerting and automated remediation.
- Establish and enforce SLA/SLO/SLI frameworks across all production services.
- Build automated runbooks and self-healing mechanisms to reduce manual intervention during incidents.
- Implement synthetic monitoring to proactively detect customer-facing issues.
- Hands-on experience with Splunk queries to investigate incidents, build dashboards, and drive observability across production systems.
- Exceptional communication skills — able to lead high-pressure incident bridges calmly and clearly.
- Detail-oriented with a strong sense of ownership and accountability.
- Ability to manage multiple concurrent incidents and priorities without losing composure.
What you will have
- 3+ years of experience in SRE, DevOps, Platform Engineering, or Observability Engineering roles.
- Hands-on experience leading incident response for high-severity production incidents.
- Strong background in Linux systems administration and distributed systems troubleshooting.
- Experience defining and managing SLOs, SLIs, and Error Budgets in production.
Nice to have
- Monitoring & alerting: New Relic, Nagios, or equivalent.
- Log management: Splunk.
- Incident management: PagerDuty, OpsGenie, VictorOps, or equivalent.
- Container orchestration: Kubernetes, Helm, Docker — with deep observability integration experience.
- Scripting & automation: Python, Bash, or similar for building tooling and automations.
- Infrastructure as Code: Terraform, Salt.
- CI/CD pipelines: GitHub Actions, Harness.
Benefits
- Generous time off policies.
- Top-shelf benefits.
- Education, wellness and lifestyle support.
To protect candidate information and maintain a secure hiring process, all applications must be submitted through our careers portal. Resumes or CVs sent directly via email will not be reviewed or considered.
Skills Required
- 3+ years of experience in SRE, DevOps, Platform Engineering, or Observability Engineering roles
- Hands-on experience leading incident response for high-severity production incidents
- Excellent knowledge of Kubernetes clusters and container workloads for production reliability
- Administer and optimize CI/CD pipelines to support safe, fast, and frequent deployments (e.g., Harness, GitHub Actions)
- Strong background in Linux systems administration and distributed systems troubleshooting
- Experience defining and managing SLOs, SLIs, and Error Budgets in production
- Hands-on experience with Splunk queries to investigate incidents, build dashboards, and drive observability
- Act as primary Incident Manager, coordinate cross-team resolution, and own incident runbooks and escalations
- Exceptional communication skills and ability to manage multiple concurrent incidents under pressure
- Monitoring & alerting tools (New Relic, Nagios) and incident management tools (PagerDuty, OpsGenie, VictorOps)
- Container tooling and observability integration: Helm, Docker
- Scripting & automation experience (Python, Bash) for tooling and automations
- Infrastructure as Code experience (Terraform, Salt)
DigiCert Compensation & Benefits Highlights
The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about DigiCert and has not been reviewed or approved by DigiCert.
- Leave & Time Off Breadth: Vacation/PTO and sick leave are characterized as strong, and some accounts mention a sabbatical program.
- Retirement Support: The package includes a 401(k) with company matching, with recent confirmations of this benefit.
- Flexible Benefits: Hybrid and work-from-home options are referenced consistently, indicating practical flexibility in how and where work is done.
DigiCert Insights
What We Do
DigiCert is the digital trust provider of choice for leading companies around the globe, enabling individuals, businesses, governments, and consortia to engage online with confidence, knowing their digital footprint is secure.







