At Everbridge, we’re building a resilient, scalable, and secure cloud platform that powers critical services used around the world. We’re looking for a Senior Platform Site Reliability Specialist to own, operate, and evolve our enterprise observability platform.
In this role, you will be responsible for the upkeep, reliability, scalability, and strategic growth of Everbridge’s observability stack, EKS, and supporting services, ensuring our engineering teams have deep visibility into system health, performance, and reliability across a large-scale, cloud-native environment. You will also work with other cloud technologies across AWS and GCP.
We’re looking for someone who shows up for the team, not just themselves. This role works best for a person who communicates clearly, collaborates easily, and treats interactions with other teams with respect and professionalism. You should be comfortable being involved, offering support, and helping move work forward without ego. We value people who build trust, keep things running smoothly, and make the teams around them better.
What you'll do:
- Lead the design, operation, and evolution of Everbridge’s observability stack
- Build and maintain a highly available, scalable observability platform
- Standardize instrumentation, dashboards, alerts, and SLOs
- Support incident response, root cause analysis, and capacity planning
Grafana Stack & Telemetry
- Operate and scale Grafana and related technologies
- Grafana Loki (logs)
- Grafana Mimir (metrics)
- Grafana Tempo (tracing)
- Grafana Alerting
Kubernetes
- Maintain reliability and security of EKS clusters running observability
- Manage cluster lifecycle and upgrades
Infrastructure as Code & Automation
- Terraform for infrastructure provisioning
- HashiCorp Packer
- GitLab CI/CD at scale
What you'll bring:
- 6+ years in SRE / Platform Engineering
- Strong Grafana ecosystem experience
- Kubernetes and Amazon EKS expertise
- Terraform proficiency
Preferred Qualifications:
- OpenTelemetry experience
- Large-scale observability systems
- Cost optimization experience
What We Do
Keeping People Safe and Businesses Running. Faster. Everbridge, Inc. (NASDAQ: EVBG) is a global software company that provides enterprise software applications that automate and accelerate organizations’ operational response to critical events in order to Keep People Safe and Businesses Running™. During public safety threats such as active shooter situations, terrorist attacks or severe weather conditions, as well as critical business events including IT outages, cyber-attacks or other incidents such as product recalls or supply-chain interruptions, over 5,300 global customers rely on the company’s Critical Event Management Platform to quickly and reliably aggregate and assess threat data, locate people at risk and responders able to assist, automate the execution of pre-defined communications processes through the secure delivery to over 100 different communication devices, and track progress on executing response plans.






