emnify

Senior Site Reliability Engineer & Incident-Manager (m/f/d)

Berlin

In-Office

Senior level

Cloud • Information Technology • Internet of Things

The Role

Drive incident management, improve observability, and support platform engineering. Collaborate with teams to ensure responsive and resilient services on AWS.

Your Role

Are you passionate about observability and resiliency? Is ensuring we know about issues before our customers do second nature to you? Does being at the front and orchestrating processes sound fun to you? emnify is seeking a talented Reliability Engineer & Incident Management Operator to drive the company's incident management routines, be the authority on everything observability and resiliency, and guide internal stakeholders with best practices.

As part of the larger Engineering department, our Platform team plays a crucial role in enhancing our competitive edge by improving the developer experience to increase development efficiency and scale productivity. You will join a team of 3 engineers, fostering empathy and a collaborative mindset to ensure continuous improvement of the developer experience at emnify. The ideal candidate will have extensive experience with AWS cloud infrastructure, microservices, and modern observability practices, as well as strong communication and organizational skills.

The position is 35% incident management operations, 35% observability and monitoring work, and 30% platform engineering and developer support.

Emnify technology radar

The position is based in emnify’s office in Berlin.

Your Impact:

Incident management operations:

Lead and optimize the incident management process end-to-end, ensuring timely detection, resolution, and documentation of incidents. Coordinate cross-functional teams, conduct post-mortems and root cause analyses, and drive continuous improvements to workflows.

Observability and monitoring:

Design, implement, and continuously improve observability frameworks by developing dashboards, alerts, metrics, and logging strategies to monitor service health, detect anomalies proactively, support issue resolution, and ensure cost-optimized performance across the platform.

Collaboration and Support:

Partner with cross-functional teams to implement observability best practices, providing training and guidance on tools while leveraging metrics data to drive engineering priorities.

Platform engineering:

Leverage AWS to design, build, and maintain a resilient cloud infrastructure, implementing best practices for security, scalability, and cost optimization while ensuring high availability, disaster recovery, and robust platform components such as pipelines, shared infrastructure, and application services.

Your Skills:

• Proven experience as a (Site) Reliability Engineer or in a similar role at a SaaS and/or telecom company.

• Hands-on experience with observability tools (e.g., Prometheus, Mimir, Grafana, Loki, CloudWatch, Grafana IRM, Rootly), including setup and optimization of metrics and alerts.

• Experience in establishing and managing incident management processes.

• Understanding of incident management frameworks and best practices.

• Extensive experience with AWS cloud services (e.g., EC2, S3, RDS, Lambda, CloudWatch).

• Expert skills with modern infrastructure tooling and principles (Kubernetes; IaC with Terraform; CI/CD with GitHub Actions, Jenkins).

• Good understanding of modern development tooling and principles (e.g., microservices architecture, 12-factor applications, Docker).

• Advanced documentation skills for effective knowledge sharing and collaboration.

• Exceptional problem-solving and critical thinking with a passion for enhancing development experiences in fast-paced tech environments.

• Ability to work independently and as part of a team.

Nice to have:

• Knowledge of networking protocols and telecom systems

• Knowledge of secure software development

• Familiarity with programming languages such as Python, Go, or Java.

• Certification in AWS (e.g., AWS Certified DevOps Engineer, AWS Certified Solutions Architect)

Top Skills

AWS

CloudWatch

Docker

GitHub Actions

Grafana

Java

Jenkins

Kubernetes

Loki

Mimir

Prometheus

Python

Terraform

The Company

HQ: Berlin

188 Employees

Year Founded: 2014

What We Do

emnify is the leading cloud building block for cellular communications in the IoT stack, connecting millions of IoT devices globally – from electric vehicles to energy meters, alarm systems to GPS trackers, thermometers to health wearables.

The emnify API and SIM technology connect and secure any kind of IoT deployment to its application back-end. emnify’s cloud-native integrations and no-code workflows ensure seamless lifecycle scalability for deployments of all sizes – from local start-up to global enterprise.

The emnify IoT SuperNetwork is the largest globally distributed mobile cloud core network of its kind, supporting local network access (2G – 5G, LTE-M, NB-IoT) in over 180 countries from more than 25 cloud regions – and counting. emnify’s solution is built on partnerships with the leading hyperscaler cloud service providers, system integrators and hundreds of radio network operators worldwide.

Founded in 2014, emnify was the first to transform cellular IoT connectivity into an easy-to-consume cloud resource – trusted today by thousands of the world’s most innovative companies. To learn more about emnify, please visit www.emnify.com