We are looking for an experienced Senior Site Reliability Engineer (SRE) to own the reliability, availability, and operational excellence of business-critical production systems.
This is a dedicated Site Reliability Engineering role—not a general DevOps or Infrastructure position. You will define how reliability is measured, lead incident response during production outages, drive observability strategy, and continuously improve operational practices across high-availability environments.
The ideal candidate has hands-on experience managing SLOs, leading major incidents, improving on-call operations, and building a strong reliability culture through automation, observability, and continuous improvement.
Responsibilities:
- Define, implement, and continuously improve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
- Develop and maintain observability strategies, including monitoring, logging, tracing, and alerting.
- Own observability configuration, instrumentation, and alert optimization.
- Lead Incident Command during production incidents and coordinate cross-functional response efforts.
- Drive blameless postmortems and ensure corrective actions are completed.
- Own and continuously improve the on-call program, including rotations, escalation policies, runbooks, and alert tuning.
- Establish production readiness standards for new services.
- Partner with engineering teams on capacity planning, scalability, and disaster recovery initiatives.
- Automate operational processes and reliability improvements using software engineering best practices.
- Continuously improve system reliability, availability, and operational efficiency.
Requirements:
- 5+ years of experience in Site Reliability Engineering, Production Engineering, Reliability Engineering, or similar roles.
- Proven experience operating production systems in high-availability environments.
- Hands-on experience defining and managing SLOs, SLIs, and Error Budgets.
- Experience leading production incident response and Incident Command.
- Strong observability and monitoring experience.
- Strong software engineering skills using Python, Go, or TypeScript.
- Experience working with cloud platforms.
- Strong written and verbal English communication skills.
Must have:
- Proven Site Reliability Engineering experience.
- Experience defining and managing:
  - Service Level Indicators (SLIs)
  - Service Level Objectives (SLOs)
  - Error Budgets
- Experience leading Incident Command during major production incidents.
- Experience conducting blameless postmortems and driving follow-up actions.
- Experience designing, maintaining, and improving on-call programs.
- Experience developing runbooks and escalation policies.
- Strong observability experience, including:
  - Monitoring
  - Logging
  - Alerting
  - Distributed Tracing
- Experience tuning alerts to reduce operational noise.
- Strong automation skills using Python, Go, or TypeScript.
- Experience supporting mission-critical production systems.
- Experience working in high-availability production environments.
Nice to have:
- Experience with Datadog.
- Experience with AWS.
- Experience with Heroku.
- Experience working in regulated industries (Healthcare, HIPAA, Financial Services, etc.).
- Experience establishing or maturing an SRE practice.
- Capacity planning experience.
- Disaster recovery planning and execution.
- Experience with Kubernetes.
- Experience with PostgreSQL or SQL Server.
- Experience supporting modern TypeScript-based applications.
What We Do
We are on a mission to give every company, no matter the size, the opportunity to innovate and help build a better future.
Our Services:
- Dedicated Tech Squads (full/partial): we create multi-disciplinary, remote (near-shore) Tech Squads that become part of your team. They adapt to your workflows and are trained on Agile methodologies to deliver continuous value. We believe that well-trained remote teams let clients increase innovation output by accessing a larger, more diverse pool of talent while reducing the cost of development.
- On-Demand Software Development: at our core, we are software developers excited about building digital products and solutions using the latest technologies and agile methodologies. We provide end-to-end capabilities to deliver on your technical requests: Product Management, Tech Architecture, Front/Back End Development, DevOps, and QA.
- Venture Building: we partner with companies to co-launch new digital businesses that leverage core assets of the company (distribution channels, customer base, industry knowledge, proprietary technology, etc.). We take the co-created ideas into MLPs (Most Lovable Products), aiming to find product-market fit and scale in the leanest possible way. As startup founders ourselves, we love getting things from 0 to 1.
We are End-To-End Innovation Enablers, helping your company unlock its full innovation potential.








