Senior Site Reliability Engineer (SRE)

Posted 2 Days Ago
7 Locations
In-Office or Remote
Senior level
Software
The Role
Own reliability, availability, and operational excellence for production systems: define SLIs, SLOs, and error budgets; lead incident command; drive observability and automation; and own runbooks, on-call, postmortems, capacity planning, and continuous reliability improvements.
Join Our Team

Oowlish, one of Latin America's rapidly expanding software development companies, is seeking experienced technology professionals to enhance our diverse and vibrant team.

As a valued member of Oowlish, you will collaborate with premier clients from the United States and Europe, contributing to pioneering digital solutions. Our commitment to creating a nurturing work environment is recognized by our certification as a Great Place to Work, where you will have opportunities for professional development, growth, and a chance to make a significant international impact.

We offer the convenience of remote work, allowing you to craft a work-life balance that suits your personal and professional needs. We're looking for candidates who are passionate about technology, proficient in English, and excited to collaborate remotely with teams and clients worldwide.

About the Role:
 

We are looking for an experienced Senior Site Reliability Engineer (SRE) to own the reliability, availability, and operational excellence of business-critical production systems.

This is a dedicated Site Reliability Engineering role—not a general DevOps or Infrastructure position. You will define how reliability is measured, lead incident response during production outages, drive observability strategy, and continuously improve operational practices across high-availability environments.

The ideal candidate has hands-on experience managing SLOs, leading major incidents, improving on-call operations, and building a strong reliability culture through automation, observability, and continuous improvement.

Responsibilities:

  • Define, implement, and continuously improve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
  • Develop and maintain observability strategies, including monitoring, logging, tracing, and alerting.
  • Own observability configuration, instrumentation, and alert optimization.
  • Lead Incident Command during production incidents and coordinate cross-functional response efforts.
  • Drive blameless postmortems and ensure corrective actions are completed.
  • Own and continuously improve the on-call program, including rotations, escalation policies, runbooks, and alert tuning.
  • Establish production readiness standards for new services.
  • Partner with engineering teams on capacity planning, scalability, and disaster recovery initiatives.
  • Automate operational processes and reliability improvements using software engineering best practices.
  • Continuously improve system reliability, availability, and operational efficiency.

Requirements:

  • 5+ years of experience in Site Reliability Engineering, Production Engineering, Reliability Engineering, or similar roles.
  • Proven experience operating production systems in high-availability environments.
  • Hands-on experience defining and managing SLOs, SLIs, and Error Budgets.
  • Experience leading production incident response and Incident Command.
  • Strong observability and monitoring experience.
  • Strong software engineering skills using Python, Go, or TypeScript.
  • Experience working with cloud platforms.
  • Strong written and verbal English communication skills.

Must have:

  • Proven Site Reliability Engineering experience.
  • Experience defining and managing:
    • Service Level Indicators (SLIs)
    • Service Level Objectives (SLOs)
    • Error Budgets
  • Experience leading Incident Command during major production incidents.
  • Experience conducting blameless postmortems and driving follow-up actions.
  • Experience designing, maintaining, and improving on-call programs.
  • Experience developing runbooks and escalation policies.
  • Strong observability experience, including:
    • Monitoring
    • Logging
    • Alerting
    • Distributed Tracing
  • Experience tuning alerts to reduce operational noise.
  • Strong automation skills using Python, Go, or TypeScript.
  • Experience supporting mission-critical production systems.
  • Experience working in high-availability production environments.

Nice to have:

  • Experience with Datadog.
  • Experience with AWS.
  • Experience with Heroku.
  • Experience working in regulated industries (e.g., healthcare under HIPAA, financial services).
  • Experience establishing or maturing an SRE practice.
  • Capacity planning experience.
  • Disaster recovery planning and execution.
  • Experience with Kubernetes.
  • Experience with PostgreSQL or SQL Server.
  • Experience supporting modern TypeScript-based applications.


Benefits & Perks:

Home office;
Competitive compensation based on experience;
Career plans to allow for extensive growth in the company;
International Projects;
Oowlish English Program (Technical and Conversational);
Oowlish Fitness with Total Pass;
Games and Competitions.


You can also apply here:

Website: https://www.oowlish.com/work-with-us/
LinkedIn: https://www.linkedin.com/company/oowlish/jobs/
Instagram: https://www.instagram.com/oowlishtechnology/



The Company
HQ: Fortaleza
106 Employees
Year Founded: 2017

What We Do

We are on a mission to give every company, no matter the size, the opportunity to innovate and help build a better future.

Our Services:

  • Dedicated Tech (full/partial) Squads: we create multi-disciplinary, remote (near-shore) Tech Squads that become part of your team. They adapt to your workflows and are trained in Agile methodologies to deliver continuous value. We believe that well-trained remote teams let clients increase innovation output by accessing a larger, more diverse pool of talent while reducing the cost of development.
  • On-Demand Software Development: at our core, we are software developers excited about building digital products and solutions using the latest technologies and agile methodologies. We provide end-to-end capabilities to deliver on your technical requests: Product Management, Tech Architecture, Front / Back End Development, DevOps & QA.
  • Venture Building: we partner with companies to co-launch new digital businesses that leverage core assets of the company (distribution channels, customer base, industry knowledge, proprietary technology, etc.). We take the co-created ideas into MLPs (Most Lovable Products), aiming to find product-market fit and scale in the leanest possible way. As startup founders ourselves, we love getting things from 0 to 1.

We are End-To-End Innovation Enablers, helping your company unlock its full innovation potential.
