The Company
Toters is an on-demand e-commerce and delivery platform that enables customers to get anything in their city with the highest level of convenience.
At Toters, technology is at the heart of everything we do. Our product teams work hard every day to create products that make our customers' lives easier. Our engineers also continuously create solutions that make our processes more efficient, all in an effort to reach our customers quickly and at the best cost. If you are interested in working in a high-growth startup environment and want to be part of a team that could change the way customers shop in the Middle East, apply now.
About the Role
We are looking for a Mid-Level Site Reliability Engineer who will play a critical role in ensuring high availability, performance, and resilience across our production systems. You will be at the heart of operational excellence, leading high-impact incident responses, building proactive monitoring systems, and engineering automation that prevents outages before they happen. If you love solving complex distributed system challenges and thrive in high-pressure environments, this role is for you.
Key Responsibilities
Incident Management & Reliability
- Act as Incident Commander during major outages, leading real-time diagnosis, communication, and recovery.
- Own and improve the end-to-end incident management lifecycle, including post-incident reviews and action plans.
- Drive root cause analysis and proactive reliability improvements to prevent recurrence.
Monitoring & Observability
- Design and maintain metrics, alerts, and dashboards using Prometheus, Grafana, and New Relic.
- Implement SLIs/SLOs to monitor service health and drive availability targets (99.99%+ uptime).
- Integrate log management and distributed tracing with tools like ELK Stack and AWS X-Ray.
Automation & Tooling
- Develop automation scripts and internal tooling in Python or Node.js to reduce manual ops and accelerate recovery (MTTR improvement).
- Build self-healing infrastructure using IaC and automation pipelines.
- Optimize on-call workflows, escalation policies, and runbooks using PagerDuty.
Cloud Infrastructure
- Operate and improve infrastructure hosted on AWS, ensuring reliability, cost efficiency, and scalability.
- Collaborate with backend and platform teams to embed SRE best practices across engineering.
Key Qualifications
- 2–4 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
- Proven success managing production incidents and participating in on-call rotations.
- Strong hands-on experience with Prometheus, Grafana, and PagerDuty.
- Proficient in Python or Node.js for automation and tooling.
- Experience with AWS services (EC2, CloudWatch, ECS/Lambda, IAM, etc.).
- Solid understanding of Linux systems, networking, and CI/CD pipelines.
Nice to Have
- Experience as Incident Commander in mission-critical environments.
- Knowledge of New Relic, Sentry, ELK Stack, or Datadog.
- Background implementing SLIs/SLOs/Error Budgets (Google SRE model).
- Familiarity with Docker, Kubernetes, Terraform, or Ansible.
- Certifications such as:
  - AWS Solutions Architect Associate/DevOps Engineer.
  - ITIL Foundation or relevant reliability certifications.
What We Do
Toters enables last-mile, same-day delivery of any local product near you. Available for iPhone and Android, the Toters service connects customers with retailers and with local couriers who purchase and deliver goods from any grocery store, restaurant, or other retail shop in your city.
Download the Toters app today or visit www.totersapp.com