Grubtech

Site Reliability Engineer

Colombo, LKA

Hybrid

Senior level

eCommerce • Retail • Sales • Software

The Role

The role focuses on enhancing the reliability and performance of Grubtech's production systems through effective management of cloud environments and incident response.

Grubtech is a unified commerce engine purpose-built for the food and beverage industry. We serve a wide
range of customers - from SMBs to mid-market and enterprise brands - helping them manage and scale
their operations across multiple digital and physical channels.
Our platform integrates online ordering, POS, delivery aggregators, loyalty, and more - giving restaurants
the tools they need to thrive in a digital-first world.

Role Overview
This is a key role focused on improving the reliability, availability, performance, and operational maturity
of Grubtech's production systems. This individual will manage and improve AWS-based cloud
environments, including ECS-based workloads, strengthen monitoring, alerting, logging, and observability
capabilities, and support effective incident management for mission-critical workloads. The role will
partner closely with application, DevOps, infrastructure, and support teams to prevent incidents, respond
quickly when issues occur, improve production readiness, and reduce operational toil through automation
and continuous improvement.

Profile:
• Bachelor’s degree in Computer Science, Software Engineering, or a related field.

• Minimum 5 years of hands-on experience in Site Reliability Engineering, DevOps, cloud platform
engineering, infrastructure operations, or production engineering.

• Strong hands-on experience operating, troubleshooting, and improving production workloads in
AWS; experience with Azure or on-prem deployments would be an added advantage.

• Experience with core AWS services and production operations, including VPC, EC2, ECS, IAM, Load
Balancers, CloudWatch, RDS, Security Groups, and related cloud services.

• Hands-on working experience with Datadog is a must, including monitoring, alerting, application
performance monitoring, logging, dashboards, and service health visibility.

• Ability to continuously improve existing Datadog dashboards, monitors, alert thresholds, and
operational views as services evolve and production needs change.

• Experience managing and improving incident management capabilities, including incident triage,
escalation, communication, root-cause analysis, post-incident reviews, and follow-up actions.

• Experience defining and improving reliability practices such as SLOs, SLIs, error budgets, runbooks,
playbooks, operational readiness checks, and on-call processes.

• Experience troubleshooting distributed systems, AWS infrastructure, ECS workloads, networking,
databases, and application performance issues in production environments.

• Experience with multiple scripting languages such as Python, Bash, PowerShell, and JavaScript.

• Experience with managed data platforms such as MongoDB Atlas, Confluent Cloud, Couchbase,
PlanetScale, ClickHouse, Redis, and Postgres.

• Experience supporting mission-critical Linux systems at scale; Windows experience is a plus.

• Experience supporting cloud networking components, including DNS, Web Application Firewalls,
Security Groups, Network Access Control Lists, and load balancers.

• Experience supporting containerized workloads using Docker and AWS ECS.

• Expertise with cloud monitoring and management systems.

• Experience with cloud security principles and best practices.

• Familiarity with GitHub and GitHub Actions for managing CI/CD pipelines, release workflows, and
deployment automation.

• Experience with monitoring and management tools such as Datadog, Prometheus, Grafana, and the
ELK stack.

• Ability to analyze current technology and operational processes, then develop practical steps to
improve reliability, alert quality, scalability, and operational efficiency.

• Willingness to participate in incident response and on-call support for production systems when
required.

• Strong problem-solving and analytical skills.

• Strong English communication skills.

• Ability to multitask, work well under pressure, and prioritize work against competing deadlines
and changing business priorities.

Skills Required

Bachelor's degree in Computer Science, Software Engineering, or a related field
Minimum 5 years of hands-on experience in Site Reliability Engineering, DevOps, or related fields
Experience operating, troubleshooting, and improving production workloads in AWS
Hands-on experience with Datadog for monitoring and alerting
Experience defining and improving reliability practices like SLOs and SLIs
Strong problem-solving and analytical skills

The Company

Dubai

193 Employees

Year Founded: 2019

What We Do

Grubtech is a unified commerce engine for Enterprise F&B, Grocery, and Pharmaceutical Merchants using multiple online sales channels and back-end operations. Our main product, gOnline, connects all order sources to downstream systems like POS, ERP, Fleet Management, 3PLs, and Loyalty Programs. Our smart solutions help streamline business operations and make the most of data for key business decisions. Based in Dubai, Grubtech also has offices in Sri Lanka, Egypt, and Spain, serving customers in 18 markets.