Site Reliability Engineer (SRE)

Taguig City, Fourth District NCR, National Capital Region, PHL
In-Office
Mid level
Artificial Intelligence • Professional Services • Consulting • Automation
The Role
Maintain the reliability, scalability, and performance of an IoT telemetry platform by defining SLOs and error budgets, implementing monitoring (Prometheus/Grafana), automating infrastructure with Pulumi/TypeScript on AWS (EKS, MSK, S3) alongside SingleStore and MongoDB, leading incident response and post-mortems, enforcing least-privilege IAM, supporting SOC2/ISO 27001 compliance, and participating in a global follow-the-sun on-call rotation.

We’re an award-winning global outsourcer providing contact center and back office services on behalf of our global clients. Come work at a place where innovation and teamwork come together to support the most exciting missions in the world!

Role objective
The Site Reliability Engineer serves as the guardian of our production systems, ensuring the reliability, scalability, and performance of our IoT telemetry platform. You will define and enforce Service Level Objectives (SLOs), automate operational processes, and build the infrastructure and tooling that enables our engineering teams to deploy with confidence. By implementing comprehensive monitoring, incident response procedures, and reliability practices, you will play a pivotal role in maintaining the uptime and data freshness that our customers depend on for their critical fleet operations.

The role will focus on the following key areas:
SLO Management
Infrastructure Automation
Incident Response
Security & Compliance

Key Responsibilities
Responsibilities of the Site Reliability Engineer will include but are not limited to:
Service Level Management & Reliability
• Define, monitor, and enforce Service Level Objectives (SLOs) and error budgets across all production systems
• Track error budget burn rates and make data-driven decisions to halt risky deployments when thresholds are exceeded
• Implement comprehensive monitoring and alerting strategies using Prometheus, Grafana, and PagerDuty
• Establish and maintain reliability standards that support business-critical uptime requirements
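
For candidates unfamiliar with the terminology, the error budget and burn-rate check described above can be sketched roughly as follows. This is a minimal illustration, not the team's actual tooling: the names `SloWindow`, `errorBudget`, `burnRate`, and `shouldHaltDeployments`, and the threshold passed in, are all hypothetical.

```typescript
// Illustrative sketch only; all names and thresholds here are assumptions.

interface SloWindow {
  sloTarget: number;      // e.g. 0.999 for a 99.9% success objective
  totalRequests: number;  // requests observed in the measurement window
  failedRequests: number; // requests that violated the objective
}

/** Fraction of requests that may fail while still meeting the SLO. */
function errorBudget(sloTarget: number): number {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError("sloTarget must be strictly between 0 and 1");
  }
  return 1 - sloTarget;
}

/**
 * Burn rate = observed error rate / allowed error rate.
 * A value of 1 means the budget is being spent exactly as fast as the SLO allows.
 */
function burnRate(window: SloWindow): number {
  if (window.totalRequests === 0) {
    return 0; // no traffic, so no budget consumed
  }
  const observedErrorRate = window.failedRequests / window.totalRequests;
  return observedErrorRate / errorBudget(window.sloTarget);
}

/** True when the burn rate exceeds the (assumed) threshold for pausing risky deploys. */
function shouldHaltDeployments(window: SloWindow, maxBurnRate: number): boolean {
  return burnRate(window) > maxBurnRate;
}

// Usage: 500 failures out of 100,000 requests against a 99.9% objective.
const lastHour: SloWindow = { sloTarget: 0.999, totalRequests: 100_000, failedRequests: 500 };
console.log(burnRate(lastHour).toFixed(2), shouldHaltDeployments(lastHour, 2));
```

In practice the request counts would come from a metrics backend such as Prometheus, and the threshold would be agreed per service.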
Infrastructure Automation & Management
• Design and implement Infrastructure as Code (IaC) solutions using Pulumi with TypeScript
• Manage and optimize AWS services including EKS (Elastic Kubernetes Service), MSK (Managed Streaming for Apache Kafka), and S3, alongside SingleStore and MongoDB
• Automate operational processes to eliminate toil, targeting any task that consumes more than 2 engineer-days per quarter
Incident Response & Post-Mortem Leadership
• Serve as incident commander during production outages and service degradations
• Lead comprehensive post-mortem processes within 48 hours of incidents
• Drive "never-again" corrective actions to completion, ensuring systemic improvements
• Maintain and improve incident response procedures and runbooks
Security & Compliance
• Implement and enforce least-privilege IAM policies across all AWS resources
• Manage security patch pipelines and vulnerability remediation processes
• Support compliance initiatives including SOC2 and ISO 27001 certification requirements
• Ensure security best practices are embedded in all infrastructure and operational procedures
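
As a rough illustration of what enforcing least privilege can look like in day-to-day work, the sketch below checks an IAM policy document for wildcard grants. It assumes the standard IAM JSON policy shape (`Version`, `Statement`, `Effect`, `Action`, `Resource`); the function `findWildcardViolations` and the bucket name `example-telemetry-bucket` are hypothetical and not taken from the team's codebase.

```typescript
// Illustrative sketch only; helper names and the bucket name are assumptions.

interface PolicyStatement {
  Effect: "Allow" | "Deny";
  Action: string | string[];
  Resource: string | string[];
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

function toArray(value: string | string[]): string[] {
  return Array.isArray(value) ? value : [value];
}

/** Lists Allow statements that grant every action, every action of a service, or every resource. */
function findWildcardViolations(policy: PolicyDocument): string[] {
  const violations: string[] = [];
  policy.Statement.forEach((statement, index) => {
    if (statement.Effect !== "Allow") {
      return; // Deny statements cannot widen access
    }
    for (const action of toArray(statement.Action)) {
      if (action === "*" || action.endsWith(":*")) {
        violations.push(`Statement ${index}: broad action "${action}"`);
      }
    }
    for (const resource of toArray(statement.Resource)) {
      if (resource === "*") {
        violations.push(`Statement ${index}: unrestricted resource "*"`);
      }
    }
  });
  return violations;
}

// Usage: a narrowly scoped read-only policy for one (hypothetical) bucket.
const telemetryReader: PolicyDocument = {
  Version: "2012-10-17",
  Statement: [
    {
      Effect: "Allow",
      Action: ["s3:GetObject"],
      Resource: "arn:aws:s3:::example-telemetry-bucket/*",
    },
  ],
};
console.log(findWildcardViolations(telemetryReader).length === 0 ? "ok" : "violations found");
```

A check like this is deliberately simple; it could run in CI before infrastructure changes are applied, but it is not a substitute for a full policy analysis.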
On-Call & Operational Excellence
• Participate in a follow-the-sun on-call rotation with a one-week primary/secondary commitment every five weeks
• Provide 24×7 support coverage across AU/NZ, EU/ZA, and MX time zones
• Maintain operational runbooks and knowledge transfer documentation
• Continuously improve the on-call experience and reduce alert fatigue

Join the A-Team and experience the A-Life!

Skills Required

  • Define and enforce SLOs and manage error budgets across production systems
  • Implement monitoring and alerting using Prometheus and Grafana
  • Operate incident management tooling and use PagerDuty for on-call/alerting
  • Design and implement Infrastructure as Code using Pulumi with TypeScript
  • Manage and optimize AWS services including EKS, MSK (Kafka), S3
  • Experience with SingleStore and MongoDB for production data workloads
  • Serve as incident commander and lead post-mortems with corrective actions
  • Implement least-privilege IAM policies and manage vulnerability remediation
  • Support compliance initiatives such as SOC2 and ISO 27001
  • Participate in a follow-the-sun on-call rotation providing 24x7 coverage

The Company
9,500 Employees
Year Founded: 2006

What We Do

Acquire Intelligence is a global business transformation company and leading provider of business process outsourcing (BPO) and AI consulting services. Using their Automate, Eliminate, Reallocate framework, they blend process improvement and automation with global outsourcing to help businesses eliminate inefficiencies, drive scale, and achieve real-world outcomes.
