The DevOps / SRE Engineer owns the operational substrate of an AI-native retail decisioning platform — infrastructure, CI / CD, observability, cost meter, and incident response for a system that runs production agents taking real business actions. The role builds on the enterprise Terraform standard, CI / CD spine, and FinOps tagging policy rather than reinventing parallel infrastructure.
Remote candidates outside of Thailand are welcome to apply.
Key Responsibilities
- Adopt the enterprise Terraform standard and module library for all platform infrastructure; author platform-specific modules where needed (agent runtime, vector DB, knowledge graph); run drift detection weekly.
- Build platform-specific CI / CD pipelines on the enterprise spine — service deploys, agent deploys, eval-gate enforcement; integrate eval gates so that no agent reaches production without a passing eval.
- Operate rollback orchestration with sub-15-minute recovery; run quarterly game days.
- Own the platform observability stack — OpenTelemetry, Langfuse for LLM traces, custom dashboards for per-agent cost.
- Implement the per-agent cost meter end-to-end — token counts, vector queries, model inference, downstream LLM Gateway costs; surface cost data to the enterprise GenAI cost dashboard.
- Stand up the platform on-call rotation; author runbooks for every production agent and service; lead incident response with measurable corrective actions.
- Implement platform cost-tagging policy consistent with the enterprise standard (team, domain, environment, project, agent, suite, persona); report monthly to Cost Review.
- Drive cost optimisation — right-sizing, caching, model routing decisions, reserved compute.
Requirements
- Bachelor's or Master's degree in Computer Science, Engineering, or a related discipline.
- 5+ years SRE / DevOps with production ownership.
- Terraform at scale — modules, state, drift, environment promotion.
- CI / CD for data + ML / AI services (GitLab CI / CD or comparable).
- Cloud platform (Azure preferred; AWS / GCP transferable).
- Observability — OpenTelemetry, Langfuse (or a comparable LLM tracing tool), custom dashboards.
- FinOps — tagging policies, attribution, optimisation.
- Incident response — on-call, post-mortems, runbook authorship.
Preferred Qualifications
- AI / agent platform SRE experience; cost-meter / chargeback systems built or operated.
- Multi-cloud production experience; open-source contributions to IaC / observability tooling.
- AI / ML / agent system observability instrumentation (LLM cost, agent cost, eval scores).
- Vendor certifications such as HashiCorp Terraform Associate / Professional, Azure Solutions Architect Associate, or Databricks Data Engineer Professional.
Skills Required
- 5+ years SRE / DevOps experience
- Expertise in Terraform at enterprise scale
- Experience with CI / CD for ML / AI services
- Knowledge of OpenTelemetry for observability
- Senior-level incident response experience
- Experience from recognized companies in AI or data-intensive platforms
What We Do
Makro PRO is an exciting new digital venture by the iconic Makro. Our proud purpose is to build a technology platform that will help make business possible for restaurant owners, hotels, and independent retailers, and open the door for sellers. Makro PRO brings together the best talent from multinational companies to transform the B2B marketplace ecosystem. We welcome bold, energetic, and thoughtful people who share our belief in collaboration, diversity, excellence, and putting customers at the heart of our work.