What will you do?
- Own OpenTelemetry Pipelines: Design, implement, and maintain observability pipelines across the three main signals—logs, metrics, and traces—ensuring standardized, scalable, and efficient data ingestion. Optimize ingestion strategies to balance cost, performance, and usability.
- Empower Engineering Teams: Build self-service automation and tooling that enables development teams to instrument and leverage observability without requiring manual intervention from the SRE team. Drive adoption of best practices while ensuring teams own their telemetry.
- Support Incident Management: Act as the engineering arm of our Incident Management Team, designing the processes, playbooks, checklists, and automations that the team and other engineers follow during an incident.
- Collaborate Across Teams: Work with members of almost every team across the business to understand their monitoring, alerting, and SLO/SLA requirements, and design systems and processes that ensure we meet or exceed them. Influence architectural decisions during initial design stages to ensure resiliency and scale from the outset of software development.
- Automate Observability Infrastructure: Leverage Infrastructure-as-Code (IaC) to provision and manage monitoring tools, alerting rules, and our observability configurations across OTEL Pipelines.
- Define Baseline Observability Standards: Design baseline requirements for new and existing services to ensure that all dLocal infrastructure and code are monitored consistently and accurately.
- Own Technical and Security Health: Take full ownership of dLocal’s infrastructure reliability, ensuring adherence to key availability and security KPIs.
- Optimize Alerting Systems: Continuously refine alerting signals to minimize noise and ensure they are always actionable, reducing fatigue and improving response efficiency.
Which skills do you need?
- Over 4 years of experience as an SRE or in a very similar role more focused on observability.
- Expertise in Kubernetes, including its core components, deployment methodologies, and monitoring best practices.
- Some understanding of OpenTelemetry, including setting up OTEL collectors, instrumentation, and pipeline optimization.
- Proficiency with monitoring and logging tools such as Grafana, Prometheus, Loki, New Relic, or Datadog.
- Hands-on experience with IaC tools (Terraform) and GitOps CI/CD solutions (ArgoCD, GitHub Actions, or similar).
- Experience integrating incident management platforms (PagerDuty, Jira) with automated alerting workflows.
- Strong scripting abilities (Python, Go, or similar) for automating observability tasks.
- A problem-solving mindset, with the ability to collaborate across multi-functional teams to drive reliability improvements.
- Cloud experience, especially AWS and ECS-based workloads.
- Experience managing observability pipelines at scale in high-throughput environments.
- Familiarity with Configuration-as-Code (Ansible, Chef, or SaltStack) for managing configurations across legacy instances.
- Database performance monitoring experience, particularly in large-scale distributed environments.
What We Do
dLocal started with one goal: to close the payments innovation gap between global enterprise companies and customers in emerging economies. We have over 900 payment methods in more than 40 countries.
With the ability to accept local payment methods and facilitate cross-border fund settlement worldwide, our merchants reach billions of underserved consumers in the high-growth markets of Africa, Asia, and Latin America. dLocal offers the ideal payment solutions for global commerce:
Payins: Accept local payment methods
Payouts: Compliantly send funds cross-border
Defense Suite: Manage fraud effectively
dLocal for Platforms: Unify your platform’s payment solution
Local Issuing: Localize payments for your gig-economy workers, suppliers, and partners