We are looking for a Site Reliability Engineer to join us at Thales and work with our Payment Solutions. The Site Reliability Engineer empowers product, delivery, and SRE teams to implement a holistic observability approach across AWS and GCP. We design observability standards, build reusable frameworks, and partner with teams to achieve end-to-end visibility, from Node.js and Java services to business outcomes. Our mission: make service performance measurable, detect incidents proactively, and accelerate investigations with trustworthy telemetry.
A day in the life of an SRE:
Build and maintain observability frameworks for AWS/GCP
- Create reusable Datadog instrumentation for Node.js and Java
- Provide auto-instrumentation templates and enforce observability quality standards
- Publish Terraform modules for Datadog resources and cloud integrations
Own Datadog dashboards and measurement standards
- Define and curate source-of-truth dashboards and KPIs
- Establish golden signals and semantic conventions across services
- Manage observability-as-code repos in GitLab
Improve monitoring, alerting, and incident readiness
- Design precise, low-noise Datadog monitors and routing
- Implement synthetics for critical flows and correlate with traces/logs
- Partner with SREs on SLOs, error budgets, and incident triggers
Drive continuous learning and adoption
- Turn post-incident learnings into improved monitors, dashboards, and CI/CD checks
- Deliver training, documentation, and hands-on support for developers and SREs
Consult, enable, and optimize
- Coach teams on instrumentation and APM best practices
- Strengthen AWS/GCP observability integrations and tagging strategy
- Optimize Datadog cost, sampling, retention, and cardinality; rationalize monitors
Typical interactions:
SRE: alert quality, troubleshooting, SLOs, post-incident reviews
Product/Dev: instrumentation, trace propagation, business KPIs
Platform/Infra: cloud integrations, Terraform, RBAC, cost/performance
Security/Compliance: telemetry governance, PII controls, retention policies
Leadership: service health roll-ups, reliability and adoption metrics
Skills & experience:
Strong engineering background in Node.js and/or Java (Datadog dd-trace, async context propagation, middleware patterns)
Cloud expertise in AWS — serverless, containers, managed services, and integrating cloud telemetry with Datadog
Automation skills with GitLab CI/CD and Terraform (Datadog resources, modules, workflows)
Datadog proficiency — APM, logs, metrics, synthetics, monitors, SLOs, and observability-as-code practices
Observability mindset — defining SLIs/SLOs, improving alert quality, and supporting the full incident lifecycle
Strong communication skills — clear documentation, training delivery, and confident English communication with distributed teams
What We Do
Thales is a global high technology leader investing in digital and “deep tech” innovations – connectivity, big data, artificial intelligence, cybersecurity and quantum technology – to build a future we can all trust, which is vital to the development of our societies. The company provides solutions, services and products that help its customers – businesses, organisations and states – in the defence, aeronautics, space, transportation and digital identity and security markets to fulfil their critical missions, by placing humans at the heart of the decision-making process.








