Xceptor

Site Reliability Engineer

Posted 2 Days Ago

London, Greater London, England

In-Office

Mid level

Artificial Intelligence • Fintech • Information Technology • Business Intelligence • Financial Services

Xceptor makes data ingestion, data transformation and process digitisation easy.

The Role

As a Site Reliability Engineer, you will ensure service reliability and performance, develop observability standards, automate operational tasks, and collaborate with engineering teams to enhance service quality.

ABOUT XCEPTOR

Data is at the heart of everything we do: Xceptor has been designed around data manipulation in its broadest sense. We source data from wherever it flows. We curate, normalise, validate, repair, and enrich that data so it reaches its destination in a reliable and consistent format. Data coming out of Xceptor is data our clients can trust.

We are recognised as an expert in the Financial Services vertical, where our focus is on Business Users in Middle and Back-Office teams. We enable these users to solve their data challenges themselves, rather than through a technology-led project.

Our mission is to empower business users within financial institutions to build automated processes that deliver trusted data.

Our values are:

Client Centricity
One Team
Impactful

🚀 Your Role:

Site Reliability Engineering (SRE) is a cross-cutting function that partners with tribes across Xceptor to make our services reliable, performant, secure, and operable in production. We set and evolve standards for SLOs/SLIs, observability, incident response, and operational controls, and we build automation that reduces toil and enables teams to ship safely at pace across cloud and on-prem deployments.

Xceptor operates with an AI-first PDLC. AI agents are a digital delivery partner and a member of the team, accelerating how we design, build, test, document, deploy, and operate our services. Reliability is engineered in through standards, automation, and measurable signals, with humans providing intent, constraints, verification, and accountability.

🧩 What You’ll Be Doing:

As a Site Reliability Engineer at Xceptor, you contribute at tribe level to reliability, performance, and operability. You help build and run the reliability system: observability standards, incident response practices, runbooks, and automation that reduces toil and improves service health over time.

You partner closely with Software Engineering, QA, Platform Engineering, and Senior/Lead SREs to embed reliability into delivery without becoming a bottleneck. You will own well-scoped operational improvements end-to-end (design, implement, test, roll out, measure) and steadily increase your scope and independence.

This is an AI-first SRE role. You use AI routinely to accelerate investigation, diagnostics, runbook creation, infrastructure automation, and operational reporting, while staying accountable for verification and safe operation. This role exists to make reliability measurable and repeatable, reduce operational toil through automation, and enable fast delivery without compromising safety, control, or customer trust.

🎯 Who we're looking for:

Reliability Engineering (Build reliability into the system):

Contributes to defining and improving SLIs/SLOs and service health signals, aligned to customer outcomes.
Implements reliability improvements within established patterns (timeouts, retries, graceful degradation, safe failure modes).
Supports capacity and performance work: basic baselining, load investigation, and scaling hygiene.
Helps maintain operational quality across production and staging, and improves environment consistency where possible.

Incident Management & Operational Excellence:

Participates in incident response and on-call (as applicable), contributing to triage, mitigation, and recovery.
Produces clear post-incident notes and supports root cause analysis, focusing on actions that prevent recurrence.
Creates and improves runbooks/playbooks so incidents are faster and more consistent to resolve.
Helps improve change safety through practical release/readiness checks and operational guardrails.

Observability & Production Signals:

Implements and improves observability for services: logs, metrics, traces, dashboards, and alerting aligned to standards.
Tunes alerts to reduce noise and improve actionability; helps manage flakiness and false positives.
Builds and maintains service health dashboards that support quick diagnosis and release confidence.
Works with QA and Engineering to align operational signals with end-to-end journey health.

Automation & Tooling (Make the right thing easy):

Automates repetitive operational tasks and reduces toil through scripts, tooling, and pipeline improvements.
Contributes to deployment automation and reliability guardrails in CI/CD, working with Platform Engineering.
Implements and maintains IaC changes under guidance, ensuring changes are safe, reviewed, and measured.
Improves diagnostics and “day 2” operations to make support and troubleshooting easier.

AI-First Operations (How you run SRE):

Uses AI routinely to accelerate operational tasks (investigation, diagnostics, runbooks, automation drafts) with explicit verification.
Works effectively in an “agents draft, humans verify” model for operational artefacts (scripts, dashboards, alerts, incident summaries).
Applies safe operational controls when using AI (no unsafe remediation; careful handling of sensitive data).
Learns from production outcomes and improves automation and guardrails based on real incidents and trends.

Collaboration & Enablement:

Partners effectively with engineering teams to embed reliability into delivery without becoming a bottleneck.
Communicates reliability risks and operational impacts clearly, escalating early when needed.
Contributes to shared platform practices and standards across tribes (templates, runbooks, alerting patterns).
Builds strong working relationships with stakeholders to support customer outcomes.

Key Competencies

Technical:

Experience supporting and improving production services with reliability and performance expectations.
Working knowledge of cloud and cloud-native operations (Azure preferred), and the fundamentals of running services safely.
Experience with IaC and automation (tooling/framework aligned to your stack), with good review and change discipline.
Familiarity with CI/CD and deployment practices; able to improve pipelines and release safety under guidance.
Practical observability skills: logs/metrics/traces, dashboards, and alert tuning.
Comfortable scripting and automation (e.g., PowerShell, CLI tooling).

AI-First SRE (Must Have):

Uses AI to accelerate investigation, automation drafts, and runbook creation, and verifies outputs before use.
Can follow and contribute to repeatable operational workflows and templates that improve reliability over time.
Understands and mitigates AI risks in operations (unsafe actions, false confidence, confidentiality).

Non-Technical:

Calm, pragmatic, and reliable; communicates clearly during incidents and operational issues.
Outcome-focused with a bias for automation and systemic fixes over manual effort.
Collaborative and receptive to feedback; grows quickly in a high-tempo environment.
Customer-aware mindset suitable for regulated, mission-critical environments.

Required Education & Experience

Experience as an SRE / DevOps / Production Engineer (typically 2–5 years).
Experience supporting cloud services and operational automation in production environments; Azure experience beneficial.
Experience contributing to CI/CD, IaC, and observability practices in a delivery team.
Strong academic background, including a degree in a STEM discipline, or equivalent experience.

How Success Will Be Measured

This role is measured on outcomes and how they’re achieved: improving reliability and operational signal quality, reducing toil through automation, and supporting controlled change in an AI-first operating model.

Reliability: SLO attainment, availability/performance trends, incident frequency/severity trend, and MTTR improvements
Change safety: change failure rate and rollback rate improve; releases become safer and more predictable
Observability: alert signal-to-noise improves (flake/noise down), coverage of key services/journeys increases, faster diagnosis from logs/metrics/traces
Toil reduction: automation increases, manual operational overhead reduces, runbooks/playbooks drive consistent response
Cost & capacity: capacity planning maturity improves; cost optimisation without risking SLOs
Behaviours: AI-first by default (agents draft, humans verify); strong verification discipline; reliable incident participation; automation mindset; control-aware and security-conscious decisions

Associated Values and Behaviours

Collaboration: Encourage teamwork and knowledge sharing.
Innovation: Support the exploration of new ideas and technologies.
Integrity: Maintain transparency and ethical behaviour in all decisions.
Accountability: Take ownership of responsibilities and results.
Respect: Value diverse perspectives and contributions.
Continuous Improvement: Strive for excellence in processes and outcomes.
Customer Focus: Prioritise solutions that meet or exceed customer expectations.

🌍 Diversity & Inclusion at Xceptor

We believe great ideas come from everywhere — and that the best teams are made up of people with different backgrounds, experiences, and perspectives. At Xceptor, we’re committed to building a workplace where everyone feels welcome, valued, and empowered to be themselves.

We know that not everyone ticks every single box in a job description — and that’s okay. If you’re excited about this role and think you could make a difference, we’d love to hear from you. Your unique skills and experiences might be just what we need, even if you don’t meet every requirement.

We celebrate diversity and are dedicated to creating an inclusive environment for all employees — regardless of race, gender identity, sexual orientation, age, disability, religion, or background.

#LI-GL1 #LI-Hybrid

Please note:

Xceptor works with clients in financial services, and our offers of employment are subject to the satisfactory completion of background checks, which include criminal record checks and credit reference checks.
If you have any employment gaps exceeding three months within the last six years, we will request additional information and evidence to clarify those periods.

Top Skills

Azure

CI/CD

CLI Tooling

IaC

PowerShell

The Company

New York City, NY

190 Employees

Year Founded: 2003

What We Do

Xceptor delivers no-code data automation software across the enterprise. We make data ingestion, data transformation and process digitisation easy. Our single platform can automate even the most complex processes end-to-end.

Proven by our customers. Validated by our partners. Powering industry utilities.

Why Work With Us

We offer exposure to global, blue-chip clients and an unparalleled experience of best practice. This is a fast-paced, dynamic, and highly collaborative environment where there are significant opportunities for growth and development.