Site Reliability Engineer

Posted 2 Days Ago
London, Greater London, England
In-Office
Mid level
Artificial Intelligence • Fintech • Information Technology • Business Intelligence • Financial Services
Xceptor makes data ingestion, data transformation and process digitisation easy.
The Role
As a Site Reliability Engineer, you will ensure service reliability and performance, develop observability standards, automate operational tasks, and collaborate with engineering teams to enhance service quality.
ABOUT XCEPTOR

Data is at the heart of everything we do: Xceptor has been designed around data manipulation in its broadest sense. We source data from wherever it flows. We curate, normalise, validate, repair, and enrich that data so it reaches its destination in a reliable and consistent format. Data coming out of Xceptor is data our clients can trust.

We are recognised as an expert in the Financial Services vertical, which strongly aligns with Business Users in Middle and Back-Office teams. We enable these users to solve their data challenges by themselves, rather than through a technology-led project.

Our mission is to empower business users within financial institutions to build automated processes that deliver trusted data.

Our values are:

  • Client Centricity
  • One Team
  • Impactful

🚀 Your Role:

Site Reliability Engineering (SRE) is a cross-cutting function that partners with tribes across Xceptor to make our services reliable, performant, secure, and operable in production. We set and evolve standards for SLOs/SLIs, observability, incident response, and operational controls, and we build automation that reduces toil and enables teams to ship safely at pace across cloud and on-prem deployments. 

Xceptor operates with an AI-first PDLC. AI agents act as digital delivery partners and members of the team, accelerating how we design, build, test, document, deploy, and operate our services. Reliability is engineered in through standards, automation, and measurable signals, with humans providing intent, constraints, verification, and accountability.

🧩 What You’ll Be Doing:

As a Site Reliability Engineer at Xceptor, you contribute at tribe level to reliability, performance, and operability. You help build and run the reliability system: observability standards, incident response practices, runbooks, and automation that reduces toil and improves service health over time. 

You partner closely with Software Engineering, QA, Platform Engineering, and Senior/Lead SREs to embed reliability into delivery without becoming a bottleneck. You will own well-scoped operational improvements end-to-end (design, implement, test, roll out, measure) and steadily increase your scope and independence. 

This is an AI-first SRE role. You use AI routinely to accelerate investigation, diagnostics, runbook creation, infrastructure automation, and operational reporting, while staying accountable for verification and safe operation. This role exists to make reliability measurable and repeatable, reduce operational toil through automation, and enable fast delivery without compromising safety, control, or customer trust. 

🎯 Who we're looking for:

Reliability Engineering (Build reliability into the system) 

  • Contributes to defining and improving SLIs/SLOs and service health signals, aligned to customer outcomes. 
  • Implements reliability improvements within established patterns (timeouts, retries, graceful degradation, safe failure modes). 
  • Supports capacity and performance work: basic baselining, load investigation, and scaling hygiene. 
  • Helps maintain operational quality across production and staging, and improves environment consistency where possible. 

 Incident Management & Operational Excellence 

  • Participates in incident response and on-call (as applicable), contributing to triage, mitigation, and recovery. 
  • Produces clear post-incident notes and supports root cause analysis, focusing on actions that prevent recurrence. 
  • Creates and improves runbooks/playbooks so incidents are faster and more consistent to resolve. 
  • Helps improve change safety through practical release/readiness checks and operational guardrails. 

 Observability & Production Signals: 

  • Implements and improves observability for services: logs, metrics, traces, dashboards, and alerting aligned to standards. 
  • Tunes alerts to reduce noise and improve actionability; helps manage flakiness and false positives. 
  • Builds and maintains service health dashboards that support quick diagnosis and release confidence. 
  • Works with QA and Engineering to align operational signals with end-to-end journey health. 

 Automation & Tooling (Make the right thing easy): 

  • Automates repetitive operational tasks and reduces toil through scripts, tooling, and pipeline improvements. 
  • Contributes to deployment automation and reliability guardrails in CI/CD, working with Platform Engineering. 
  • Implements and maintains IaC changes under guidance, ensuring changes are safe, reviewed, and measured. 
  • Improves diagnostics and “day 2” operations to make support and troubleshooting easier. 

 AI-First Operations (How you run SRE): 

  • Uses AI routinely to accelerate operational tasks (investigation, diagnostics, runbooks, automation drafts) with explicit verification. 
  • Works effectively in an “agents draft, humans verify” model for operational artefacts (scripts, dashboards, alerts, incident summaries). 
  • Applies safe operational controls when using AI (no unsafe remediation; careful handling of sensitive data). 
  • Learns from production outcomes and improves automation and guardrails based on real incidents and trends. 

 Collaboration & Enablement: 

  • Partners effectively with engineering teams to embed reliability into delivery without becoming a bottleneck. 
  • Communicates reliability risks and operational impacts clearly, escalating early when needed. 
  • Contributes to shared platform practices and standards across tribes (templates, runbooks, alerting patterns). 
  • Builds strong working relationships with stakeholders to support customer outcomes. 

 

Key Competencies 

 Technical: 

  • Experience supporting and improving production services with reliability and performance expectations. 
  • Working knowledge of cloud and cloud-native operations (Azure preferred), and the fundamentals of running services safely. 
  • Experience with IaC and automation (tooling/framework aligned to your stack), with good review and change discipline. 
  • Familiarity with CI/CD and deployment practices; able to improve pipelines and release safety under guidance. 
  • Practical observability skills: logs/metrics/traces, dashboards, and alert tuning. 
  • Comfortable with scripting and automation (e.g., PowerShell, CLI tooling).

AI-First SRE (Must Have): 

  • Uses AI to accelerate investigation, automation drafts, and runbook creation, and verifies outputs before use. 
  • Can follow and contribute to repeatable operational workflows and templates that improve reliability over time. 
  • Understands and mitigates AI risks in operations (unsafe actions, false confidence, confidentiality). 

 Non Technical: 

  • Calm, pragmatic, and reliable; communicates clearly during incidents and operational issues. 
  • Outcome-focused with a bias for automation and systemic fixes over manual effort. 
  • Collaborative and receptive to feedback; grows quickly in a high-tempo environment. 
  • Customer-aware mindset suitable for regulated, mission-critical environments. 

Required Education & Experience 

  • Experience as an SRE / DevOps / Production Engineer (typically 2–5 years). 
  • Experience supporting cloud services and operational automation in production environments; Azure experience beneficial. 
  • Experience contributing to CI/CD, IaC, and observability practices in a delivery team. 
  • Strong academic background, including a degree in a STEM discipline, or equivalent experience.

How Success Will Be Measured

This role is measured on outcomes and how they’re achieved: improving reliability and operational signal quality, reducing toil through automation, and supporting controlled change in an AI-first operating model. 

  • Reliability: SLO attainment, availability/performance trends, incident frequency/severity trend, and MTTR improvements 
  • Change safety: change failure rate and rollback rate improve; releases become safer and more predictable 
  • Observability: alert signal-to-noise improves (flake/noise down), coverage of key services/journeys increases, faster diagnosis from logs/metrics/traces 
  • Toil reduction: automation increases, manual operational overhead reduces, runbooks/playbooks drive consistent response 
  • Cost & capacity: capacity planning maturity improves; cost optimisation without risking SLOs 
  • Behaviours: AI-first by default (agents draft, humans verify); strong verification discipline; reliable incident participation; automation mindset; control-aware and security-conscious decisions 

 

Associated Values and Behaviours 

  • Collaboration: Encourage teamwork and knowledge sharing. 
  • Innovation: Support the exploration of new ideas and technologies. 
  • Integrity: Maintain transparency and ethical behaviour in all decisions.
  • Accountability: Take ownership of responsibilities and results. 
  • Respect: Value diverse perspectives and contributions. 
  • Continuous Improvement: Strive for excellence in processes and outcomes. 
  • Customer Focus: Prioritise solutions that meet or exceed customer expectations.

🌍 Diversity & Inclusion at Xceptor

We believe great ideas come from everywhere — and that the best teams are made up of people with different backgrounds, experiences, and perspectives. At Xceptor, we’re committed to building a workplace where everyone feels welcome, valued, and empowered to be themselves.

We know that not everyone ticks every single box in a job description — and that’s okay. If you’re excited about this role and think you could make a difference, we’d love to hear from you. Your unique skills and experiences might be just what we need, even if you don’t meet every requirement.

We celebrate diversity and are dedicated to creating an inclusive environment for all employees — regardless of race, gender identity, sexual orientation, age, disability, religion, or background.



Please note:

  • Xceptor works with clients in financial services, and our offers of employment are subject to the satisfactory completion of background checks, which include criminal record checks and credit reference checks.
  • If you have any employment gaps exceeding three months within the last six years, we will request additional information and evidence to clarify those periods.

Top Skills

AI
Azure
CI/CD
CLI Tooling
IaC
PowerShell

The Company
New York City, NY
190 Employees
Year Founded: 2003

What We Do

Xceptor delivers no-code data automation software across the enterprise. We make data ingestion, data transformation and process digitisation easy. Our platform has the power to automate even the most complex processes, end-to-end, with a single platform.

Proven by our customers. Validated by our partners. Powering industry utilities.

Why Work With Us

We offer exposure to global, blue-chip clients and an unparalleled experience of best practice. This is a fast-paced, dynamic, and highly collaborative environment where there are significant opportunities for growth and development.
