Senior Platform SRE

Posted Yesterday
Bangalore, Bengaluru Urban, Karnataka, IND
In-Office
Senior level
Fintech • Payments • Financial Services
The Role
Drive IG's reliability platform: implement OpenTelemetry-based observability, own SLOs/error budgets, build CI/CD and self-healing automation, run chaos experiments, mentor engineers, lead incident response and blameless post-incident reviews, and set organisation-wide SRE standards.
Summary Generated by Built In

Job Title

Senior Platform SRE

Job Description

So, who are we?

IG has been at the centre of retail trading and investment since 1974, when we helped create the market for financial spread betting. Today, we're a FTSE 100 fintech operating across five continents, serving over 700,000 clients and handling billions in transactions - built on decades of scale, trust and proof. We didn't pivot to innovation; it's how we've always operated. What that means for the people who work here is real: genuinely complex problems to solve, the technology and resources to tackle them properly, and the kind of scope that’s rare in established businesses. The bar is high - bring a curious and forward-thinking mindset and we'll give you the platform to define what comes next. Join us at IG – the future gets built here.

Your team

The Platform SRE team is the engine of IG’s reliability programme. We sit within Infrastructure & Operations, working across IG’s hybrid estate of on-premises HashiCorp Nomad and AWS.  

We are not a reactive ops team. We build the platform, standards, and tooling that make reliability the default for every engineering team at IG. Through the SRE Guild, we connect with Domain SREs and Reliability Champions across the organisation, setting the bar and lifting it together. 

Your role in the team's success

You will be a hands-on technical contributor at the heart of the Platform SRE team, owning pieces of the reliability platform that hundreds of engineers depend on. You will work at the intersection of software engineering, observability, and systems reliability, turning reliability from a reactive concern into a proactive engineering discipline. 

You will partner with Platform Engineering, product teams, and Reliability Champions to define what good looks like in production and then make it the default. You will contribute to the SRE Guild, mentor engineers across the organisation, and when things go wrong, you will be on the call helping to mitigate, understand, and prevent a repeat. 

What you'll do

Build and own the reliability platform 

  • Implement comprehensive monitoring and observability using OpenTelemetry and distributed tracing. Maintain SLOs, error budgets, and burn-rate tracking

  • Establish and maintain 24/7 operational readiness including automated deployments, blue/green releases, and zero-downtime patching strategies 

  • Engineer self-healing capabilities: auto-remediation, error-budget-gated rollback, and automated traffic rerouting 

  • Design and run chaos experiments across the AWS estate, turning severe-but-plausible failure scenarios into engineering improvements 

  • Build automation tools and CI/CD pipelines that embed reliability practices, while applying software engineering discipline including version control, code reviews, and testing. 

  • Contribute to the SRE AI agent, IG’s agentic tooling for incident investigation and reliability review, built on AWS frontier models 

  • Mentor junior SREs and Reliability Champions on reliability patterns and production engineering discipline 

 

Set and uphold standards 

  • Author and evolve the SRE standards that underpin the Guild: SLO methodology, error budget policy, observability instrumentation guide, and Production Readiness Review (PRR) checklist 

  • Mentor developers on reliability patterns including circuit breakers, retry logic, and fault tolerance 

  • Work with development teams and Reliability Champions to design SLOs around customer journeys rather than individual services

  • Assist and guide teams in system design, capacity planning, architectural reviews and closing observability gaps. 

 

Own incident response and learning 

  • Facilitate blameless post-incident reviews (PIRs) within five working days using contributing-factor methodology 

  • Maintain the Lessons Register, track remediation actions to closure, and surface patterns across incidents quarterly 

What you'll need for this role

Essential Technical Skills 

  • Observability and instrumentation: hands-on OpenTelemetry experience (spans, metrics, traces, context propagation) and production use of Honeycomb, Datadog, Dynatrace, or Grafana; able to instrument Java or Python services directly. 

  • SLOs and error budgets: proven track record designing customer-meaningful SLIs, setting error budgets, configuring multi-window burn-rate alerts, and working with development teams on reliability measurement 

  • CI/CD and release engineering: experience building pipelines with safety mechanisms (blue/green and canary releases, automated rollback) and DORA metrics integration

  • Container orchestration: Kubernetes (EKS, AKS, or GKE) required; HashiCorp Nomad is a strong advantage on IG’s hybrid estate; solid understanding of cloud networking and IaC (Terraform preferred) 

  • Software engineering: production-quality coding in Java and/or Python; comfortable contributing to application codebases to implement reliability patterns, not just configuring infrastructure around them 

  • Distributed systems: strong understanding of how large-scale systems fail and how to make them fail safely, using circuit breakers, bulkheads, idempotency, graceful degradation, and load-shedding; experience in high-throughput, low-latency environments preferred

  • Incident management: on-call experience on production systems, blameless PIR facilitation, contributing-factor analysis, and driving action items to closure; PagerDuty and ServiceNow familiarity helpful 

  • Chaos engineering: experience designing and executing hypothesis-driven experiments with blast-radius controls and gap-to-impact-tolerance analysis; AWS FIS, Gremlin, or equivalent 

  • Community and standards: at ease in a guild or community-of-practice model; comfortable writing RFCs, presenting at engineering forums, and building standards that others will adopt 

 

Experience Requirements 

  • Track record in high-throughput production environments (financial services, trading platforms, or similar mission-critical systems preferred)

  • Demonstrated ability to improve system reliability and performance at scale 

  • Experience working collaboratively with development teams to implement observability and reliability improvements 

  • Strong troubleshooting skills in distributed systems environments 

Core Competencies 

  • Systems thinking approach to problem-solving 

  • Excellent communication skills for cross-functional collaboration and technical enablement 

  • Ability to balance hands-on development work with operational responsibilities 

  • Strong bias toward automation and eliminating manual toil 

How we work

We try to take a thoughtful approach to our ways of working as a company. We follow a hybrid working model with 3 days in the office, which we think balances the need to collaborate effectively and connect with each other. When it comes to how we deliver, there are 5 things we want everyone to do to drive high performance, better learning and career satisfaction:

  • Lead and inspire: Drive trust, alignment, and enthusiasm
  • Think big: Focus on the problems that most impact commercial outcomes
  • Champion the client: Understand and prioritise clients' needs
  • Deliver at pace: Push for fast, sustainable growth
  • Raise the bar: Take ownership, be accountable and share feedback

We believe that diversity is vital to success: it fuels creativity, drives innovation and sets us up for global success. We're committed to building teams with a variety of perspectives and skills to help us realise our vision and strategy. That's why we encourage applications from people with diverse backgrounds and experiences to join us on this journey. Learn more about our D&I approach here.

The Perks

Your growth fuels our success! Thrive with tailored development programs, mentoring opportunities with leaders, and clear career progression. Expand your network through committees, sports and social clubs. Enjoy extra time off for volunteering and community work.

Learn more about the Perks here!

Join us for this exciting journey. Apply now!

Number of openings

1

Skills Required

  • Hands-on OpenTelemetry experience including spans, metrics, traces, and context propagation; ability to instrument Java or Python services directly
  • Production use of observability tools such as Honeycomb, Datadog, Dynatrace, or Grafana
  • Designing SLIs/SLOs, error budgets, multi-window burn-rate alerts, and reliability measurement with development teams
  • Experience building CI/CD pipelines with safety mechanisms (blue/green, canary, automated rollback) and DORA metrics integration
  • Kubernetes experience (EKS, AKS, or GKE)
  • HashiCorp Nomad experience
  • Infrastructure as Code experience (Terraform preferred); solid understanding of cloud networking
  • Production-quality coding in Java and/or Python and willingness to contribute to application codebases
  • Deep understanding of distributed systems reliability patterns (circuit breakers, bulkheads, idempotency, graceful degradation, load-shedding)
  • On-call experience for production systems, facilitation of blameless PIRs, and incident remediation tracking (PagerDuty/ServiceNow familiarity helpful)
  • Chaos engineering experience; designing and executing hypothesis-driven experiments with blast-radius controls (AWS FIS, Gremlin, or equivalent)
  • Experience mentoring engineers, writing RFCs, presenting to engineering forums, and contributing to community standards/guilds
  • Track record in high-throughput, mission-critical production environments (financial services or similar preferred)

The Company
HQ: London
2,748 Employees

What We Do

We’ve been at the forefront of trading innovation since 1974, taking on the challenge to deliver an unmatched experience for our clients and raise the bar for tomorrow’s opportunities. Today, we’re a global fintech company incorporating the IG, tasty, IG Prime, Spectrum and DailyFX brands, with a presence in 18 countries across five continents – Europe, North America, Africa, Asia-Pacific and the Middle East. We’re an organisation of positive problem-solvers, united and inspired by our purpose, which is to power the pursuit of financial freedom for the ambitious. Our award-winning products and platforms empower go-getters the world over to unlock opportunities around the clock, giving them access to over 19,000 financial markets. Today, more than 400,000 clients call IG Group home. IG Group Holdings plc is an established member of the FTSE 250 and holds a long-term investment grade credit rating of BBB- with a stable outlook from Fitch Ratings.
