Capacity and Performance Reliability Manager

London, Greater London, England
In-Office
Senior level
Fintech • Financial Services
The Role
Responsible for ensuring the capacity and performance of IT infrastructure, conducting forecasting, performance tuning, and continuous improvement of operational resilience. Collaborates with teams on reliability metrics and compliance.

Shift Pattern:

Standard 40 Hour Week (United Kingdom)

Scheduled Weekly Hours:

40

Corporate Grade:

D - Assistant Vice President

Reporting Line:

(UK Division) Information Technology

Location:

UK-London

Worker Type:

Permanent

About the London Metal Exchange

The London Metal Exchange (LME) is the world centre for industrial metals trading. Most of the world’s non-ferrous futures business is conducted on the LME’s three trading platforms. In 2024, trading totalled $18 trillion, 178 million lots and 4 billion tonnes, with market open interest reaching a high of 1.8 million lots.

The metals community uses the LME, an HKEX Group company, as a venue to transfer or take on price risk, as a physical market of last resort and as the provider of transparent global reference prices.

www.lme.com

Overall Purpose of Role:

Capacity Management at the LME is a key function, linked to strict regulatory compliance requirements, that actively manages multiple environments. With a large virtual estate encompassing multiple VMware clusters and OpenShift Container Platform (OCP), the Capacity and Performance Reliability Engineer is key to ensuring the stability of the platforms.

The Capacity and Performance Reliability Engineer is responsible for ensuring the reliability, availability, and performance of all infrastructure and services, proactively identifying and mitigating risks, and driving continuous improvement in operational resilience and service quality. This includes maintenance of the capacity management tool suite, capacity reporting, trend analysis and forecasting, ad hoc performance investigations, demand management, and governance of the relevant processes and policies.

The Capacity and Performance Reliability Engineer must have extensive knowledge of trading technologies and the operation of a trading venue, with the ability to incorporate business metrics and knowledge into the technical metrics from the LME core systems.

Responsibilities

Capacity Planning & Performance Management

  • Use historical data and predictive analytics to forecast demand and plan capacity for all environments (virtual, containerised, and physical).
  • Perform stress testing, scenario modelling, and performance tuning to ensure systems can handle peak loads.
  • Automate scaling, resource allocation, and infrastructure provisioning using Infrastructure as Code (IaC) and cloud-native tools.
  • Maintain and enhance the Capacity Management tool suite (e.g., Athene, Grafana), ensuring zero data loss and maximum automation.

Collaboration & Continuous Improvement

  • Work closely with development, operations, and business teams to embed reliability and capacity considerations into system design and delivery.
  • Promote best practices in automation, observability, and incident management.
  • Present findings, reports, and recommendations to business heads, service managers, and technical teams.
  • Build relationships with internal and external stakeholders, including architects, testing teams, service managers, project sponsors, and third-party suppliers.

Metrics, Reporting & Governance

  • Produce regular service and infrastructure capacity plans, reliability reports, and recommendations for action.
  • Own and manage the Capacity Management Recommendations tracker.
  • Report on reliability metrics, incidents, and system health to senior management.
  • Ensure compliance with regulatory requirements and internal governance standards.

Reliability Engineering & System Health

  • Develop, implement, and maintain Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for critical services.
  • Design and manage monitoring, alerting, and observability solutions to detect and prevent failures.
  • Lead incident response, conduct blameless post-incident reviews, and drive corrective actions to prevent recurrence.
  • Champion a reliability-focused, automation-first culture across teams.

Professional Qualifications Required:

  • Educated to degree standard and/or 5+ years of performance and capacity experience
  • ITIL Foundation Certification
  • Currently holds or has previously held a similar position
  • Experience in reliability engineering, site reliability engineering (SRE), or similar roles is highly desirable.

Required Knowledge and Level of Experience:

  • At least 5 years’ experience in performance, capacity, or reliability management within a business-critical global banking, financial services, or technology environment.
  • In-depth knowledge of trading technologies, system performance, and their linkage to business metrics.
  • Proven experience with capacity forecasting, modelling, and analysis techniques.
  • Strong analytical skills for transforming machine data into actionable insights.
  • Experience managing relationships at all levels, from technical specialists to non-technical business representatives.
  • Proficiency with monitoring and automation tools (e.g., Athene, Grafana, Prometheus, DataDog, Terraform, Kubernetes, CI/CD pipelines).
  • Significant SQL knowledge and high-level expertise in Excel.
  • Ability to code in programming and query languages (e.g., Visual Basic, MS SQL, Python).
  • Understanding of APIs and automation scripting.
  • Knowledge of cloud architecture, containers, orchestration, and agile/CI/CD practices is desirable.

Skill Set and Core Competencies

  • Demonstrated ability to deliver innovative solutions supporting business and service operations.
  • Excellent communication skills, with the ability to prepare and present clear, concise, and effective reports for senior management.
  • Highly numerate, with strong statistical analysis and system modelling techniques.
  • Experience in business and service capacity management, reliability engineering, and performance optimisation.
  • Comprehensive understanding of queueing theory and system modelling.
  • Collaborative, improvement-oriented mindset with a passion for data and technology.
  • Ability to work independently or as part of a team, taking pride in individual and team deliverables.
  • Flexible yet structured approach to problem-solving, with the ability to analyse complex problems and identify suitable solutions.
  • Well-organised, self-motivated, and enthusiastic about reliability and capacity management.

The LME is committed to creating a diverse environment and is proud to be an equal opportunity employer. In recruiting for our teams, we welcome the unique contributions that you can bring in terms of education, ethnicity, race, sex, gender identity, expression and reassignment, nation of origin, age, languages spoken, colour, religion, disability, sexual orientation and beliefs. In doing so, we want every LME employee to feel our commitment to showing respect for all and encouraging open collaboration and communication.

Top Skills

Athene
CI/CD
Datadog
Grafana
Kubernetes
MS SQL
OpenShift
Prometheus
Python
Terraform
Visual Basic
VMware

The Company
Hong Kong, Hong Kong
1,723 Employees
Year Founded: 2000

What We Do

HKEX Group is a global exchange group, operating dynamic and integrated financial markets in Asia and Europe.

From our home in the financial hub of Hong Kong and an additional base in London, we provide world-class facilities for trading and clearing securities and derivatives in Equities, Commodities, Fixed Income and Currency.

Uniquely positioned at the intersection of Chinese and international capital flows, Hong Kong has long been Connecting China with the World. With the accelerated opening-up of China’s capital markets, HKEX continues to be at the forefront of this historic transition, which we believe will Shape the Global Market Landscape.
