TWG Global

Platform / Site Reliability Engineer (UK)

London, Greater London, England, GBR

In-Office

95K Annually

Mid level

Angel or VC Firm • Artificial Intelligence • Fintech • Software • Financial Services

The Role

Maintain and scale data and ML infrastructure, build CI/CD pipelines, implement observability, ensure high availability and disaster recovery, manage access and secrets, troubleshoot incidents, and provide 24/7 coverage across time zones.

Summary Generated by Built In

The Organization

At TWG Group Holdings, LLC (“TWG Global”), we drive innovation and business transformation across a range of industries—including financial services, insurance, technology, media, and sports—by leveraging data and AI as core assets. Our AI-first, cloud-native approach delivers real-time intelligence and interactive business applications, empowering informed decision-making for both customers and employees.

We prioritize responsible data and AI practices to ensure ethical standards and regulatory compliance. Our decentralized structure enables each business unit to operate autonomously, supported by a central AI Solutions Group, while strategic partnerships with leading data and AI vendors fuel game-changing efforts in marketing, operations, and product development.

You will collaborate with management to advance our data and analytics transformation, enhance productivity, and enable agile, data-driven decisions. By leveraging relationships with top tech startups and universities, you will help create competitive advantages and drive enterprise innovation.

At TWG Global, your contributions will support our goal of sustained growth and superior returns, as we deliver rare value and impact across our businesses. We’re a fast-growing AI/ML team delivering high-impact use case solutions to financial institutions, insurers, and other regulated enterprises. Backed by proven leaders in finance and national security, our team is scaling rapidly to serve clients across North America with robust, secure, and production-grade AI solutions.

The Role

We are seeking a Platform / Site Reliability Engineer (SRE) to ensure the scalability, stability, and performance of our data platforms and ML infrastructure. You’ll work closely with data scientists, ML engineers, and platform vendors to deploy and monitor production systems, automate workflows, and reduce operational overhead.

Key Responsibilities:

Build and maintain infrastructure to support real-time and batch ML workloads
Implement observability tools (logging, monitoring, alerting) for model performance and system uptime
Design and manage CI/CD pipelines for applications
Ensure high availability, disaster recovery, and rollback capabilities for production environments
Manage access controls, secrets, and security policies in collaboration with compliance and IT
Troubleshoot incidents, lead postmortems, and drive root-cause resolution
Work with U.S. and international teams to provide 24/7 coverage across time zones

Requirements

Qualifications:

3–6 years of experience in DevOps, SRE, or backend engineering roles
Proficient with tools like Docker, Kubernetes, Terraform, GitLab/GitHub Actions, Airflow
Strong scripting in Python or Bash and familiarity with Linux environments
Knowledge of observability stacks (e.g., Prometheus, Grafana, ELK, Datadog)
Familiarity with cloud platforms (e.g., AWS, GCP, or Azure)
Strong documentation, problem-solving, and incident response skills

Preferred Qualifications:

Experience supporting ML/AI workflows using Palantir Foundry
Exposure to compliance frameworks like SOC 2, ISO 27001, or financial regulations
Knowledge of MLOps frameworks (e.g., MLflow, Kubeflow, SageMaker Pipelines)
Ability to automate deployments, testing, and monitoring at scale

Benefits

Work on real-world AI applications with high-impact clients
Collaborate with world-class data scientists, engineers, and product leaders
Flat org structure, high trust, high autonomy
Competitive salary + performance-based incentives

Position Location

This is a remote position, but candidates must be currently based in the UK.

Compensation

The target salary for this position is £94,500. A bonus will be included in the compensation package, in addition to the full range of medical, financial, and other benefits.

TWG is an equal opportunity employer. All applicants will be considered for employment without attention to age, race, color, religion, sex, sexual orientation, gender identity, national origin, veteran or disability status.

The Company

54 Employees

What We Do

TWG Global is a unique holding company, strategically investing in and operating businesses across Investment Management, Securities, AI & Technology, Finance & Corporate Lending, Merchant Banking & Private Investments, and Sports, Media & Entertainment. With a diversified portfolio and a proven track record of success, we deliver transformative value through innovation, operational excellence, and disruptive thinking. We empower our portfolio companies to achieve exceptional growth and redefine their industries.