Senior ML Infrastructure Engineer

Posted 5 Days Ago
Oxford, Oxfordshire, England, GBR
In-Office
Senior level
Artificial Intelligence • Healthtech • Social Impact • Biotech
The Role
As a Senior ML Infrastructure Engineer, you will design, build, and optimize high-performance ML compute clusters, focusing on capacity planning and resource management for ML experimentation.
Summary Generated by Built In

At the Ellison Institute of Technology (EIT), we’re on a mission to translate scientific discovery into real-world impact. We bring together visionary scientists, technologists, policy makers, and entrepreneurs to tackle humanity’s greatest challenges in four transformative areas:

  • Health, Medical Science & Generative Biology
  • Food Security & Sustainable Agriculture
  • Climate Change & Managing CO₂
  • Artificial Intelligence & Robotics

This is ambitious work that demands curiosity, courage, and a relentless drive to make a difference. At EIT, you’ll join a community built on excellence, innovation, tenacity, trust, and collaboration, where bold ideas become real-world breakthroughs. Together, we push boundaries, embrace complexity, and create solutions to scale ideas from lab to society. Explore more at www.eit.org.


Requirements

Our MLOps team

Join our MLOps team to build the cloud and compute foundation that enables scientific breakthroughs. Deliver reliable, secure platforms and self-service guardrails that accelerate experimentation and turn ideas into results—faster, at scale, and with confidence. 

Day-to-day, you might:

  • Build, operate, and continuously optimise our high-performance GPU training and inference clusters, focusing on robust, high-availability scheduling, isolation, and automated lifecycle management. 
  • Drive systems design and implementation for high-throughput data paths, optimising I/O, caching, and data locality across compute and storage (including our current Lustre implementation). 
  • Proactively benchmark, profile, and resolve performance bottlenecks across the compute, network, and orchestration layers to maximise efficiency for distributed training and inference. 
  • Establish comprehensive observability, resilience, and automated security controls to ensure compliance and robust operation of sensitive research environments. 
  • Partner with Research, Data, and Applied teams to forecast capacity and cost for GPU and storage needs, setting quotas and streamlining ML experimentation pipelines. 

What makes you a great fit:

  • Proven experience leading the design, build, and operation of high-performance ML compute clusters at scale 
  • A proactive, autonomous approach to systems design and the proven ability and desire to ideate, co-create and implement optimal solutions 
  • Exposure to migrating or transforming ML infrastructure from traditional schedulers to modern, containerised systems 
  • Expertise with high-throughput storage systems for ML/HPC workloads 
  • Expert-level understanding of GPU architecture, high-speed networking for distributed training, and performance profiling to resolve bottlenecks 
  • A solid grasp of IaC and CI/CD practices (e.g., Terraform, Argo CD)

Benefits

We offer the following salary and benefits:

  • Enhanced holiday pay
  • Pension
  • Life Assurance
  • Income Protection
  • Private Medical Insurance
  • Hospital Cash Plan
  • Therapy Services
  • Perk Box
  • Electric Car Scheme

--

Why work for EIT:

At the Ellison Institute, we believe a collaborative, inclusive team is key to our success. We are building a supportive environment where creative risks are encouraged, and everyone feels heard. Valuing emotional intelligence, empathy, respect, and resilience, we encourage people to be curious and to have a shared commitment to excellence. Join us and make an impact!

Skills Required

  • Proven experience leading the design, build, and operation of high-performance ML compute clusters at scale
  • Expert-level understanding of GPU architecture and high-speed networking for distributed training
  • Solid grasp of Infrastructure as Code (IaC) and CI/CD practices
  • Proactive approach to systems design and implementation
  • Exposure to modern containerised systems for ML infrastructure

The Company
0 Employees
Year Founded: 2023

What We Do

The Ellison Institute of Technology aims to discover, develop, and deploy science and technology to solve humanity's most important problems, focusing on health and medical science, food security, climate change, and AI-driven government innovation.
