DevOps Engineer

Posted 5 Days Ago
Hiring Remotely in United Kingdom
Remote
Mid level
Artificial Intelligence • Information Technology
The Role
Design, operate, and improve Kubernetes-based AI infrastructure, manage GPU environments, ensure reliability, implement automation, and enhance customer experience.
Summary Generated by Built In

Era4 develops, owns, and operates AI infrastructure across the UK, powered by renewable energy. By converting legacy industrial and energy sites into modern data-centre facilities, Era4 combines brownfield regeneration with cleaner, efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public-sector organisations.


Role Summary:

We are seeking a DevOps Engineer to design, operate, and continuously improve our Kubernetes-based AI infrastructure. This role focuses on cloud-native platform engineering, GPU-accelerated workloads, reliability, automation, and customer enablement.

You will play a key role in delivering a production-grade AI platform that enables ML engineers, data scientists, and enterprise customers to build and run AI workloads at scale.

 

You will be responsible for the reliability, scalability, and performance of our Kubernetes-based GPU platforms. You will ensure our AI platform operates securely and efficiently while delivering an exceptional customer experience. This is a hands-on platform engineering position focused on systems reliability, automation, and continuous improvement.


Key Responsibilities:

 

Kubernetes Platform Operations:

  • Operate and evolve a production Kubernetes environment supporting GPU-accelerated AI workloads.
  • Manage cluster lifecycle (deployment, upgrades, scaling, resilience, multi-node operations).
  • Implement high availability, failover, and maintenance strategies to minimize disruption.
  • Enable as-a-service (aaS) capabilities and segmentation for multi-tenant workloads.
  • Manage infrastructure-as-code tooling and its lifecycle.
  • Operate network overlays and block, file, and object storage.
  • Work with Ansible, YAML, Terraform, Python, Jenkins, and GitOps.

 

GPU & AI Infrastructure Engineering:

  • Manage NVIDIA GPU infrastructure within Kubernetes (device plugins, drivers, CUDA compatibility).
  • Implement GPU partitioning and workload isolation strategies (e.g., MIG, quotas, namespaces).
  • Monitor and optimize GPU utilization, workload efficiency, and cluster capacity.
  • Support AI/ML training and inference workloads with performance tuning and best practices.

 

Reliability, Monitoring & Automation:

  • Design and maintain observability frameworks (metrics, logs, tracing).
  • Implement proactive monitoring, alerting, and capacity planning.
  • Lead incident response for platform-level events and drive root cause analysis.
  • Automate operational workflows and infrastructure provisioning (IaC, configuration management).
  • Contribute to platform reliability engineering practices (SLOs, SLAs, error budgets).

 

Security & Governance:

  • Implement RBAC, network policies, and security hardening.
  • Ensure secure multi-tenant workload isolation.
  • Maintain compliance, data protection, and access governance standards.

 

Customer & Platform Enablement:

  • Support the customer lifecycle: onboarding, provisioning, and operations.
  • Provide guidance on workload configuration, scaling strategies, and best practices.
  • Collaborate with engineering and vendor teams to resolve complex platform issues.
  • Produce high-quality technical documentation and operational playbooks. 


Required Experience & Skills:

  • Strong hands-on experience operating production Kubernetes clusters.
  • Experience with GPU-enabled Kubernetes environments.
  • Solid Linux system administration, networking, storage and security skills.
  • Experience with Infrastructure as Code and automation.
  • Strong understanding of distributed systems, APIs, and cloud-native architectures.
  • Experience implementing monitoring and observability solutions (e.g., Prometheus, Grafana).
  • Proven incident management and root cause analysis experience.
  • Strong communication skills and ability to work cross-functionally.

Desirable Experience:

  • Experience operating AI/HPC infrastructure.
  • Deep understanding of Kubernetes scheduling, networking, and storage.
  • Experience with high-performance datacentre networking and tuning.
  • Background in DevOps or Site Reliability Engineering (SRE).

 

Why Join Era4:

You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale.

 

Diversity & Inclusion:

Era4 is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

 

Top Skills

Ansible
GitOps
GPU
Grafana
Jenkins
Kubernetes
Prometheus
Python
Terraform
YAML

The Company
HQ: Rugby
16 Employees

What We Do

Carbon3.ai is building the UK’s sovereign AI platform – secure, sustainable, and designed for real-world impact.

AI growth demands are creating new challenges, and compute power requirements are outpacing supply. At Carbon3.ai, we’re not just providing infrastructure; we’re building the foundations to overcome these challenges. We are an energy business transforming into the UK’s sovereign choice for AI. We are vertically integrated from soil to software, transforming legacy industrial sites into renewable-powered AI data hubs.

Designed, owned, and operated by Carbon3.ai, all infrastructure and data processing are located within the UK and fully subject to UK jurisdiction and regulatory oversight. We generate our own off-grid renewable power, providing low-cost, sustainable energy comparable to Nordic levels, making AI workloads both affordable and sustainable.

We own 50+ sites across the UK and are rapidly scaling them into AI data centres, enabling high-density, low-latency, sovereign AI deployment at national scale. Whether you're training models, deploying intelligent agents, or building industry-specific solutions, Carbon3.ai accelerates your journey from concept to production.

Backed by strategic partnerships with leading brands and robust investment, we’re building the infrastructure to power the UK’s most ambitious AI innovation – ensuring British enterprises can access world-class AI capabilities securely and sustainably.
