Platform Engineer

Posted 6 Days Ago
Hiring Remotely in United Kingdom
Remote
Mid level
Artificial Intelligence • Information Technology
The Role
The HPC Platform Engineer will manage AI platform operations, providing L1 and L2 support, coordinating with vendors, monitoring systems, and optimizing resource allocation.
Summary Generated by Built In

We are an emerging AI infrastructure start-up building next-generation data centres and high-performance compute environments to power AI, LLM training, and cloud-scale workloads, powered by renewable energy, rooted in sovereign capability, and designed to give enterprises and innovators the compute they need. Backed by leading investors, we are rapidly expanding our site development pipeline, engineering capabilities, and commercial partnerships.


We are looking for a Platform Engineer (HPC & AI) to help shape our new Platform team. This role is customer-facing and involves technical troubleshooting and collaboration with vendor engineering teams to ensure seamless AI platform operations.

 

Key Responsibilities:

  • Escalate complex (L3) issues to vendor product/engineering teams, coordinate their resolution, and manage vendor responses.
  • Monitor system health, alerts, and customer usage patterns.
  • Document solutions and workarounds, create and maintain knowledge base content, and document support procedures.
  • Automate common tasks and fixes. 
  • Configure and integrate tooling to support optimal operation of the platform, and support tool selection. 
  • Assist customers with platform configuration, onboarding, and usage best practices. 
  • Collaborate with platform and infrastructure support/engineering teams to resolve platform integration issues. 
  • Ensure SLAs and customer satisfaction targets are met.
  • Provide L1 support for customer-reported issues and requests.
  • Provide L2 support by diagnosing, replicating, and troubleshooting issues across platform and infrastructure.
  • Work with customers and multiple stakeholders to understand requirements and challenges, and provide reporting on usage, workflow, and billing.


Technical responsibilities:

  • Cluster infrastructure management: Manage the Nvidia GPU cluster.
  • High availability and resilience: Implement failover strategies and manage maintenance events to minimise downtime.
  • Resource allocation and optimisation: GPU resource partitioning, workload scheduling, and capacity planning.
  • Performance monitoring and troubleshooting: Performance analysis and real-time monitoring with available Nvidia and HPE tools.
  • Incident response: Handle node failures, network issues, and driver issues; troubleshoot common issues and work with vendor support to resolve critical ones.
  • Security and access control: Manage user permissions, RBAC, security hardening, data protection.  

   

Required Skills & Experience: 

  • Extensive experience in technical support, system engineering, or platform operations.
  • Solid understanding of L1 and L2 support processes (ticketing, escalation, troubleshooting).
  • Familiarity with cloud-based platforms, APIs, and distributed systems.
  • Understanding of AI/ML concepts and tooling (basics of model training, inference, and data pipelines).
  • Experience with monitoring/logging tools (e.g., Grafana, Kibana, Splunk). 
  • Excellent communication skills to interface with both customers and internal / vendor teams. 
  • Good understanding of the tooling requirements of ML engineers and data scientists, and how to optimize their experience.

  

Core Technical skills: 

  • System administration experience with operating systems such as RHEL/CentOS and Ubuntu, including Linux kernel tuning.
  • Proficiency with Ansible, Nvidia and CUDA toolkits, Kubernetes, and container orchestration.
  • Understanding of automation, monitoring, and security in a GPU-as-a-service environment.

   

Preferred experience:

  • Experience supporting HPE PCAI or other AI/HPC infrastructure and platforms. 
  • Experience with GPU resource allocation (across instances, GPU count, and time).
  • Advanced networking skills, including high-performance networking, troubleshooting, and fine-tuning.
  • Background in DevOps or SRE practices. 
  • ITIL familiarity. 

  

Success Metrics: 

  • Customers receive timely, effective support with minimal escalations. 
  • Issues are resolved or routed correctly with high-quality documentation. 
  • The platform maintains strong uptime and customer satisfaction. 

 

Why Join Carbon3.ai:

You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale.

Top Skills

Ansible
CUDA
Grafana
Kibana
Kubernetes
Nvidia GPU
RHEL/CentOS
Splunk
Ubuntu

The Company
HQ: Rugby
16 Employees

What We Do

Carbon3.ai is building the UK’s sovereign AI platform – secure, sustainable, and designed for real-world impact.

AI growth is creating new challenges, and compute power requirements are outpacing supply. At Carbon3.ai, we’re not just providing infrastructure; we’re building the foundations to overcome these challenges. We are an energy business transforming into the UK’s sovereign choice for AI. We are vertically integrated from soil to software, transforming legacy industrial sites into renewable-powered AI data hubs.

Designed, owned, and operated by Carbon3.ai, all infrastructure and data processing are located within the UK and fully subject to UK jurisdiction and regulatory oversight. We generate our own off-grid renewable power, providing low-cost, sustainable energy comparable to Nordic levels, making AI workloads both affordable and sustainable.

We own 50+ sites across the UK and are rapidly scaling them into AI data centres, enabling high-density, low-latency, sovereign AI deployment at national scale. Whether you're training models, deploying intelligent agents, or building industry-specific solutions, Carbon3.ai accelerates your journey from concept to production.

Backed by strategic partnerships with leading brands and robust investment, we’re building the infrastructure to power the UK’s most ambitious AI innovation – ensuring British enterprises can access world-class AI capabilities securely and sustainably.
