Carbon3.ai

Site Reliability Engineer

Reposted 8 Days Ago

Hiring Remotely in Office, Machaze, Manica, MOZ

Remote

Senior level

Artificial Intelligence • Information Technology

The Role

Build AI-driven SRE tooling and agentic automation to triage, diagnose, and remediate infrastructure issues. Integrate LLM-powered agents with observability, ITSM, and infrastructure APIs, develop self-service tooling and ChatOps, tune event/alert intelligence, and convert runbooks into auditable executable automations while contributing to operational standards and incident learning.

Summary Generated by Built In

Era4 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. By converting legacy industrial and energy sites into modern data-centre facilities, Era4 combines brownfield regeneration with cleaner, efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public-sector organisations.

Role Summary:

We’re hiring SRE/Platform engineers with an automation bias to help build Era4’s operations capability from the ground up. You’ll turn runbooks, alerts and operational workflows into safe, auditable automation and internal tooling that improves reliability across our AI infrastructure and datacentre platform.

This is a Platform / SRE role with a software engineering focus, not an AI model-building role. You’ll work closely with operations, platform and engineering teams to reduce manual toil, improve alert quality, and speed up incident response.

Key Responsibilities:

Build Python-based automation for incident triage, runbook execution, and routine operational tasks.
Integrate observability, ITSM and infrastructure APIs to enrich alerts and automate workflows.
Improve monitoring signal quality through correlation, enrichment, suppression and deduplication.
Build internal tools and self-service capabilities such as CLI utilities, ChatOps integrations and dashboards.
Maintain version-controlled runbook-as-code and automation libraries.
Translate post-incident learnings into better tooling, automation and operational standards.
Support safe, auditable automation for higher-risk actions with appropriate approval controls.

Essential Experience:

Experience in SRE, Platform Engineering, or production infrastructure operations.
Hands-on experience with observability/monitoring tooling (for example Prometheus, Grafana or similar).
Experience with Kubernetes, Terraform or comparable cloud-native / IaC tooling.
Exposure to incident management / on-call and converting manual runbooks into automation.
Experience with APIs, webhooks, event-driven systems and operational integrations.
Experience with Python for automation, APIs and integrations.

Nice To Have:

GPU, datacentre or colocation infrastructure experience.
ITSM integrations (ServiceNow, Halo, Jira Service Management or similar).
ChatOps tooling (Slack or Microsoft Teams bots).
OpenTelemetry, logging or distributed tracing experience.
DCIM, IPAM or hypervisor-control-plane integrations.
Experience with LLM-assisted or agent-based operational automation.

Why Join Era4:

You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale.

Diversity & Inclusion:

Era4 is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

Skills Required

Strong Python development skills for automation, API integration, and data processing
Hands-on experience with observability and monitoring platforms (Prometheus, Grafana, Mimir or equivalent)
Experience integrating with ITSM platforms (ServiceNow, Halo, Jira Service Management or similar) via API
Understanding of event-driven architectures, message queues, and webhook-based automation patterns
Strong understanding of managing GPU infrastructure in production and automating GPU workflows
Familiarity with Infrastructure-as-Code principles and cloud-native environments (Kubernetes, Terraform or similar)
Experience building LLM-powered agents using frameworks such as LangChain, LlamaIndex, Anthropic SDK, OpenAI function calling or comparable tooling
Understanding of agentic design patterns, human-in-the-loop controls, and structured operational outputs
Comfort operating in an API-first environment, integrating agents with DCIM, IPAM, and hypervisor control planes
Prior experience in SRE, Senior Operations, or Platform Engineering with on-call and incident management exposure
Experience converting narrative runbooks into executable automation or codified decision trees
Understanding of ITIL-aligned incident and change management principles and ITSM tooling
Exposure to data centre or colocation operations, particularly high-density compute or GPU environments
Experience with ChatOps tooling and building Slack or Microsoft Teams bots
Familiarity with DCIM platforms and telemetry pipelines (power, thermal, network)
Knowledge of OpenTelemetry, distributed tracing, or log aggregation platforms (Loki, ELK, Splunk)
Contributions to open-source observability or automation tooling
Experience in a start-up or scale-up environment building tooling from scratch

The Company

HQ: Mereworth

16 Employees

What We Do

Carbon3.ai is building the UK’s sovereign AI platform – secure, sustainable, and designed for real-world impact. AI growth demands are creating new challenges, and compute power requirements are outpacing supply. At Carbon3.ai, we’re not just providing infrastructure; we’re building the foundations to overcome these challenges. We are an energy business transforming into the UK’s sovereign choice for AI. We are vertically integrated from soil to software, transforming legacy industrial sites into renewable-powered AI data hubs. Designed, owned, and operated by Carbon3.ai, all infrastructure and data processing are located within the UK and fully subject to UK jurisdiction and regulatory oversight. We generate our own off-grid renewable power, providing low-cost, sustainable energy comparable to Nordic levels, making AI workloads both affordable and sustainable. We own 50+ sites across the UK and are rapidly scaling them into AI data centres, enabling high-density, low-latency, sovereign AI deployment at national scale. Whether you're training models, deploying intelligent agents, or building industry-specific solutions, Carbon3.ai accelerates your journey from concept to production. Backed by strategic partnerships with leading brands and robust investment, we’re building the infrastructure to power the UK’s most ambitious AI innovation – ensuring British enterprises can access world-class AI capabilities securely and sustainably.