Site Reliability Engineer

2 Locations
In-Office or Remote
Senior level
Artificial Intelligence • Information Technology
The Role
Build AI-driven SRE tooling and agentic automation to triage, diagnose, and remediate infrastructure issues. Integrate LLM-powered agents with observability, ITSM, and infrastructure APIs, develop self-service tooling and ChatOps, tune event/alert intelligence, and convert runbooks into auditable executable automations while contributing to operational standards and incident learning.

Era4 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. By converting legacy industrial and energy sites into modern data-centre facilities, Era4 combines brownfield regeneration with cleaner, efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public-sector organisations.


Role Summary:

We are seeking Automation Engineers who sit at the intersection of Site Reliability Engineering and modern AI-driven operations. Embedded within Era4's engineering-led Operations Centre, this role exists to build a modern AI Platform Operations function from scratch, designing tooling and agentic workflows. There is no legacy to deal with.

 

Key Responsibilities:

 

Runbook Automation & Agent Development:

  • Build agentic, executable workflows capable of triaging, diagnosing, and, where appropriate, autonomously remediating known failure patterns.
  • Build and maintain LLM-backed agents targeting the observability stack, ITSM platform, and infrastructure APIs (e.g. DCIM, IPAM, hypervisor layers).
  • Develop auditable, client-focused automations for client interactions and workflows, with appropriate controls.
  • Develop safe, auditable automation with appropriate controls for higher-risk platform actions.
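
As an illustration of the "auditable automation with appropriate controls" described above, here is a minimal Python sketch of a runbook runner that asks for approval before any higher-risk step and records every outcome. The names `Step`, `run_runbook`, `approve`, and `audit_log` are hypothetical and chosen for this example only; they are not Era4 tooling.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One runbook step (hypothetical structure, for illustration)."""
    name: str
    action: Callable[[], str]  # performs the step and returns a short result
    high_risk: bool = False    # high-risk steps need explicit approval


def run_runbook(steps, approve, audit_log):
    """Run steps in order, appending an audit entry for each one.

    `approve` is a callable taking the step name and returning True/False,
    standing in for a human-in-the-loop check (e.g. a ChatOps prompt).
    Execution stops at the first high-risk step that is not approved.
    """
    for step in steps:
        if step.high_risk and not approve(step.name):
            audit_log.append({"step": step.name, "status": "denied"})
            break
        result = step.action()
        audit_log.append({"step": step.name, "status": "done", "detail": result})
    return audit_log


runbook = [
    Step("check_node_health", lambda: "node unhealthy"),
    Step("reboot_node", lambda: "reboot issued", high_risk=True),
    Step("verify_recovery", lambda: "node healthy"),
]

# Approval is denied here, so the reboot and everything after it are skipped.
denied_log = run_runbook(runbook, approve=lambda name: False, audit_log=[])
approved_log = run_runbook(runbook, approve=lambda name: True, audit_log=[])
```

In this sketch the audit log is a plain list of dictionaries; a real system would more likely write to the ITSM platform or an append-only store.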

 

Operational Tooling & Self-Service Enablement:

  • Build internal tooling that empowers engineers and service desk analysts: CLI utilities, ChatOps integrations (Slack/Teams bots), status dashboards, and self-service automation hooks.
  • Reduce dependency on DevSecOps and engineering teams for routine operational tasks through automation.
  • Maintain and contribute to a library of automation assets, agent prompts, and runbook-as-code artefacts, version-controlled and peer-reviewed.

 

Event & Alert Intelligence:

  • Develop the automation layer around monitoring and event management: alert suppression logic, enrichment pipelines, correlation rules, and alert-to-ticket integrations.
  • Continuously tune signal-to-noise ratios across monitoring tooling (Prometheus, Mimir, Grafana, or equivalent) to improve situational awareness.
  • Design and implement event correlation and deduplication logic to reduce alert storms and improve incident context.
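
To give a sense of the deduplication logic mentioned above, the following Python sketch suppresses repeat alerts that share a fingerprint within a time window. It assumes a fingerprint of alert name plus host and a five-minute window; the `Alert` and `Deduplicator` names, the fields, and the window length are illustrative assumptions, not a description of any particular monitoring product.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """Minimal alert record (hypothetical fields, for illustration)."""
    name: str
    host: str
    severity: str
    timestamp: float  # seconds since epoch


@dataclass
class Deduplicator:
    """Forward an alert only if its fingerprint has been quiet for a window."""
    window_seconds: float = 300.0
    _last_seen: dict = field(default_factory=dict, repr=False)

    def fingerprint(self, alert: Alert) -> tuple:
        # Assumption: alerts with the same name on the same host are duplicates.
        return (alert.name, alert.host)

    def should_forward(self, alert: Alert) -> bool:
        key = self.fingerprint(alert)
        last = self._last_seen.get(key)
        # Record every occurrence, so a continuous storm stays suppressed
        # until the fingerprint has been silent for the whole window.
        self._last_seen[key] = alert.timestamp
        return last is None or alert.timestamp - last >= self.window_seconds


dedup = Deduplicator(window_seconds=300)
decisions = [
    dedup.should_forward(Alert("GPUTempHigh", "node-1", "critical", 0)),
    dedup.should_forward(Alert("GPUTempHigh", "node-1", "critical", 60)),
    dedup.should_forward(Alert("GPUTempHigh", "node-2", "critical", 60)),
    dedup.should_forward(Alert("GPUTempHigh", "node-1", "critical", 400)),
]
```

The first alert and the alert from a different host are forwarded, the repeat after 60 seconds is suppressed, and the final alert is forwarded again because 340 seconds have passed since the last occurrence on that host.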

 

Continuous Improvement & Knowledge Capture:

  • Identify common operational patterns and tasks as candidates for automation; maintain and prioritise a toil reduction backlog.
  • Participate in post-incident reviews and translate findings into updated automation, runbooks, or agent logic.
  • Contribute to the evolution of Era4's operational standards, tooling architecture, and agent framework.

 

Essential Experience:

 

Operational:

  • Prior experience in an SRE, Senior Operations, or Platform Engineering environment, with exposure to on-call operations and incident management processes.
  • Experience in converting narrative runbooks into executable automation or codified decision trees.
  • Understanding of ITIL-aligned incident and change management principles and ITSM tooling.


Technical – Core Element:

  • Strong Python development skills, including scripting for automation, API integration, and data processing.
  • Hands-on experience with observability and monitoring platforms: Prometheus, Grafana, Mimir, or equivalent.
  • Experience integrating with ITSM platforms (ServiceNow, Halo, Jira Service Management, or similar) via API.
  • Solid understanding of event-driven architectures, message queues, and webhook-based automation patterns.
  • Strong understanding of managing GPU infrastructure in production, including its key signals and metrics and the automation of GPU workflows.
  • Familiarity with Infrastructure-as-Code principles and cloud-native environments (Kubernetes, Terraform, or similar). 
  • Comfort operating in an API-first environment, integrating agents with infrastructure APIs, DCIM, IPAM, and hypervisor control planes.

 

One or more would be an advantage:

  • Exposure to data centre or colocation operations, particularly high-density compute or GPU infrastructure environments.
  • Experience with ChatOps tooling: building Slack or Microsoft Teams bots for operational workflows.
  • Familiarity with DCIM platforms and telemetry pipelines (power, thermal, network).
  • Knowledge of OpenTelemetry, distributed tracing, or log aggregation platforms (Loki, ELK, Splunk).
  • Contributions to open-source observability or automation tooling.
  • Experience in a start-up or scale-up environment where tooling is built from scratch.

 

Why Join Era4:

You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale.

 

Diversity & Inclusion

Era4 is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. 

 

Skills Required

  • Strong Python development skills for automation, API integration, and data processing
  • Hands-on experience with observability and monitoring platforms (Prometheus, Grafana, Mimir or equivalent)
  • Experience integrating with ITSM platforms (ServiceNow, Halo, Jira Service Management or similar) via API
  • Understanding of event-driven architectures, message queues, and webhook-based automation patterns
  • Strong understanding of managing GPU infrastructure in production and automating GPU workflows
  • Familiarity with Infrastructure-as-Code principles and cloud-native environments (Kubernetes, Terraform or similar)
  • Experience building LLM-powered agents using frameworks such as LangChain, LlamaIndex, Anthropic SDK, OpenAI function calling or comparable tooling
  • Understanding of agentic design patterns, human-in-the-loop controls, and structured operational outputs
  • Comfort operating in an API-first environment, integrating agents with DCIM, IPAM, and hypervisor control planes
  • Prior experience in SRE, Senior Operations, or Platform Engineering with on-call and incident management exposure
  • Experience converting narrative runbooks into executable automation or codified decision trees
  • Understanding of ITIL-aligned incident and change management principles and ITSM tooling
  • Exposure to data centre or colocation operations, particularly high-density compute or GPU environments
  • Experience with ChatOps tooling and building Slack or Microsoft Teams bots
  • Familiarity with DCIM platforms and telemetry pipelines (power, thermal, network)
  • Knowledge of OpenTelemetry, distributed tracing, or log aggregation platforms (Loki, ELK, Splunk)
  • Contributions to open-source observability or automation tooling
  • Experience in a start-up or scale-up environment building tooling from scratch

The Company
HQ: Rugby
16 Employees

What We Do

Carbon3.ai is building the UK’s sovereign AI platform: secure, sustainable, and designed for real-world impact. Demand for AI is creating new challenges, and compute power requirements are outpacing supply. At Carbon3.ai, we’re not just providing infrastructure, we’re building the foundations to overcome these challenges. We are an energy business transforming into the UK’s sovereign choice for AI. We are vertically integrated from soil to software, transforming legacy industrial sites into renewable-powered AI data hubs. Designed, owned, and operated by Carbon3.ai, all infrastructure and data processing are located within the UK and fully subject to UK jurisdiction and regulatory oversight. We generate our own off-grid renewable power, providing low-cost, sustainable energy comparable to Nordic levels, making AI workloads both affordable and sustainable. We own 50+ sites across the UK and are rapidly scaling them into AI data centres, enabling high-density, low-latency, sovereign AI deployment at national scale. Whether you're training models, deploying intelligent agents, or building industry-specific solutions, Carbon3.ai accelerates your journey from concept to production. Backed by strategic partnerships with leading brands and robust investment, we’re building the infrastructure to power the UK’s most ambitious AI innovation, ensuring British enterprises can access world-class AI capabilities securely and sustainably.
