Carbon3.ai

Platform Engineer - Observability (Contract)

Posted 8 Days Ago

Hiring Remotely in United Kingdom

Remote

Senior level

Artificial Intelligence • Information Technology

The Role

Design, implement, and operate a multi-site, multi-tenant observability platform using the Grafana stack and related tooling. Configure telemetry ingestion, dashboards, alerting, SLOs, and automation; ensure scalability, tenant isolation, long-term storage and DR; integrate telemetry sources and collaborate with application teams on onboarding and tracing.

Summary Generated by Built In

Era4 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. By converting legacy industrial and energy sites into modern data-centre facilities, Era4 combines brownfield regeneration with cleaner, more efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public-sector organisations.

Initial 6-month contract.

June start date.

Competitive day rate.

Key Responsibilities:

Observability Platform Implementation:

Deliver the implementation of Era4's observability platform based on Grafana Mimir, Loki, Tempo, Grafana Alloy and Grafana Enterprise tooling.
Design and implement highly available observability services across multiple co-location and production sites.
Configure telemetry ingestion pipelines for metrics, logs, and future distributed tracing workloads.
Develop and maintain observability architecture documentation, high-level designs, low-level designs, and operational runbooks.
Define platform standards for telemetry collection, labelling, metadata enrichment, retention policies, and data governance.
Implement multi-tenant observability controls and tenant isolation strategies.
Configure and maintain object-storage-backed telemetry platforms for long-term retention and scalability.

Telemetry Collection & Integration:

Deploy and manage Grafana Alloy collectors across Kubernetes clusters, Linux hosts, network infrastructure, storage platforms, and hardware management systems.
Integrate telemetry from Kubernetes, GPU infrastructure, HPE hardware, storage platforms, network devices, and cloud-native services.
Develop and maintain observability integrations using OpenTelemetry standards and protocols.
Establish onboarding processes for new platforms, applications, and infrastructure services.
Collaborate with application teams to define observability requirements and future tracing adoption strategies.

Alerting & Operational Insights:

Design and implement alerting frameworks using recording rules, Alertmanager, and operational best practices.
Develop operational dashboards and service health views for infrastructure, platform, and application services.
Support integration of observability events with ITSM and incident-management platforms.
Define SLIs, SLOs, alert thresholds, and operational KPIs.
Continuously improve platform observability, incident detection, and root-cause analysis capabilities.

Reliability & Automation:

Implement Infrastructure-as-Code and GitOps practices for observability platform deployment and configuration management.
Develop automation for dashboard provisioning, alert deployment, tenant onboarding, and telemetry configuration.
Design and validate disaster recovery, resilience, and failover capabilities across observability services.
Contribute to platform security, compliance, and operational governance initiatives.
Work with operational teams to ensure observability services remain reliable, scalable, and maintainable.

Required Experience & Skills:

Significant experience implementing and operating enterprise observability or monitoring platforms.
Strong understanding of metrics, logs, traces, OpenTelemetry, and modern observability principles.
Experience with Grafana ecosystem technologies including Grafana, Prometheus, Grafana Mimir, Grafana Loki, Grafana Tempo, and Grafana Alloy.
Experience designing Kubernetes-native solutions and operating distributed platforms at scale.
Knowledge of Linux systems administration and cloud-native infrastructure.
Experience implementing Infrastructure-as-Code and GitOps approaches (preferably including Ansible).
Skilled in developing automation and operational tooling using Python and/or Go.
Previous exposure to creating technical architecture, operational documentation, and deployment designs.
Experience with object storage technologies and distributed data platforms.
Strong understanding of monitoring, alerting, and operational event management.

One or more of the following would be advantageous:

Experience implementing OpenTelemetry-based observability solutions.
Experience operating observability platforms in service-provider, cloud, or large-scale enterprise environments.
Experience supporting GPU, AI/ML, or high-performance computing environments.
Experience integrating observability platforms with ITSM solutions.
Experience with multi-tenant platform architectures.
Knowledge of networking, storage, and data-centre infrastructure monitoring.
Understanding of distributed tracing and application performance monitoring.

Why Join Era4:

You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale.

Diversity & Inclusion:

Era4 is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

Skills Required

Significant experience implementing and operating enterprise observability or monitoring platforms
Strong understanding of metrics, logs, traces, OpenTelemetry, and modern observability principles
Experience with Grafana ecosystem technologies (Grafana, Prometheus, Grafana Mimir, Loki, Tempo, Grafana Alloy, Grafana Enterprise)
Experience designing Kubernetes-native solutions and operating distributed platforms at scale
Knowledge of Linux systems administration and cloud-native infrastructure
Experience implementing Infrastructure-as-Code and GitOps approaches
Experience with Ansible
Skilled in developing automation and operational tooling using Python and/or Go
Experience with object storage technologies and distributed data platforms
Experience designing alerting frameworks, SLIs/SLOs, and operational dashboards
Ability to produce technical architecture, high/low level designs, runbooks, and operational documentation

The Company

HQ: Rugby

16 Employees

What We Do

Carbon3.ai is building the UK’s sovereign AI platform: secure, sustainable, and designed for real-world impact. Growing demand for AI is creating new challenges, and compute power requirements are outpacing supply. At Carbon3.ai, we’re not just providing infrastructure; we’re building the foundations to overcome these challenges. We are an energy business transforming into the UK’s sovereign choice for AI. We are vertically integrated from soil to software, transforming legacy industrial sites into renewable-powered AI data hubs. Designed, owned, and operated by Carbon3.ai, all infrastructure and data processing are located within the UK and fully subject to UK jurisdiction and regulatory oversight. We generate our own off-grid renewable power, providing low-cost, sustainable energy comparable to Nordic levels, making AI workloads both affordable and sustainable. We own 50+ sites across the UK and are rapidly scaling them into AI data centres, enabling high-density, low-latency, sovereign AI deployment at national scale. Whether you're training models, deploying intelligent agents, or building industry-specific solutions, Carbon3.ai accelerates your journey from concept to production. Backed by strategic partnerships with leading brands and robust investment, we’re building the infrastructure to power the UK’s most ambitious AI innovation, ensuring British enterprises can access world-class AI capabilities securely and sustainably.