Platform Engineer - Observability (Contract)

Posted 8 Days Ago
Hiring Remotely in United Kingdom
Remote
Senior level
Artificial Intelligence • Information Technology
The Role
Design, implement, and operate a multi-site, multi-tenant observability platform using the Grafana stack and related tooling. Configure telemetry ingestion, dashboards, alerting, SLOs, and automation; ensure scalability, tenant isolation, long-term storage and DR; integrate telemetry sources and collaborate with application teams on onboarding and tracing.
Summary Generated by Built In

Era4 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. By converting legacy industrial and energy sites into modern data-centre facilities, Era4 combines brownfield regeneration with cleaner, more efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public-sector organisations.


Initial 6-month contract.

June start date.

Competitive day rate.


Key Responsibilities:


Observability Platform Implementation:

  • Deliver the implementation of Era4's observability platform based on Grafana Mimir, Loki, Tempo, Grafana Alloy and Grafana Enterprise tooling.
  • Design and implement highly available observability services across multiple co-location and production sites.
  • Configure telemetry ingestion pipelines for metrics, logs, and future distributed tracing workloads.
  • Develop and maintain observability architecture documentation, high-level designs, low-level designs, and operational runbooks.
  • Define platform standards for telemetry collection, labelling, metadata enrichment, retention policies, and data governance.
  • Implement multi-tenant observability controls and tenant isolation strategies.
  • Configure and maintain object-storage-backed telemetry platforms for long-term retention and scalability.

 

Telemetry Collection & Integration:

  • Deploy and manage Grafana Alloy collectors across Kubernetes clusters, Linux hosts, network infrastructure, storage platforms, and hardware management systems.
  • Integrate telemetry from Kubernetes, GPU infrastructure, HPE hardware, storage platforms, network devices, and cloud-native services.
  • Develop and maintain observability integrations using OpenTelemetry standards and protocols.
  • Establish onboarding processes for new platforms, applications, and infrastructure services.
  • Collaborate with application teams to define observability requirements and future tracing adoption strategies.

 

Alerting & Operational Insights:

  • Design and implement alerting frameworks using recording rules, Alertmanager, and operational best practices.
  • Develop operational dashboards and service health views for infrastructure, platform, and application services.
  • Support integration of observability events with ITSM and incident-management platforms.
  • Define SLIs, SLOs, alert thresholds, and operational KPIs.
  • Continuously improve platform observability, incident detection, and root-cause analysis capabilities.


Reliability & Automation:

  • Implement Infrastructure-as-Code and GitOps practices for observability platform deployment and configuration management.
  • Develop automation for dashboard provisioning, alert deployment, tenant onboarding, and telemetry configuration.
  • Design and validate disaster recovery, resilience, and failover capabilities across observability services.
  • Contribute to platform security, compliance, and operational governance initiatives.
  • Work with operational teams to ensure observability services remain reliable, scalable, and maintainable.

 

Required Experience & Skills:

  • Significant experience implementing and operating enterprise observability or monitoring platforms.
  • Strong understanding of metrics, logs, traces, OpenTelemetry, and modern observability principles.
  • Experience with Grafana ecosystem technologies including Grafana, Prometheus, Grafana Mimir, Grafana Loki, Grafana Tempo, and Grafana Alloy.
  • Experience designing Kubernetes-native solutions and operating distributed platforms at scale.
  • Knowledge of Linux systems administration and cloud-native infrastructure.
  • Experience implementing Infrastructure-as-Code and GitOps approaches (preferably including Ansible).
  • Skilled in developing automation and operational tooling using Python and/or Go.
  • Previous exposure to creating technical architecture, operational documentation, and deployment designs.
  • Experience with object storage technologies and distributed data platforms.
  • Strong understanding of monitoring, alerting, and operational event management.

 

One or more of the following would be advantageous:

  • Implemented OpenTelemetry-based observability solutions.
  • Operated observability platforms in service-provider, cloud, or large-scale enterprise environments.
  • Supported GPU, AI/ML, or high-performance computing environments.
  • Integrated observability platforms with ITSM solutions.
  • Worked with multi-tenant platform architectures.
  • Knowledge of networking, storage, and data-centre infrastructure monitoring.
  • Understanding of distributed tracing and application performance monitoring.


Why Join Era4:

You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale.

 

Diversity & Inclusion:

Era4 is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

Skills Required

  • Significant experience implementing and operating enterprise observability or monitoring platforms
  • Strong understanding of metrics, logs, traces, OpenTelemetry, and modern observability principles
  • Experience with Grafana ecosystem technologies (Grafana, Prometheus, Grafana Mimir, Loki, Tempo, Grafana Alloy, Grafana Enterprise)
  • Experience designing Kubernetes-native solutions and operating distributed platforms at scale
  • Knowledge of Linux systems administration and cloud-native infrastructure
  • Experience implementing Infrastructure-as-Code and GitOps approaches
  • Experience with Ansible
  • Skilled in developing automation and operational tooling using Python and/or Go
  • Experience with object storage technologies and distributed data platforms
  • Experience designing alerting frameworks, SLIs/SLOs, and operational dashboards
  • Ability to produce technical architecture, high/low level designs, runbooks, and operational documentation

The Company
HQ: Rugby
16 Employees

What We Do

Carbon3.ai is building the UK’s sovereign AI platform: secure, sustainable, and designed for real-world impact. Growing AI demand is creating new challenges, and compute power requirements are outpacing supply. At Carbon3.ai, we’re not just providing infrastructure; we’re building the foundations to overcome these challenges.

We are an energy business transforming into the UK’s sovereign choice for AI. We are vertically integrated from soil to software, transforming legacy industrial sites into renewable-powered AI data hubs. Designed, owned, and operated by Carbon3.ai, all infrastructure and data processing are located within the UK and fully subject to UK jurisdiction and regulatory oversight. We generate our own off-grid renewable power, providing low-cost, sustainable energy comparable to Nordic levels, making AI workloads both affordable and sustainable. We own 50+ sites across the UK and are rapidly scaling them into AI data centres, enabling high-density, low-latency, sovereign AI deployment at national scale.

Whether you're training models, deploying intelligent agents, or building industry-specific solutions, Carbon3.ai accelerates your journey from concept to production. Backed by strategic partnerships with leading brands and robust investment, we’re building the infrastructure to power the UK’s most ambitious AI innovation, ensuring British enterprises can access world-class AI capabilities securely and sustainably.
