Site Reliability Engineering Technical Lead

Posted 9 Days Ago
Dublin, IRL
In-Office
Senior level
Artificial Intelligence • Cloud • Mobile
The Role
Lead SRE/DevOps efforts to ensure reliability, scalability, and security across multi-cloud environments. Define SLIs/SLOs, lead incident response and postmortems, evolve observability (Prometheus/Grafana/OpenTelemetry), drive automation to reduce toil, optimize cost and performance, apply AI/LLM for ops, and provide architectural oversight and mentoring.
Summary Generated by Built In

Sustainability that means business

Who we are:

Sustainability software specialist AMCS is headquartered in Ireland, with offices in Europe, the USA, and Australasia. With over 1,300 highly skilled employees across 22 countries, we specialize in delivering technology solutions to facilitate a carbon-neutral future.

What we do:

Our innovative SaaS solutions increase efficiency and boost sustainability in resource-intensive industries. Over 5,000 customers across 23 countries already benefit from our Performance Sustainability software, ensuring we deliver practical solutions for improved profitability and environmental resilience across the globe.

Our people

AMCS offers team members more than just a job: it offers an opportunity to map out a career with a company that is growing, evolving, and setting out new ways of working that have a positive impact on the world around us. AMCS was established in Ireland and holds onto those local roots and ‘start-up’ mentality with a culture of connection. That connection to our work, our customers, our colleagues, and our community creates a working environment that fosters openness, collaboration, and creativity.

Job Description:

We are seeking a highly skilled and motivated DevOps/SRE Tech Lead to join our dynamic engineering team. The ideal candidate will have a deep understanding of cloud technologies, a strong technical background, and a passion for driving operational excellence. As a Tech Lead, you will mentor and guide our DevOps engineers and participate in architectural and key decision-making forums on our infrastructure and application development processes, keeping the focus on the reliability of our systems and a positive customer experience. You will collaborate with cross-functional teams to ensure the reliability, scalability, and security of our systems and infrastructure.

Key Responsibilities:

  • Build SLIs, SLOs, and SLAs: Partner with development and business teams to define indicators and objectives that reflect real customer experience.

  • Incident Response: Lead through complex incidents and continuously improve how quickly we detect, diagnose, and resolve issues — sharpening alerting, tooling, and on-call practices to shorten MTTD and MTTR over time.

  • Evolve Monitoring and Observability Stack: Continuously improve the observability stack (Prometheus, Grafana, Mimir, Loki, Tempo, OpenTelemetry) through a customer-centric lens so that our operations become more effective.

  • Drive RCAs and Postmortems: Run blameless root cause analyses and postmortems that turn incidents into durable improvements, closing the loop between development and operations.

  • High Availability & Performance: Ensure platform availability and responsiveness meet customer expectations. Identify and remove performance bottlenecks before they impact customers.

  • AI for Operations: Apply AI/LLM capabilities to incident triage, log/trace analysis, runbook execution, and anomaly detection to shorten MTTR and reduce on-call load.

  • Optimization for Cost: Right-size workloads, eliminate waste, and design for cost-efficient scaling across our cloud platforms (Azure, AWS, GCP) and container infrastructure (Docker, Kubernetes).

  • Toil Reduction: Build automated processes to reduce toil within SRE, such as remediation for known failure modes so the platform heals itself where possible, escalating to humans only when judgement is genuinely required.

  • Architectural Oversight: Participate in architectural design and decision-making processes, ensuring that design choices align with organizational goals and best practices.

What Success Looks Like:

  • High-Signal Alerting: Alerts are accurate and actionable — when something fires, it matters, and the team trusts it. Noise is actively driven down rather than tolerated.

  • Fewer Production Incidents: The number and severity of customer-impacting incidents trend down over time, as recurring failure modes are addressed at the root rather than worked around.

  • Tight Product–SRE Feedback Loop: Continuous, two-way feedback between product engineering and SRE — reliability concerns shape what gets built, and operational learnings flow back into product decisions.

  • Reduced Toil: Engineers spend less time on repetitive operational work and more time on improvements that compound — measured by what gets automated, eliminated, or self-healed away.

Qualifications:

  • Education: Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).

  • Experience: 5+ years of experience in DevOps, Site Reliability Engineering (SRE), or related fields, with at least 2 years in a leadership or mentoring role.

  • Cloud Technologies: Deep understanding of cloud providers (Azure, AWS, GCP) and hands-on experience with cloud architecture.

  • Architectural Design: Proven experience in providing architectural oversight, with a strong ability to make informed decisions that drive system performance and scalability.

  • Containerization: Proven experience with container orchestration platforms, particularly Kubernetes.

  • Scripting: Proficiency in scripting languages such as PowerShell, Python or Bash.

  • Monitoring and Logging: Familiarity with monitoring and logging tools such as Prometheus and the Grafana stack (Grafana, Mimir, Loki, Tempo).

  • Automation Tools: Experience with automation tools such as Ansible, Terraform, or Chef.

  • Soft Skills: Strong leadership qualities, excellent communication skills, and a collaborative mindset.

Preferred Qualifications:

  • Experience with CI/CD pipelines and relevant tools (Azure DevOps, Jenkins, GitLab CI, CircleCI, etc.).

  • Kubernetes certification (CKA, CKAD) and/or cloud certifications (Azure, AWS, GCP) are highly desirable.

  • Knowledge of security best practices and compliance standards in cloud environments.

  • Familiarity with Agile methodologies and project management tools.


Skills Required

  • Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience).
  • 5+ years of experience in DevOps, Site Reliability Engineering, or related fields.
  • At least 2 years in a leadership or mentoring role.
  • Hands-on experience with cloud providers: Azure, AWS, GCP.
  • Proven experience with containerization and orchestration (Docker, Kubernetes).
  • Proficiency in scripting languages such as PowerShell, Python, or Bash.
  • Familiarity with the monitoring and observability stack: Prometheus, Grafana, Mimir, Loki, Tempo, OpenTelemetry.
  • Experience with automation and IaC tools such as Ansible, Terraform, or Chef.
  • Proven experience providing architectural oversight and cloud architecture design.
  • Experience defining/implementing SLIs, SLOs, SLAs, incident response, RCA/postmortems, and reducing operational toil.
  • Experience applying AI/LLM capabilities to operations (incident triage, log/trace analysis, runbook automation).
  • Experience with CI/CD tools (Azure DevOps, Jenkins, GitLab CI, CircleCI).
  • Kubernetes certification (CKA, CKAD) and/or cloud certifications (Azure, AWS, GCP).
  • Knowledge of cloud security best practices and compliance standards.
  • Familiarity with Agile methodologies and project management tools.
The Company
Limerick, County Limerick
828 Employees
Year Founded: 2004

What We Do

AMCS is a global leader in integrated software and vehicle technology for the environmental, waste, recycling, and resource industries. We help our customers reduce their operating costs, increase asset utilization, optimize margins, and improve customer service. Our enterprise software and SaaS solutions deliver digital innovation to the emerging circular economy around the world. We are AMCS: digital ways to a cleaner world.
