SRE

Posted 5 Days Ago
Hiring Remotely in Portugal
Remote
Senior level
Artificial Intelligence • Information Technology • Cybersecurity
The Role
Lead reliability and observability for a complex AI platform: define SLIs/SLOs, build monitoring and Grafana dashboards, run load tests and capacity planning, manage incidents and postmortems, operate Kubernetes-based infrastructure across cloud and on-prem, support customer deployments, and mentor the team.
Summary Generated by Built In
Description

We currently have several large-scale projects and are expanding our infrastructure team. Our product is an advanced platform for creating and managing AI agents. It can be deployed directly inside a customer’s infrastructure and delivered as an enterprise solution, while also being available as a SaaS version.

Under the hood, there is real-time voice and telephony, GPU and LLM inference, and streaming analytics, all running both in the cloud and on-prem, including in banking environments. There is a lot of infrastructure; it is complex, interesting, and sometimes at the edge of what is possible. That is why we are looking for a strong SRE who, like us, cares about making systems transparent, reliable, and built the right way.

This is a role for a strong, independent engineer: a Senior SRE with real influence and a voice in how things are built and operated.

You will also handle DevOps tasks for the team, but your main focus and area of expertise should be SRE: reliability, observability, incident management, and performance under load.

Requirements
  • 5+ years in SRE/DevOps. You have not just seen production; you have been responsible for the reliability of high-load production systems.
  • Deep, practical understanding of Docker and Kubernetes. You have operated them in production, not just used them in tutorials.
  • Mature understanding of metrics and alerts, with real hands-on experience writing, tuning, and maintaining them.
  • Practical experience with Prometheus, Alertmanager, and Grafana.
  • Ability and willingness to build dashboards and make them clear, useful, and easy to work with.
  • Experience with SLIs/SLOs, reliability management, incident investigation, and postmortems.
  • Experience with load testing and basic capacity planning.
  • Python: you can write code and confidently read and modify other people’s code for automation, exporters, tooling, and related tasks.
  • Cloud experience with GCP and/or AWS, strong Linux skills, and solid networking knowledge at an operational level.
  • DevOps fundamentals: CI/CD and infrastructure as code, including GitHub Actions, Terraform, Ansible, and similar tools.
  • Willingness to understand and support the product in customer environments, including on-prem deployments.
  • Ownership mindset: you take responsibility for a task, drive it to completion, and think one step ahead.
  • Friendly, non-toxic, and pleasant to work with.
  • Strong communication with developers: you can clearly and constructively explain your position, defend it when needed, and find common ground.
  • Willingness and ability to mentor, teach, and share knowledge with others.
  • Analytical mindset: you dig down to the root cause instead of just treating symptoms.
  • Proactivity: you would rather prevent an outage than heroically fight it later.
  • Strong attention to detail and reliability.

Nice to have

  • Experience using AI agents for routine and recurring tasks.
  • Real-time telephony: SIP, FreeSWITCH, RTP, WebRTC.
  • GPU/ML serving: Triton, vLLM, RunPod, Nebius, Lambda, run:ai, DCGM; understanding of the specifics of deploying LLM/ML models.
  • Streaming data and analytics: Kafka, ClickHouse.
  • Deep experience with IaC and GitOps, such as Terraform, Ansible, ArgoCD; logging with Loki/ELK; gRPC.
  • Experience working in isolated and highly secure environments.
  • Experience preparing systems for significant growth in load.
Responsibilities
  • You will be responsible for the reliability of our services: SLIs/SLOs, availability, and identifying and eliminating bottlenecks across the system.
  • You will set up monitoring for services, metrics, alerts, and dashboards. This will rarely come as a clearly defined task; more often, you will decide what is important to measure and bring it to a clear, usable view.
  • You will build and maintain Grafana dashboards that people actually use, both our team and our customers.
  • You will run load testing, analyze the results, and provide recommendations on resources and scaling.
  • You will investigate incidents, participate in on-call rotations, write and lead postmortems, and ensure the same failure does not happen again.
  • You will work closely with developers: communicate and defend your position, challenge technical decisions, and find win-win solutions.
  • You will develop and support Kubernetes-based infrastructure across our clouds, including GCP and AWS, automate routine work, and help with CI/CD and general team tasks.
  • You will take part in delivering and supporting the platform for customers, including on-prem deployments.
  • You will mentor colleagues and help raise the engineering bar across the team.
What we offer
  • The team has built award-winning AI products for tech corporations: devices, voice assistants, and other products that are actually in the world
  • Cutting-edge tech stack: Speech Technologies, NLP, Generative AI (LLMs, diffusion models), voice-first agentic architecture with privacy-first and on-premises deployment
  • High engineering bar and real ownership: the team cares about what actually works in production, not what looks good in a demo, and you'll see the impact of your work directly
  • Fast career progression: a senior-heavy team and a high volume of real problems mean you grow faster than you would anywhere else
  • Startup pace with enterprise stability: real clients, real revenue, no bureaucracy
  • Fully remote across Europe
  • 21 vacation days + public holidays + 5 sick days 
  • Private English lessons via Preply

Skills Required

  • 5+ years in SRE/DevOps, responsible for high-load production system reliability
  • Production experience with Docker and Kubernetes
  • Hands-on experience writing, tuning, and maintaining metrics and alerts
  • Practical experience with Prometheus, Alertmanager, and Grafana
  • Ability and willingness to build clear, useful dashboards
  • Experience with SLIs/SLOs, incident investigation, and postmortems
  • Experience with load testing and basic capacity planning
  • Proficient in Python for automation and tooling
  • Cloud experience with GCP and/or AWS
  • Strong Linux skills and solid operational networking knowledge
  • DevOps fundamentals: CI/CD and Infrastructure as Code (GitHub Actions, Terraform, Ansible)
  • Willingness to understand and support product in customer on-prem environments
  • Ownership mindset, strong communication, mentoring ability, analytical and proactive approach, attention to detail
  • Experience using AI agents for routine tasks
  • Real-time telephony: SIP, FreeSWITCH, RTP, WebRTC
  • GPU/ML serving experience (Triton, vLLM, RunPod, Nebius, run:ai, DCGM)
  • Streaming data and analytics experience (Kafka, ClickHouse)
  • Deep IaC/GitOps experience (ArgoCD), logging (Loki/ELK), gRPC
  • Experience working in isolated/highly secure environments and preparing systems for significant growth

Acclaim AI Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about Acclaim AI and has not been reviewed or approved by Acclaim AI.

  • Parental & Family Support: Parental leave is explicitly listed on the company’s Wellfound profile. Feedback suggests family support is part of the baseline perks communicated publicly.
  • Leave & Time Off Breadth: Generous vacation is highlighted on Wellfound. This points to broader time-off flexibility typical of startup-style packages.
  • Wellbeing & Lifestyle Benefits: Professional development and company events are called out on Wellfound. These signals indicate investment in growth and team connection beyond core benefits.


The Company
69 Employees

What We Do

Acclaim is a voice-first AI customer experience (CX) platform purpose-built for regulated industries including banking, fintech, healthcare, and insurance. It provides enterprises with goal-oriented AI agents that go beyond conversation to deliver agentic solutions that solve end-to-end business problems—orchestrating and executing complete customer workflows from outreach through resolution. Acclaim's solutions transform human-driven CX processes into AI-powered ones that are continuously learning and improving. Our platform helps organizations delight with human-quality conversations, accelerate revenue-driving interactions, and safeguard their data by maintaining strict compliance across every customer channel—creating more seamless customer experiences while improving the productivity and satisfaction of human agents. Built on a privacy-first architecture with on-premises or private cloud deployment, Acclaim ensures every interaction is secure, compliant, and delivers results that speak for themselves.

