Acclaim AI

Senior DevOps Engineer

Posted Yesterday

27 Locations

Remote

Senior level

Artificial Intelligence • Information Technology • Cybersecurity

The Role

Deploy, operate, and scale Kubernetes-based microservices across cloud and on-prem; run GPU ML inference services; build Docker images and Helm charts; design CI/CD pipelines; implement IaC with Terraform and Ansible; build observability (Grafana/Prometheus/ELK/Loki); ensure SOC 2 compliance and secure cluster access.

Summary Generated by Built In

Description

We are looking for a DevOps/SRE Engineer to strengthen our team!

Requirements

Minimum 5 years of experience in a DevOps and/or Site Reliability Engineering role
Strong hands-on experience with Linux system administration
Extensive experience deploying, operating, and scaling Kubernetes in both cloud and bare-metal environments
Deep expertise and practical experience with at least one major cloud provider (preferably Google Cloud Platform)
Experience with ML inference on GPU/CPU is a strong plus
Proven experience implementing SRE practices and building observability stacks using Grafana, Prometheus, and Loki
Strong adherence to GitOps, Infrastructure as Code (IaC), and CI/CD principles
Advanced expertise in Terraform, Ansible, and Python
Comfortable working in high-uncertainty environments: we are building a new product, requirements evolve quickly, and the ability to rapidly learn new technologies and patterns is essential
Proactive mindset: ability to look beyond DevOps tasks and actively debug and understand the product
Strategic thinking: ability to choose technologies and architectural approaches based on long-term goals rather than short-term compromises

Responsibilities

Deploy, operate, and evolve a microservices-based platform running in Kubernetes clusters across AWS, GCP, and on-prem (Rancher)
Operate and support GPU-based ML inference services (Triton Inference Server, vLLM) deployed on RunPod, Scaleway, and Nebius
Build and maintain Docker images for all microservices and ensure a stable service lifecycle
Maintain and scale development and production Kubernetes clusters, and actively participate in deployment debugging, incident investigation, and performance troubleshooting
Develop, maintain, and evolve custom Helm charts for each service
Design and operate CI/CD pipelines using GitHub (code and pipelines) and GitLab for on-prem customer deployments
Ensure platform compliance with SOC 2 requirements and actively contribute to improving security and compliance processes
Manage cluster access via NetBird VPN, implementing role-based access control using group policies
Deploy and manage infrastructure using IaC practices with Terraform and Ansible
Develop and continuously improve observability systems: Grafana & Prometheus for metrics, and the ELK stack for centralized log storage and analysis
Continuously optimize infrastructure in the areas of IaC, IAM, Observability, and CI/CD
Work with a technology stack that includes Python, Kubernetes, Linux, Docker, GitHub CI/CD, PostgreSQL, ClickHouse, Kafka, Superset, Terraform, and Ansible

What we offer

The team has built award-winning AI products for tech corporations — devices, voice assistants, products that are actually in the world
Cutting-edge tech stack: Speech Technologies, NLP, Generative AI (LLMs, diffusion models), voice-first agentic architecture with privacy-first and on-premises deployment
High engineering bar and real ownership — the team cares about what actually works in production, not what looks good in a demo, and you'll see the impact of your work directly
Fast career progression: a senior-heavy team and a high volume of real problems mean you grow faster than you would anywhere else
Startup pace with enterprise stability — real clients, real revenue, no bureaucracy
Fully remote
21 vacation days + public holidays + 5 sick days
Private English lessons via Preply

Acclaim AI Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about Acclaim AI and has not been reviewed or approved by Acclaim AI.

Parental & Family Support — Parental leave is explicitly listed on the company’s Wellfound profile. Feedback suggests family support is part of the baseline perks communicated publicly.
Leave & Time Off Breadth — Generous vacation is highlighted on Wellfound. This points to broader time-off flexibility typical of startup-style packages.
Wellbeing & Lifestyle Benefits — Professional development and company events are called out on Wellfound. These signals indicate investment in growth and team connection beyond core benefits.

The Company

69 Employees

What We Do

Acclaim is a voice-first AI customer experience (CX) platform purpose-built for regulated industries including banking, fintech, healthcare, and insurance. It provides enterprises with goal-oriented AI agents that go beyond conversation to deliver agentic solutions that solve end-to-end business problems—orchestrating and executing complete customer workflows from outreach through resolution. Acclaim's solutions transform human-driven CX processes into AI-powered ones that are continuously learning and improving. Our platform helps organizations delight with human-quality conversations, accelerate revenue-driving interactions, and safeguard their data by maintaining strict compliance across every customer channel—creating more seamless customer experiences while improving the productivity and satisfaction of human agents. Built on a privacy-first architecture with on-premises or private cloud deployment, Acclaim ensures every interaction is secure, compliant, and delivers results that speak for themselves.