Level AI

Senior Site Reliability Engineer (Noida, BLR, India)

2 Locations

Hybrid

Senior level

Artificial Intelligence • Natural Language Processing • Software • Conversational AI

The Role

Lead Kubernetes and GPU-focused infrastructure: drive FinOps and right‑sizing, optimize on‑prem GPU throughput, build tooling and dashboards for backend teams, ensure reliability instrumentation, and own selected platform security workstreams.

Summary Generated by Built In

About Level AI

Level AI is on a mission to turn every customer interaction into a strategic advantage. Our AI-native platform helps enterprises transform contact centers from cost centers into engines of customer intelligence, operational efficiency, and business growth. By combining advanced AI with deep domain understanding of customer experience, Level AI empowers teams to unlock actionable insights, automate workflows, and deliver more consistent, higher-quality support across the customer journey.

Headquartered in Mountain View, California, Level AI is a Series C company backed by leading investors including Battery Ventures and ENIAC. Our platform leverages Large Language Models and Custom Small Language Models (SLMs) to power AI Agents across the entire CX journey—customer-facing agents, agent-assist, and backend automation—along with deep conversation analytics for QA, coaching, and insights.

About the role
The Senior SRE will be positioned at the intersection of backend engineering, infrastructure operations, and FinOps. The role is explicitly broader than a traditional DevOps engineer and explicitly more hands-on than a pure architect.

What you'll be responsible for:

Infrastructure cost efficiency and FinOps. Own the continued reduction of Kubernetes overprovisioning, drive right-sizing programs, and maintain the cost telemetry that backend teams use to make decisions.

GPU throughput optimization. Run a structured experimentation program on on-premise GPU clusters, partnering with AI service owners. The program is led by Engineering leadership, with this role providing the experimental bandwidth.

Backend enablement, not ownership absorption. Build the tooling, dashboards, and processes that let backend teams from other groups own their own cost and reliability budgets. The deliverable is leverage, not headcount-shaped work.

Reliability instrumentation. As the infra team owns most of the instrumentation across new and offline flows, this role takes a central seat in making sure that surface area is captured properly for both cost-at-scale and reliability.

Selective security workstreams. Take on a defined slice of the active security work so that senior DevOps engineers are not the single point of execution for security-adjacent platform changes.

We'd love to learn more about you if you have:

This role explicitly requires 4-5 years of hands-on systems experience. We are not looking for someone who will lean entirely on AI tooling to discover what to do; we are looking for someone who already knows what to ask, and can use AI tooling as a force multiplier on top of that judgement.

Backend engineering depth: production experience in Python, Go/Rust, comfortable owning services end to end, able to read and reason about backend code across teams.

Kubernetes at scale: scheduler behavior, resource requests/limits, HPA/VPA, node pool design, cost-aware autoscaling (Cast AI, Karpenter, or equivalent).

Cloud and on-premise infrastructure: GCP fluency, IaC (Terraform), CI/CD, and comfort operating in hybrid setups including on-prem GPU clusters.

GPU workload understanding: familiarity with throughput profiling, batching, KV-cache behavior, inference server tuning, and GPU utilization metrics.

Observability and reliability: metrics, traces, logs, SLOs, and the discipline to instrument systems properly rather than reactively.

FinOps mindset: demonstrated history of converting infrastructure choices into measurable cost outcomes.

Security baseline: able to take on platform-security workstreams without requiring constant handoff to the DevOps team.

Skills Required

4-5 years hands-on systems experience
Production experience in Python, Go, or Rust
Kubernetes at scale (scheduler behavior, resource requests/limits, HPA/VPA, node pool design, cost-aware autoscaling)
GCP fluency, Infrastructure-as-Code (Terraform), and CI/CD experience; comfort operating hybrid on-prem GPU clusters
GPU workload understanding (throughput profiling, batching, KV-cache behavior, inference server tuning, GPU utilization metrics)
Observability and reliability (metrics, traces, logs, SLOs, disciplined instrumentation)
FinOps mindset with history of converting infrastructure choices into measurable cost outcomes
Platform security baseline, with the ability to take on defined security workstreams

The Company

HQ: Mountain View, CA

122 Employees

Year Founded: 2018

What We Do

Level AI (https://thelevel.ai) is a startup based in Mountain View, CA and Delhi, India, innovating in the Voice AI space. We are backed by top VCs, Silicon Valley technologists, and industry experts. Our mission is for AI to augment workers, not replace them. We are innovating in speech AI, NLP, and information retrieval systems to bring customers and businesses closer to one another. The team has experience from Amazon Alexa, Google, and other leading AI organizations.