We’re a fast-moving AI security company building AI-native infrastructure and applications powered by LLMs and autonomous agents. Our stack is deeply integrated with AWS, Kubernetes, and OpenAI-based systems, and we’re rethinking reliability in a world where software can reason, adapt, and self-heal.
We’re hiring a Senior SRE Engineer to own reliability across our cloud-native and AI-driven platform. You’ll work at the intersection of distributed systems, Kubernetes operations, and LLM-powered automation, building systems that don’t just scale but also think and fix themselves.
WHAT YOU BRING
- 5+ years in SRE / DevOps / Platform Engineering.
- Strong hands-on experience with:
  - AWS infrastructure at scale
  - Kubernetes (production-grade clusters)
- Proven ability to debug complex distributed systems under pressure.
- Strong coding skills (Python or Go); you build internal platforms and tools.
- Experience implementing monitoring, alerting, and incident management systems.
Bonus (AI / LLM Focus)
- Experience working with LLM APIs such as the OpenAI API.
- Familiarity with agent frameworks like:
  - LangChain
  - AutoGen
- Built or experimented with:
  - AI agents for DevOps / SRE workflows
  - Retrieval-Augmented Generation (RAG) systems
  - Vector databases (Pinecone, Weaviate, etc.)
- Exposure to AIOps or intelligent automation systems.
WHAT YOU WILL BE DOING
- Own uptime, reliability, and performance of services running on AWS + Kubernetes (EKS).
- Design and implement self-healing infrastructure using automation and AI agents.
- Build LLM-powered operational tooling using APIs such as the OpenAI API for:
  - Intelligent alert triage
  - Incident summarization
  - Root cause analysis
  - Runbook automation
- Manage and scale Kubernetes workloads:
  - Deployments, autoscaling, resource optimization
  - Cluster reliability and cost efficiency
- Build and evolve observability systems:
  - Metrics (Prometheus), dashboards (Grafana)
  - Logs (ELK / OpenSearch)
  - Tracing (OpenTelemetry)
- Define and enforce SLOs, SLAs, and error budgets tied to business metrics.
- Automate infrastructure using Terraform and CI/CD pipelines.
- Lead incident response, postmortems, and continuous reliability improvements.
- Introduce chaos engineering practices to proactively test system resilience.
Saviynt Compensation & Benefits Highlights
The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about Saviynt and has not been reviewed or approved by Saviynt.
- Leave & Time Off Breadth: Time off is described as flexible, with policies including flexible time off and mentions of unlimited PTO. This breadth can make time away easier to take alongside company holidays.
- Wellbeing & Lifestyle Benefits: In-office amenities such as catered food, drinks, and snacks, plus social events like birthday celebrations and team outings, are highlighted. These lifestyle perks add day-to-day convenience and connection.
- Career-Linked Recognition & Rewards: Employee recognition is emphasized, with programs to celebrate those who go above and beyond. Regular recognition activities are cited alongside team bonding initiatives.
Saviynt Insights
What We Do
Saviynt’s Enterprise Identity Cloud helps modern enterprises scale cloud initiatives and solve the toughest security and compliance challenges in record time. The company brings together identity governance (IGA), granular application access, cloud security, and privileged access to secure the entire business ecosystem and provide a frictionless user experience.