Senior Site Reliability Engineer

Bengaluru, Bengaluru Urban, Karnataka, IND
In-Office
Senior level
Cloud
The Role
Design, build, and operate large-scale cloud services with an automation-first mindset. Lead incident response, define SLIs/SLOs, improve observability, implement IaC and CI/CD, modernize workloads, build self-service platforms, and mentor engineers to improve reliability, scalability, and security of production systems.
Summary Generated by Built In

Secure Every Identity, from AI to Human
Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organizations to safely embrace this new era. This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence.
This is an opportunity to do career-defining work. We're all in on this mission. If you are too, let's talk.

Get to know Okta

Okta is The World’s Identity Company. We free everyone to safely use any technology, anywhere, on any device or app. Our Workforce and Customer Identity Clouds enable secure yet flexible access, authentication, and automation that transform how people move through the digital world, putting Identity at the heart of business security and growth.


At Okta, we celebrate a variety of perspectives and experiences. We are not looking for someone who checks every single box; we’re looking for lifelong learners and people who can make us better with their unique experiences.


Join our team! We’re building a world where Identity belongs to you. 


The Engineering Opportunity

We are looking for an experienced Senior Site Reliability Engineer to join Okta's Emerging Products Group (EPG). Our mission is to build highly reliable, scalable, and secure cloud services that our customers can trust. We embrace an automation-first mindset and continuously invest in platform engineering, observability, and operational excellence to enable our engineering teams to move quickly and safely.

This role is ideal for an experienced Site Reliability Engineer who enjoys solving complex technical challenges at scale, building automation, and improving the reliability of production systems. You will serve as a key contributor within the EPG SRE organization, partnering closely with software engineers, architects, and product teams to design, build, and operate world-class cloud services.

The ideal candidate exemplifies the philosophy of "if you have to do it more than once, automate it" and possesses a strong passion for continuous improvement, operational excellence, and software engineering.

What You'll Be Doing

Reliability & Operations
  • Design, build, and operate large-scale cloud infrastructure and production services.
  • Participate in an on-call rotation supporting highly available customer-facing systems.
  • Lead incident response efforts and drive post-incident reviews focused on systemic improvements.
  • Define, measure, and improve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
  • Partner with engineering teams to improve service availability, scalability, performance, and resilience.
  • Continuously improve observability through metrics, logging, tracing, dashboards, and alerting.
Engineering & Automation
  • Develop software, automation, and infrastructure using Go, Python, Terraform, and related technologies.
  • Eliminate operational toil through automation, tooling, and platform engineering.
  • Improve deployment safety and operational workflows through CI/CD and GitOps practices.
  • Collaborate on modernizing existing workloads and aligning them with evolving platform capabilities.
  • Build self-service platforms, operational guardrails, and automation that improve developer velocity while maintaining reliability and security.
Technical Leadership
  • Contribute to and drive reliability initiatives within the product group.
  • Guide engineers in adopting operational best practices and reliability engineering principles.
  • Mentor engineers through technical collaboration, design reviews, incident analysis, and knowledge sharing.
  • Support architecture and operational decisions through data-driven recommendations and engineering expertise.
  • Execute projects from conception through production rollout and long-term operational ownership.
Innovation
  • Explore and apply AI-assisted engineering techniques to improve operational efficiency, incident response, troubleshooting, and automation.
  • Identify opportunities to leverage emerging technologies to reduce toil and improve engineering productivity.
Our Tech Stack
  • Infrastructure/Orchestration: Kubernetes (EKS/GKE), Terraform, Helm, Git, ArgoCD, GitOps
  • Programming: Golang, Python
  • Observability: Datadog, Splunk
  • Data Stores: PostgreSQL, Redis, OpenSearch

What We Are Looking For

Technical Excellence
  • Strong experience operating large-scale production services in AWS and/or GCP.
  • Deep expertise with Kubernetes in production environments.
  • Experience troubleshooting Kubernetes networking, storage, scheduling, scaling, and workload lifecycle issues.
  • Extensive experience with Infrastructure as Code technologies such as Terraform and Helm.
  • Strong software engineering skills in Golang and/or Python.
  • Experience building automation and internal engineering platforms.
  • Experience operating and troubleshooting distributed data platforms such as PostgreSQL, Redis, OpenSearch, MySQL, Cassandra, or similar technologies.
  • Strong understanding of cloud networking fundamentals including DNS, load balancing, ingress, TLS, service networking, and traffic management.
  • Experience with observability platforms, monitoring strategies, and production telemetry.
  • Experience with or strong interest in AI-assisted engineering and operational automation.
Operational Excellence
  • Strong expertise operating customer-facing production systems.
  • Experience leading incident response and driving operational improvements.
  • Deep understanding of reliability engineering concepts including SLIs, SLOs, error budgets, and capacity planning.
  • Strong understanding of CI/CD pipelines, deployment strategies, and automation-first operational practices.
  • Proven ability to balance reliability, scalability, security, and engineering velocity.
Security & Compliance
  • Understanding of cloud security fundamentals, IAM, secrets management, and secure infrastructure design.
  • Experience implementing operational controls and best practices in regulated or security-sensitive environments is a plus.
Leadership
  • Demonstrated experience contributing to complex engineering initiatives.
  • Strong collaboration and communication skills.
  • Experience working effectively within globally distributed engineering organizations spanning multiple timezones and cultures.
  • Experience mentoring engineers and elevating technical capabilities within an organization.
  • Ability to collaborate on technical direction through expertise, partnership, and execution.
Preferred Qualifications
  • Experience operating SaaS platforms serving large-scale customer workloads.
  • Experience working within Kubernetes-based microservices environments.
  • Experience supporting globally distributed production environments.
  • Experience with GitOps and ArgoCD.
  • Experience implementing AI-assisted operational tooling or automation workflows.

The Okta Experience

  • Supporting Your Well-Being 
  • Driving Social Impact 
  • Developing Talent and Fostering Connection + Community

We are intentional about connection. Our global community, spanning over 20 offices worldwide, is united by a drive to innovate. Your journey begins with an immersive, in-person onboarding experience designed to accelerate your impact and connect you to our mission and team from day one.
Okta is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, marital status, age, physical or mental disability, or status as a protected veteran. We also consider for employment qualified applicants with arrest and conviction records, consistent with applicable laws.
If reasonable accommodation is needed to complete any part of the job application, interview process, or onboarding, please use this Form to request an accommodation.
Notice for New York City Applicants & Employees: Okta may use Automated Employment Decision Tools (AEDT), as defined by New York City Local Law 144, that use artificial intelligence, machine learning, or other automated processes to assist in our recruitment and hiring process. In accordance with NYC Local Law 144, if you are an applicant or employee residing in New York City, please click here to view our full NYC AEDT Notice.

Skills Required

  • Strong experience operating large-scale production services in AWS and/or GCP
  • Deep expertise with Kubernetes in production (EKS/GKE)
  • Experience troubleshooting Kubernetes networking, storage, scheduling, scaling, and workload lifecycle
  • Extensive experience with Infrastructure as Code such as Terraform and Helm
  • Strong software engineering skills in Go (Golang) and/or Python
  • Experience building automation, internal engineering platforms, and eliminating operational toil
  • Experience operating and troubleshooting distributed data platforms (PostgreSQL, Redis, OpenSearch, MySQL, Cassandra, or similar)
  • Strong understanding of cloud networking fundamentals (DNS, load balancing, ingress, TLS, service networking, traffic management)
  • Experience with observability platforms and production telemetry (Datadog, Splunk, metrics, logging, tracing)
  • Experience leading incident response and driving post-incident improvements
  • Deep understanding of reliability engineering concepts (SLIs, SLOs, error budgets, capacity planning)
  • Strong understanding of CI/CD pipelines, deployment strategies, and GitOps practices
  • Understanding of cloud security fundamentals, IAM, and secrets management
  • Experience implementing operational controls in regulated or security-sensitive environments
  • Experience with GitOps and ArgoCD
  • Experience with AI-assisted engineering or interest in AI-assisted operational tooling

Okta Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified in responses that popular LLMs generated to common candidate questions about Okta. It has not been reviewed or approved by Okta.

  • Healthcare Strength: Health coverage spans medical, dental, vision, mental-health support, and income protection, complemented by preventive care options and wellness resources. These elements indicate robust coverage for both routine needs and more complex situations.
  • Parental & Family Support: Policies include paid parental leave, adoption and surrogacy assistance, and fertility and family-building benefits. Caregiving resources and flexible arrangements help employees navigate family responsibilities.
  • Leave & Time Off Breadth: Flexible or unlimited PTO, separate sick time, paid holidays, and a company Wellbeing Week provide multiple avenues for time away. This breadth supports rest, recovery, and work-life balance.

The Company
HQ: San Francisco, CA
6,000 Employees
Year Founded: 2009

What We Do

Okta is the leading independent identity provider. The Okta Identity Cloud enables organizations to securely connect the right people to the right technologies at the right time. With more than 7,000 pre-built integrations to applications and infrastructure providers, Okta provides simple and secure access to people and organizations everywhere, giving them the confidence to reach their full potential. More than 10,000 organizations, including JetBlue, Nordstrom, Siemens, Slack, T-Mobile, Takeda, Teach for America, and Twilio, trust Okta to help protect the identities of their workforces and customers.
