Okta

Senior Site Reliability Engineer - Observability

Posted 2 Days Ago

Bengaluru, Bengaluru Urban, Karnataka, IND

In-Office

Senior level

Cloud

The Role

Build, run, and scale Okta's multi-cloud observability stack for metrics, logs, traces, and alerts. Implement Monitoring-as-Code with Terraform and CI/CD, optimize Splunk and Grafana pipelines, instrument services with OpenTelemetry and Prometheus, reduce alert noise via programmatic guardrails and auto-remediation, automate operational debt, and participate in on-call rotations and post-incident reviews.

Summary Generated by Built In

Secure Every Identity, from AI to Human
Identity is the key to unlocking the potential of AI. Okta secures AI by building the trusted, neutral infrastructure that enables organizations to safely embrace this new era. This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence.
This is an opportunity to do career-defining work. We're all in on this mission. If you are too, let's talk.

Position Overview
We are seeking a highly technical Senior Site Reliability Engineer (P3) to help build, run, and scale Okta’s enterprise-grade multi-cloud observability ecosystem. In this role, you will be a core engine driving our full-stack telemetry infrastructure—ensuring that massive streams of Metrics, Logs, Traces, and Alerts are processed efficiently, securely, and cost-effectively.
As a Senior SRE, you will treat Monitoring as Code (MaC). You will utilize automation frameworks like Terraform and write robust code (Go/Python) to build self-healing pipelines, optimize backend telemetry engines (Splunk and Grafana/Mimir/Loki), and eliminate manual operations.
About the Team: Workforce Identity Cloud
Okta Workforce Identity Cloud (WIC) provides easy, secure access for your workforce so you can focus on other strategic priorities, like reducing costs and doing more for your customers.
If you like to be challenged and have a passion for solving large-scale automation, testing, and tuning problems, we would love to hear from you. The ideal candidate exemplifies the ethic of “If you have to do something more than once, automate it” and can rapidly self-educate on new concepts and tools.
Key Responsibilities
- Full-Stack Telemetry Operations: Own and optimize the end-to-end collection, processing, and visualization pipelines for Metrics, Logs, Traces, and Alerts across highly distributed multi-cloud (AWS/GCP) environments.
- Splunk & Grafana Optimization: Act as a hands-on expert in optimizing log pipelines. Drive indexer performance, tune search efficiency (SPL), and clean up heavy dashboard queries to reduce latency and infrastructure footprint (FinOps).
- Monitoring as Code (MaC): Standardize, deploy, and maintain core observability tools, agent relays, and collectors natively using Terraform and automated CI/CD pipelines.
- Distributed Tracing & Metrics: Implement and scale OpenTelemetry (OTel) standards, Prometheus/Mimir, and tracing frameworks to map end-to-end request flows across core microservices (such as our Project Harmony initiative).
- Alert & Dashboard Governance: Implement smart, programmatic alerting guardrails to combat alert fatigue. Deflate noise by building intelligent, actionable alert pathways that route directly to auto-remediation workflows.
- Operational Drive: Help lead the execution against our operational backlog, eliminating systemic technical debt through automation and structural engineering changes.
- On-Call & Incident Co-Pilot: Participate in on-call rotations, providing tier-3 technical escalation support. Run technical post-incident reviews to convert major outages into programmatic observability checks.
Required Skills & Experience (The Essentials)
- Experience: 5+ years of dedicated experience in an SRE, DevOps, or Platform Engineering role managing highly resilient, large-scale distributed systems.
- Log Analytics Mastery (Splunk): Deep, practical experience with Splunk administration, search optimization, cluster maintenance, and writing highly efficient SPL queries at scale.
- Full-Stack Tooling: Hands-on proficiency with major observability tooling suites, specifically Grafana, Prometheus, Loki/Mimir, Cortex, or equivalent cloud-native stacks.
- Telemetry Standards: Practical experience instrumenting applications and infrastructure using OpenTelemetry (OTel), Prometheus metrics, or Jaeger/Tempo distributed tracing.
- Strong Programming Skills: Highly proficient in Go (Golang) or Python for building internal automation, custom exporters, and engineering custom SRE tooling.
- Infrastructure as Code: Solid experience writing, modularizing, and executing production-grade Terraform configurations.
- Cloud & Containers: Deep understanding of Linux internals, core networking protocols (TCP/IP, DNS, TLS), and container orchestration platforms like Amazon EKS / Kubernetes.
Bonus Skills (The "Nice-to-Haves")
- AI & Agentic SRE: Experience or interest in building AI/LLM-driven troubleshooting assistants, automated alert triaging, or smart pattern-matching tools.
- Multi-Cloud Networking: Experience bridging telemetry across hybrid cloud footprints (AWS to GCP).
- Security & Compliance: Familiarity with logging compliance standards, such as Federal STIGs or FIPS-compliant data handling.

P24819_3370625

The Okta Experience

- Supporting Your Well-Being
- Driving Social Impact
- Developing Talent and Fostering Connection + Community

We are intentional about connection. Our global community, spanning over 20 offices worldwide, is united by a drive to innovate. Your journey begins with an immersive, in-person onboarding experience designed to accelerate your impact and connect you to our mission and team from day one.
Okta is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, marital status, age, physical or mental disability, or status as a protected veteran. We also consider for employment qualified applicants with arrest and conviction records, consistent with applicable laws.
If reasonable accommodation is needed to complete any part of the job application, interview process, or onboarding, please use this form to request an accommodation.
Notice for New York City Applicants & Employees: Okta may use Automated Employment Decision Tools (AEDT), as defined by New York City Local Law 144, that use artificial intelligence, machine learning, or other automated processes to assist in our recruitment and hiring process. In accordance with NYC Local Law 144, if you are an applicant or employee residing in New York City, please click here to view our full NYC AEDT Notice.

Skills Required

5+ years in SRE, DevOps, or Platform Engineering managing large-scale distributed systems
Deep Splunk administration, search optimization, cluster maintenance, and efficient SPL query writing
Hands-on experience with Grafana, Prometheus, Loki/Mimir, Cortex or equivalent observability stacks
Practical experience instrumenting with OpenTelemetry, Prometheus metrics, or Jaeger/Tempo tracing
Proficiency in Go (Golang) or Python for automation and custom SRE tooling
Production-grade Terraform authoring, modularization, and execution
Deep understanding of Linux internals and core networking protocols (TCP/IP, DNS, TLS)
Experience with cloud platforms (AWS and/or GCP) and container orchestration (Amazon EKS / Kubernetes)
Experience participating in on-call rotations and handling tier-3 incident escalation
Experience or interest building AI/LLM-driven troubleshooting assistants or automated triage tools
Experience bridging telemetry across hybrid multi-cloud environments (AWS to GCP)
Familiarity with logging compliance standards (e.g., STIGs, FIPS)

Okta Compensation & Benefits Highlights

The following summarizes recurring compensation and benefits themes identified from responses generated by popular LLMs to common candidate questions about Okta and has not been reviewed or approved by Okta.

Healthcare Strength — Health coverage spans medical, dental, vision, mental-health support, and income protection, complemented by preventive care options and wellness resources. These elements indicate robust coverage for both routine needs and more complex situations.
Parental & Family Support — Policies include paid parental leave, adoption and surrogacy assistance, and fertility and family‑building benefits. Caregiving resources and flexible arrangements help employees navigate family responsibilities.
Leave & Time Off Breadth — Flexible or unlimited PTO, separate sick time, paid holidays, and a company Wellbeing Week provide multiple avenues for time away. This breadth supports rest, recovery, and work‑life balance.

The Company

HQ: San Francisco, CA

6,000 Employees

Year Founded: 2009

What We Do

Okta is the leading independent identity provider. The Okta Identity Cloud enables organizations to securely connect the right people to the right technologies at the right time. With more than 7,000 pre-built integrations to applications and infrastructure providers, Okta provides simple and secure access to people and organizations everywhere, giving them the confidence to reach their full potential. More than 10,000 organizations, including JetBlue, Nordstrom, Siemens, Slack, T-Mobile, Takeda, Teach for America, and Twilio, trust Okta to help protect the identities of their workforces and customers.