Senior Site Reliability Engineer

Bangalore, Bengaluru Urban, Karnataka, IND
In-Office
Senior level
Information Technology
The Role
Design, implement, and maintain reliable cloud infrastructure. Lead incident management, optimize performance, and mentor junior SREs. Collaborate with other teams to ensure system reliability.

Why Lytx – Senior Site Reliability Engineer

At Lytx, our engineering culture is built around being hungry, low-ego, and highly capable. We are pragmatic engineers who take ownership, collaborate openly, and focus on delivering measurable operational impact. Our mission is to design, operate, and continuously improve the cloud infrastructure and operational platforms that power mission-critical SaaS and IoT services at scale.

As our systems grow in scale and complexity, we are investing in modern observability, intelligent automation, and data-driven operations to improve reliability, reduce operational noise, and enable faster detection and recovery.

The Site Reliability Engineering (SRE) team is responsible for the availability, reliability, observability, and resilience of our cloud-native environments. This includes building automation, improving operational visibility, and partnering with engineering teams to ensure services are designed and operated for reliability and scale.

As a Senior SRE, you will lead reliability improvements for critical services and platforms, contribute to observability and automation initiatives, and help drive operational excellence through proactive engineering and continuous improvement.

If you enjoy solving complex production challenges, improving system insight, and building automation that makes operations more efficient and reliable, this role is a great fit.

Responsibilities / You’ll Get To

Service & Platform Reliability Ownership - Own the reliability, performance, and operational health of critical services and infrastructure components, ensuring systems meet availability and performance expectations.

Observability Implementation - Design and implement monitoring, logging, tracing, and alerting to improve system visibility and ensure high-signal, low-noise operational insights.

Operational Automation - Build automation and tooling to reduce manual operational work, including runbook automation, self-healing workflows, and operational scripting.

Incident Response & Resolution - Lead response for high-severity incidents within your domain, participate in on-call rotations, and drive timely resolution to restore service.

Postmortems & Continuous Improvement - Conduct blameless postmortems, identify root causes, and implement corrective actions that prevent recurrence and improve system resilience.

Capacity & Performance Management - Analyze system performance and usage trends to support capacity planning, scaling strategies, and cost-efficient resource utilization.

Cloud & Infrastructure Engineering - Design, deploy, and operate scalable infrastructure in AWS using Infrastructure-as-Code and cloud-native best practices.

Cross-Functional Collaboration - Partner with product, platform, and development teams to embed reliability, observability, and performance best practices into system design and delivery.

AIOps & Operational Intelligence (Exposure) - Contribute to initiatives that improve operational signal quality, such as alert tuning, event correlation, anomaly detection, or automated remediation.

Team Contribution - Share operational knowledge and contribute to a culture of ownership, learning, and operational excellence.

Requirements / You’ll Need

Experience

  • 4-6 years of experience in SRE, DevOps, platform engineering, or cloud infrastructure roles supporting production environments.
  • Experience operating and supporting 24/7 systems, including participation in on-call rotations and incident response.

Cloud & Infrastructure

  • Hands-on experience designing and operating production workloads in AWS, including services such as EC2, EKS/ECS, RDS/DynamoDB, S3, ALB/NLB, VPC, IAM, and CloudWatch.
  • Experience building infrastructure using Terraform, CloudFormation, or similar Infrastructure-as-Code tools.

Observability

  • Experience implementing monitoring and alerting using tools such as New Relic, Datadog, Prometheus/Grafana, CloudWatch, or similar.
  • Exposure to OpenTelemetry or modern telemetry standards.
  • Ability to improve alert quality, dashboards, and operational visibility.
  • Experience with alert noise reduction, anomaly detection, or other data-driven operational improvements.

Automation & Scripting

  • Strong scripting or programming skills (Python, Go, Bash, or similar) for operational automation and tooling.

Systems Knowledge

  • Solid understanding of Linux systems, networking fundamentals (TCP/IP, DNS, TLS), and distributed system behavior.
  • Experience with Kubernetes and cloud-native architectures preferred.

Operational Excellence

  • Familiarity with AI/ML-assisted operational tooling or AIOps concepts.
  • Experience performing root cause analysis and driving reliability improvements.
  • Ability to troubleshoot complex production issues under pressure.

Collaboration & Communication

  • Strong collaboration skills with the ability to work across engineering teams.
  • Ability to influence reliability improvements within your domain through technical leadership and clear communication.

Innovation Lives Here

You go all in no matter what you do, and so do we. At Lytx, we’re powered by cutting-edge technology and Happy People. You want your work to make a positive impact in the world, and that’s what we do. Join our diverse team of hungry, humble and capable people united to make a difference.

Together, we help save lives on our roadways.

Find out how good it feels to be a part of an inclusive, collaborative team. We’re committed to delivering an environment where everyone feels valued, included and supported to do their best work and share their voices.

Lytx, Inc. is proud to be an equal opportunity/affirmative action employer and maintains a drug-free workplace. We’re committed to attracting, retaining and maximizing the performance of a diverse and inclusive workforce. EOE/M/F/Disabled/Vet.

Top Skills

Argo CD
AWS
Bash
DNS
Git
Grafana
Groovy
Helm
HTTP
Jenkins
Kubernetes
Linux
Load Balancing
Message Brokers
New Relic
Nginx
NoSQL
NTP
Prometheus
Python
REST
SMTP
SQL
SSH
SSL
TCP/IP
Terraform
Vault

The Company
Framingham, MA
790 Employees
Year Founded: 1998

What We Do

Learn how Lytx video telematics can help you improve safety, efficiency, and DOT compliance in your fleet. Start improving your fleet operations today.
