Staff SRE

Reposted 3 Days Ago
Hiring Remotely in Virginia, USA
Remote
184K-233K Annually
Senior level
Information Technology
The Role
Lead technical strategy for observability, operational intelligence, and reliability. Architect telemetry and automation platforms, drive AIOps and large-scale IaC, lead incident response, mentor senior engineers, and standardize SLO/SLI and reliability practices across AWS cloud-native environments.
Summary Generated by Built In

Why Lytx:

The Site Reliability Engineering team is responsible for the availability, reliability, observability, and resilience of infrastructure and related automation across the entire fleet of on-prem servers and the organization’s expanding cloud footprint. The team’s work is critical to business continuity. If you love crafting new solutions and building scalable cloud and on-prem infrastructure, then this role may be an excellent match for you!

You’ll get to:
  • Build tools and frameworks to monitor systems and ensure the highest level of uptime in production environments.

  • Mentor the SRE team on best practices. Develop a culture of innovation.

  • Take the lead in enhancing our 24/7 on-call and incident management processes. Build and maintain runbooks. Contribute to the design and documentation of cloud services and SOPs.

  • Influence service design by working closely with architects, DBAs, developers, DevOps, and data engineers to build reliability, scalability, and cost optimization into services early in the development process.

  • Lead blameless post-mortems. Take ownership of publishing RCA documents for internal and external consumption.

  • Lead initiatives with Service Owners to define the SLOs and build SLIs to ensure systems are meeting the SLAs.

  • Research and evaluate new cloud technologies and vendor offerings to enhance product stability and manageability.

  • Reduce operational toil and maintain a high degree of automation by adopting IaC-first and GitOps principles.

  • Acquire and maintain a thorough understanding of Lytx production services to ensure timely resolution of production incidents.

You’ll Need:
  • 8+ years of experience as an SRE in an AWS environment at a medium- to large-scale organization.

  • 6+ years of hands-on experience implementing and managing observability tools (Prometheus, New Relic, Grafana, etc.).

  • High degree of proficiency in programming, preferably using Python, Groovy, and Bash.

  • Hands-on experience managing database technologies (SQL and NoSQL).

  • 5+ years of experience building infrastructure deployment pipelines using Git, Terraform, Helm, Jenkins/Jenkins X/Argo CD, etc.

  • Proficient in designing production environments in the AWS cloud using various AWS services (VPCs, EKS, IAM, AMI, EC2, CloudWatch, CloudTrail, Control Tower, GuardDuty, MSK, S3, Glacier, Gateways, Direct Connect, Route 53, RDS, ALBs, Auto Scaling, etc.).

  • Extensive experience with Linux systems and various protocols and technologies (HTTP, REST, TCP/IP, SSL, DNS, SMTP, SSH, NTP, load balancing, SQL/NoSQL, message brokers, Nginx, Vault, ELK, etc.).

  • Hands-on experience with Kubernetes and various container and cloud native technologies.

  • Significant experience participating in, implementing, and managing a 24/7 on-call rotation for an SRE team, creating runbooks, building support procedures, and proactively monitoring systems across geographical locations.

  • Ability to work well under pressure within a technically challenging environment.

Preferred Experience:
  • Hands-on experience managing sophisticated networks in the AWS cloud (Direct Connect, Transit Gateway, VPNs, BGP, firewalls, CDNs).

  • Hands-on experience managing cloud databases (AWS RDS, MongoDB, Elasticsearch, Snowflake).

  • Certifications: multiple AWS certifications, plus certifications in Kubernetes, Linux, programming, and CI/CD.

Benefits:

  • Medical, dental and vision insurance 
  • Health Savings Account
  • Flexible Spending Accounts
  • Telehealth
  • 401(k) and 401(k) match
  • Life and AD&D insurance
  • Short-Term and Long-Term Disability
  • FTO or PTO
  • Employee Well-Being program
  • 11 paid holidays plus 1 inclusive holiday per year
  • Volunteer Time Off
  • Employee Referral program
  • Education Reimbursement Program
  • Employee Recognition and Appreciation program
  • Additional perk and voluntary benefit programs

Salary is based on a number of factors including market location and may vary depending on job-related knowledge, skills, and experience.  This position is also eligible for an incentive compensation plan.  The expected hiring salary for this position is:

$183,500.00 - $232,500.00

You’re driven to succeed and so are we. At Lytx, our mission is to protect a world in motion, and we do it by building technology and partnerships that help keep people safe on the road. The way we work is guided by our shared values: Deliver for the customer, Responsibility in every outcome, Innovate with purpose, Velocity with excellence, and Elevate each other.

If you’re looking for meaningful work, a team that challenges and supports you, and the chance to grow your career while making a real impact, we’d love to meet you.

Together, we’re helping make roadways safer and saving lives!

Lytx, Inc. is proud to be an equal opportunity employer. We’re committed to building a diverse and inclusive workforce and do not discriminate based on race, color, religion, sex, sexual orientation, gender identity or expression, gender, genetic information, uniformed service, national origin, age, veteran status, disability, pregnancy, or any other status protected by federal or state law. We are committed to providing reasonable accommodation for candidates with disabilities who need assistance during the hiring process. To request a reasonable accommodation, please email [email protected].  Lytx conducts background checks on applicants who receive a conditional offer of employment in accordance with applicable local, state, federal and regional laws. Qualified applicants with arrest or conviction records will be considered. Background check results may potentially result in the withdrawal of a conditional offer of employment and will be made in accordance with all applicable local, state, federal and regional laws. 

Skills Required

  • 8-10+ years SRE, platform engineering, or cloud infrastructure experience supporting large-scale production environments.
  • Demonstrated experience leading architecture, reliability strategy, or operational platforms across multiple teams.
  • Proven track record operating 24/7 production environments, incident leadership, and postmortem practices.
  • Deep expertise designing and operating large-scale AWS environments (VPC, EC2, EKS/ECS, RDS/DynamoDB, S3, ALB/NLB, IAM, KMS, Route 53, multi-account).
  • Experience designing resilient, fault-tolerant systems using multi-AZ/multi-region patterns, graceful degradation, rate limiting, and capacity management.
  • Senior-level experience with observability platforms and telemetry (New Relic, Datadog, Prometheus, Grafana, OpenTelemetry) and low-noise alerting.
  • Experience defining telemetry standards, instrumentation strategies, centralized dashboards, and improving operational signal quality (correlation, noise reduction).
  • Experience implementing or evaluating AIOps capabilities (anomaly detection, event correlation, predictive alerting, automated remediation).
  • Expert-level Infrastructure-as-Code with Terraform and/or CloudFormation, reusable modules, and GitOps workflows.
  • Strong scripting/programming skills (Python, Go, Bash, or similar) for automation and operational tooling.
  • Expert understanding of Linux systems, networking (TCP/IP, DNS, TLS), and distributed system behavior.
  • Expert with Kubernetes and cloud-native architecture patterns.
  • Demonstrated ability to influence technical direction without direct authority and mentor senior engineers.

The Company
Framingham, MA
790 Employees
Year Founded: 1998

What We Do

Learn how Lytx video telematics can help you improve safety, efficiency, and DOT compliance in your fleet. Start improving your fleet operations today.
