DISQO

Lead Site Reliability Engineer

Los Angeles, CA

Hybrid

185K-195K Annually

Senior level

AdTech • Big Data • Cloud • Marketing Tech • Software • Analytics

DISQO is an audience insights platform where members, real people, share information that improves human experience.

The Role

Lead the Site Reliability Engineering initiatives at DISQO, focusing on reliability, scalability, and performance through AI-driven solutions and automation, while mentoring team members and collaborating with various teams.

When you join DISQO Nation, you join a community that values trust, transparency and innovation. We invest in our employees and apply a bottom-up management approach, rooted in the concept of servant leadership. We approach each day eager to learn, grow, and make a lasting impact. Best of all, we have fun while doing it!

About the Role:

We are seeking an experienced Lead Site Reliability Engineer to join our engineering team and drive the reliability, scalability, and performance of our production systems through innovative use of AI and automation. In this role, you will lead SRE initiatives, mentor team members, and leverage AI technologies to enhance operational excellence, predictive maintenance, and intelligent automation across our infrastructure.

Key Responsibilities:

Technical Leadership:

Design and implement comprehensive monitoring, alerting, and observability solutions, leveraging AI for intelligent anomaly detection and root cause analysis
Lead incident response efforts using AI-assisted diagnostics and automated remediation, conduct post-mortems, and drive systemic improvements
Develop and maintain service level objectives (SLOs) and error budgets with AI-powered predictive analytics to forecast reliability risks
Architect and implement intelligent automation solutions for deployment, scaling, and infrastructure management using machine learning models
Drive capacity planning and performance optimization using AI forecasting models and predictive analytics

AI-Enhanced SRE Leadership:

Implement and maintain AI-powered incident prediction and prevention systems
Design intelligent alerting systems that reduce noise and provide contextual insights using natural language processing and machine learning
Develop AI-driven capacity planning models that predict resource needs and optimize cost efficiency
Build and maintain chatbots and AI assistants for operational tasks, documentation search, and incident triage
Implement automated root cause analysis using AI correlation engines and log analysis

Team Leadership & Collaboration:

Mentor junior SREs on integrating AI tools and practices into traditional SRE workflows
Partner with engineering teams to embed AI-enhanced reliability principles into the software development lifecycle
Lead cross-functional initiatives to implement AI-driven operational improvements
Collaborate with data science teams to develop custom AI models for operational use cases
Participate in on-call rotations while developing AI systems to minimize toil and improve response efficiency

Strategic Initiatives:

Develop and execute an SRE roadmap aligned with business objectives and technological advancement
Evaluate and implement new AI tools and technologies to improve system reliability, security and operational efficiency
Drive adoption of AI-powered engineering and predictive failure testing
Establish metrics and reporting using AI analytics to demonstrate the business value of intelligent reliability investments

Required Qualifications:

6+ years of experience in Site Reliability Engineering, DevOps, or similar infrastructure-focused roles
2+ years of experience leading technical teams or initiatives
Strong experience with AI/ML tools and frameworks applied to operational use cases (anomaly detection, predictive analytics, NLP)
Hands-on experience implementing AI-powered monitoring, alerting, and automation solutions
Strong programming skills in Python with experience in AI/ML libraries
Extensive experience with cloud platforms (AWS, GCP) and their AI/ML services
Knowledge of prompt engineering, LLM integration, and building AI-powered operational tools
Proficiency with infrastructure as code and configuration management with AI-enhanced workflows
Experience with time series analysis, statistical modeling, and predictive analytics for infrastructure metrics
Deep understanding of monitoring and observability tools enhanced with AI capabilities
Experience with CI/CD pipelines incorporating AI-driven quality gates and automated decision making
Strong knowledge of networking, distributed systems, and database technologies
Expert-level knowledge in the following domains: AWS (core services, networking, compute, databases, storage, etc.), Terraform, Kubernetes / Karpenter / Helm
Strong experience building in-house observability platforms, including: OpenTelemetry, Loki, Grafana, Prometheus, AWS CloudWatch, AWS X-Ray or Jaeger
Experience with Argo CD / Argo Workflows is a big plus
Bachelor’s degree in Computer Science, Engineering, Data Science, or equivalent practical experience

Preferred Qualifications:

Advanced experience with large language models (LLMs) for operational documentation, code generation, and incident response
Experience with automated incident response systems using AI decision engines
Experience with microservices architecture and intelligent service mesh management
Familiarity with AI-powered security tools and anomaly detection for infrastructure protection
Experience building and maintaining AI-driven dashboards and reporting systems
Experience with AI-powered cost optimization and resource right-sizing tools
Certification in relevant cloud platforms

At DISQO, we pride ourselves on having a positive, performance-oriented workplace that includes a flexible hybrid approach, competitive medical benefits, and an amazing vacation policy. Read more about our culture on Glassdoor.

You can learn more about what’s happening at DISQO by visiting the DISQO Developer Blog or the DISQO Company Blog.

Perks & Benefits:

·100% covered Medical/Dental/Vision for employee, competitive dependent coverage

·Equity

·401K

·Generous PTO policy

·Flexible workplace policy

·Team offsites, social events & happy hours

·Life Insurance

·Health FSA

·Commuter FSA (for hybrid employees)

·Catered lunch and fully stocked kitchen

·Paid Maternity/Paternity leave

·Disability Insurance

·Travel Assistance Program

·24/7 Counseling Services offered to Employees

Note: The benefits noted above are for full-time, US-based employees only.

DISQO is an equal opportunity employer. Discovery, innovation, and growth are possible when we open ourselves to new possibilities, perspectives, and approaches. That’s why, at DISQO, we welcome, support, and empower individuals from diverse backgrounds. Exceptional teams are rooted in extraordinary people, each with a unique story and a compelling set of skills. DISQO does not discriminate against employees based on race, color, religion, sex, national origin, gender identity or expression, age, disability, pregnancy (including childbirth, breastfeeding, or related medical condition), genetic information, protected military or veteran status, sexual orientation, or any other characteristic protected by applicable federal, state or local laws.

*Recruiting firms that submit resumes to DISQO without first entering into a written contract will not be entitled to any compensation on candidates referred by that firm.

Top Skills

Argo CD

Argo Workflows

AWS

AWS CloudWatch

AWS X-Ray

GCP

Grafana

Jaeger

Kubernetes

Loki

OpenTelemetry

Prometheus

Python

Terraform

The Company

HQ: Glendale, CA

272 Employees

Year Founded: 2015

What We Do

DISQO’s mission is to build the world’s most trusted ad measurement platform that fuels brand growth. The world’s largest brands, agencies, and media companies trust DISQO for expert insight and AI-driven intelligence about their advertising performance across all platforms. We capture people’s sentiments and journeys, connecting them with the brands they value and the media they consume. With this identity-based approach, brands gain more accurate and authentic insight so they can create more meaningful interactions.

Founded in 2015 and headquartered in Los Angeles, DISQO is recognized as a hyper-growth tech startup and one of the best places to work in the US, with more than 270 team members globally. Follow @DISQO on LinkedIn and Twitter/X.

Why Work With Us

At DISQO, we don’t just hire talent—we champion it. We unlock potential, fuel growth, and raise the bar. Our culture thrives on curiosity, creativity, and courage. Respect is non-negotiable, collaboration is instinctive, and impact is expected. Here, you grow, lead, and redefine what’s possible.

Hybrid Workspace

Employees engage in a combination of remote and on-site work.

In 2023, we implemented a structured hybrid model for employees who live within 50 miles of any of our physical offices (Glendale, CA; New York, NY; Yerevan, Armenia). All other employees are encouraged to visit offices.

Typical time on-site: Flexible
