Technical Leadership:
- Design and implement comprehensive monitoring, alerting, and observability solutions, leveraging AI for intelligent anomaly detection and root cause analysis
- Lead incident response efforts using AI-assisted diagnostics and automated remediation, conduct post-mortems, and drive systemic improvements
- Develop and maintain service level objectives (SLOs) and error budgets with AI-powered predictive analytics to forecast reliability risks
- Architect and implement intelligent automation solutions for deployment, scaling, and infrastructure management using machine learning models
- Drive capacity planning and performance optimization using AI forecasting models and predictive analytics
AI-Enhanced SRE Leadership:
- Implement and maintain AI-powered incident prediction and prevention systems
- Design intelligent alerting systems that reduce noise and provide contextual insights using natural language processing and machine learning
- Develop AI-driven capacity planning models that predict resource needs and optimize cost efficiency
- Build and maintain chatbots and AI assistants for operational tasks, documentation search, and incident triage
- Implement automated root cause analysis using AI correlation engines and log analysis
Team Leadership & Collaboration:
- Mentor junior SREs on integrating AI tools and practices into traditional SRE workflows
- Partner with engineering teams to embed AI-enhanced reliability principles into the software development lifecycle
- Lead cross-functional initiatives to implement AI-driven operational improvements
- Collaborate with data science teams to develop custom AI models for operational use cases
- Participate in on-call rotations while developing AI systems to minimize toil and improve response efficiency
Strategic Initiatives:
- Develop and execute an SRE roadmap aligned with business objectives and technological advancement
- Evaluate and implement new AI tools and technologies to improve system reliability, security, and operational efficiency
- Drive adoption of AI-powered engineering and predictive failure testing
- Establish metrics and reporting using AI analytics to demonstrate the business value of intelligent reliability investments
Required Qualifications:
- 6+ years of experience in Site Reliability Engineering, DevOps, or similar infrastructure-focused roles
- 2+ years of experience leading technical teams or initiatives
- Strong experience with AI/ML tools and frameworks applied to operational use cases (anomaly detection, predictive analytics, NLP)
- Hands-on experience implementing AI-powered monitoring, alerting, and automation solutions
- Strong programming skills in Python with experience in AI/ML libraries
- Extensive experience with cloud platforms (AWS, GCP) and their AI/ML services
- Knowledge of prompt engineering, LLM integration, and building AI-powered operational tools
- Proficiency with infrastructure as code and configuration management with AI-enhanced workflows
- Experience with time series analysis, statistical modeling, and predictive analytics for infrastructure metrics
- Deep understanding of monitoring and observability tools enhanced with AI capabilities
- Experience with CI/CD pipelines incorporating AI-driven quality gates and automated decision making
- Strong knowledge of networking, distributed systems, and database technologies
- Expert-level knowledge in the following domains: AWS (core services, networking, compute, databases, storage, etc.), Terraform, Kubernetes / Karpenter / Helm
- Strong experience building in-house observability platforms, including: OpenTelemetry, Loki, Grafana, Prometheus, AWS CloudWatch, AWS X-Ray or Jaeger
- Experience with Argo CD / Argo Workflows is a big plus
- Bachelor’s degree in Computer Science, Engineering, Data Science, or equivalent practical experience
Preferred Qualifications:
- Advanced experience with large language models (LLMs) for operational documentation, code generation, and incident response
- Experience with automated incident response systems using AI decision engines
- Experience with microservices architecture and intelligent service mesh management
- Familiarity with AI-powered security tools and anomaly detection for infrastructure protection
- Experience building and maintaining AI-driven dashboards and reporting systems
- Experience with AI-powered cost optimization and resource right-sizing tools
- Certification in relevant cloud platforms
What We Do
DISQO’s mission is to build the world’s most trusted ad measurement platform that fuels brand growth. The world’s largest brands, agencies, and media companies trust DISQO for expert insight and AI-driven intelligence about their advertising performance across all platforms. We capture people’s sentiments and journeys, connecting them with the brands they value and the media they consume. With this identity-based approach, brands gain more accurate and authentic insight so they can create more meaningful interactions.
Founded in 2015 and headquartered in Los Angeles, DISQO is recognized as a hyper-growth tech startup and one of the best places to work in the US, with more than 270 team members globally. Follow @DISQO on LinkedIn and Twitter/X.
Why Work With Us
At DISQO, we don’t just hire talent—we champion it. We unlock potential, fuel growth, and raise the bar. Our culture thrives on curiosity, creativity, and courage. Respect is non-negotiable, collaboration is instinctive, and impact is expected. Here, you grow, lead, and redefine what’s possible.
DISQO Offices
Hybrid Workspace
Employees engage in a combination of remote and on-site work.
In 2023, we implemented a structured hybrid model for employees who live within 50 miles of any of our physical offices (Glendale, CA; New York, NY; Yerevan, Armenia). All other employees are encouraged to visit the offices.