The AI Ops Technical Leader drives the transformation of operations support through hands-on technical delivery and team leadership, focusing on AI and data solutions that improve operational metrics such as MTTR, MTTD, and availability.
As AI Ops Technical Leader, you drive the intelligent transformation of operations support. This player-coach role combines hands-on technical delivery, team leadership, and AI architecture governance to achieve operational excellence. You apply deep technical expertise and strategic leadership to design, build, and evolve AI and data solutions that improve incident management, major incident response, problem management, change enablement, service desk support, observability, and overall operational resilience.
What You'll Be Doing:
Hands-on Data & AI Solutions for Operations Support
· Lead and contribute to high-impact data and AI initiatives that improve operations support outcomes, including real-time incident enrichment, automated root‑cause analysis, predictive alerting, ticket clustering and auto-triage, change risk scoring, knowledge mining, and intelligent runbooks.
· Design and deliver scalable AI-enabled features embedded into operations support platforms such as ServiceNow, Jira Service Management, monitoring/observability tools, and ITSM systems.
· Ensure all solutions meet strict operational SLAs for reliability, low latency, auditability, explainability, and zero-downtime deployment.
· Stay up to date with emerging AIOps tools, research, and trends, and apply them to enhance operations support.
AIOps Tools & Platform Leadership
· Lead the architecture, development, and continuous improvement of internal AIOps platforms and reusable components supporting operations teams.
· Integrate AIOps tools with ITSM systems, observability platforms (Prometheus, Grafana, ELK, Dynatrace, Splunk), ticketing systems, and automation frameworks.
· Apply MLOps/AIOps best practices tailored to production environments: model monitoring, drift detection, automated rollback, performance checks, and cost optimization at scale.
AI Technical Leadership for Operations Support Initiatives
· Serve as the principal AI technical authority for operations support transformation programs across service operations, NOC, support desks, infrastructure operations, and reliability engineering.
· Lead technical discussions, architecture reviews, proof of concepts, vendor evaluations, and solution selection involving AI for operations.
· Identify, prioritize, and drive high‑value AI use cases focused on reducing MTTR/MTTD, automating L1 triage, predicting major incidents, generating post‑mortems, optimizing shift handovers, and enabling proactive operations.
Team & People Leadership
· Build, mentor, and lead a high-performing squad of AIOps specialists focused on measurable operations support improvements.
· Foster a culture of experimentation, production‑first thinking, and commitment to operational impact: reduced toil, faster resolution, and higher availability.
· Provide technical coaching, conduct design/code reviews, and guide career development with emphasis on operations and support domain expertise.
Stakeholder & Cross-Functional Collaboration
· Work closely with operations support leaders, incident managers, service owners, reliability engineers, ITSM teams, infrastructure groups, and other stakeholders to align AI solutions with operational needs.
· Collaborate deeply with DS&AI Competency teams to ensure high-quality, scalable, and sustainable AI delivery.
What We’re Looking For:
· Strong background in data engineering, AI/ML, or operations support technology, including technical leadership in operations, IT, or service environments.
· Proven track record delivering production AI/ML/data solutions that improve MTTR, MTTD, availability, and ticket deflection.
· Hands-on expertise with Python, Spark, Kafka, Airflow, cloud data platforms, PyTorch/TensorFlow, LLMs, and integrations with tools like ServiceNow, PagerDuty, Splunk, Datadog, Moogsoft, BigPanda, Databricks, and Azure/ADF.
· Deep knowledge of AIOps practices including event correlation, anomaly detection, predictive analytics, automated actions, and GenAI for operations.
· Experience designing, building, or enhancing AIOps and internal tooling platforms.
· Familiarity with ITIL processes (incident, problem, change, service request, knowledge management).
· Experience with GenAI/LLM applications for operations such as copilots, auto-remediation, knowledge search, and alert/incident summarization.
· Proven ability to scale AIOps in large operations or NOC environments while balancing hands-on work with strategy.
· Strong communication skills, with the ability to translate complex AI concepts for operations teams and executives and to keep the focus on action and automation that reduce operational toil.