Kyndryl

AgentOps Engineer - Observability

4 Locations

In-Office

Senior level

Cloud • Information Technology • Consulting

Who We Are

At Kyndryl, we design, build, manage and modernize the mission-critical technology systems that the world depends on every day. So why work at Kyndryl? We are always moving forward – always pushing ourselves to go further in our efforts to build a more equitable, inclusive world for our employees, our customers and our communities.

The Role

We’re looking for exceptional talent to join our AI Agentic Innovation Hub at Kyndryl!
The AI Agentic Innovation Hub stands as Kyndryl’s center of excellence for advanced and agentic artificial intelligence. Our mission is to lead the design and deployment of transformative AI solutions that bridge frontier research with real-world impact — scalable, secure, and driven by measurable value.
Built upon a team of exceptional talent and cutting-edge technology, the Hub embodies a spirit of bold innovation and disciplined execution — an elite unit within one of the world’s leading technology companies. With national reach and global ambition, we partner with major organizations to tackle their most complex challenges, pioneering the next generation of intelligent, autonomous, and trusted systems that redefine what AI can achieve.

Job Description

As a Senior Observability Engineer at Kyndryl’s AI Agentic Innovation Hub, you’ll be at the core of operational excellence for next-generation intelligent and agentic systems.
Your mission will be to design, implement, and maintain advanced observability and monitoring capabilities that ensure the reliability, traceability, and performance of AI agents and models in production.
You’ll help build the observability architecture for agentic intelligence — integrating tracing, logging, monitoring, and governance tools that provide a deep understanding of how agents perceive, reason, and act in complex environments.
Your work will enable early detection of anomalies, data drift, performance degradation, bias, or undesired agent behavior, ensuring compliance with the EU AI Act and Responsible AI principles.
If you’re passionate about bridging AI systems with operational intelligence, and about creating frameworks that make AI transparent, accountable, and trustworthy, this role offers a unique opportunity to shape the future of intelligent observability.

Your Mission

Design and implement the observability architecture for AI and agentic systems, enabling end-to-end visibility across models, agents, and data pipelines.

Develop instrumentation frameworks to collect and analyze technical, behavioral, and cognitive metrics for deployed AI systems.

Integrate and configure monitoring, tracing, and logging tools (Prometheus, Grafana, OpenTelemetry, ELK Stack, Datadog, etc.) to ensure full operational insight.

Build dashboards and alerting mechanisms to detect data drift, performance issues, hallucinations, or reasoning inconsistencies in LLMs and agents.

Collaborate with MLOps, Data, and Architecture teams to establish model lineage, drift detection, and governance pipelines.

Design and maintain custom metrics for model and agent reliability — precision, latency, cost, reasoning depth, autonomy, and consistency.

Contribute to the Responsible AI framework, ensuring transparency, fairness, and auditability in AI decision-making.

Continuously research and experiment with new observability tools and practices (AgentOps, LLMOps, RAG Observability).

Who You Are

Essential Qualifications

4+ years of professional experience, including at least 2 years in AI, MLOps, or distributed systems projects.

Proven experience designing and implementing monitoring, logging, and performance metrics for production systems.

Hands-on expertise with observability tools such as Prometheus, Grafana, OpenTelemetry, ELK Stack, Loki, Jaeger, or Datadog.

Experience instrumenting AI and ML pipelines, tracking inference latency, throughput, and cost metrics.

Familiarity with MLOps and LLMOps frameworks, including model traceability, drift detection, and prompt or reasoning tracing.

Knowledge of agentic frameworks (LangGraph, AutoGen, CrewAI, OpenDevin, Google ADK) and their monitoring needs.

Experience designing custom metrics for precision, reliability, error rate, and cognitive consistency.

Strong understanding of cloud-native architectures, containers, and IaC tools (Kubernetes, Docker, Helm, Terraform).

Awareness of AI compliance and governance requirements (EU AI Act, Responsible AI, decision traceability).

Education & Certifications

Bachelor’s degree in Computer Engineering, Software Engineering, Data Science, or related field.

Postgraduate or specialized training in MLOps, DevOps, Observability, or Artificial Intelligence is highly valued.

Certifications in Cloud Architecture, Monitoring, or AI Governance are a plus.

Continuous learning mindset and commitment to staying current with emerging AI observability frameworks.

Preferred Skills

Experience with model observability and data lineage systems.

Understanding of cognitive observability, including reasoning-chain or decision-path tracing in agents.

Familiarity with event-driven architectures and telemetry for real-time AI services.

Knowledge of FinOps metrics and cost optimization for AI workloads.

Experience developing custom dashboards or visualization plugins for monitoring complex systems.

Comfort working in hybrid or multi-cloud environments (Azure, AWS, GCP).

Strong interest in AI reliability engineering and the convergence of AI and DevOps practices.

Soft Skills

Analytical and systemic thinker, understanding the interplay between data, systems, and agent behavior.

Clear communicator, able to convey complex insights and performance findings to both technical and business audiences.

Quality- and reliability-driven, with a preventive mindset focused on operational resilience.

Collaborative and cross-functional, working seamlessly with AI, data, and compliance teams.

Curious and proactive, exploring emerging technologies and methods in AI observability and AgentOps.

Ethical and responsible, aware of the implications and accountability of automated decisions in production AI.

#AgenticAI

Being You

Diversity is a whole lot more than what we look like or where we come from; it’s how we think and who we are. We welcome people of all cultures, backgrounds, and experiences. But we’re not doing it single-handedly: our Kyndryl Inclusion Networks are only one of many ways we create a workplace where all Kyndryls can find and provide support and advice. This dedication to welcoming everyone into our company means that Kyndryl gives you, and everyone next to you, the ability to bring your whole self to work, individually and collectively, and support the activation of our equitable culture. That’s the Kyndryl Way.

What You Can Expect

With state-of-the-art resources and Fortune 100 clients, every day is an opportunity to innovate and to build new capabilities, new relationships, new processes, and new value. Kyndryl cares about your well-being and prides itself on offering benefits that give you choice, reflect the diversity of our employees, and support you and your family through the moments that matter, wherever you are in your life journey. Our employee learning programs give you access to the best learning in the industry to receive certifications, including Microsoft, Google, Amazon, Skillsoft, and many more. Through our company-wide volunteering and giving platform, you can donate, start fundraisers, volunteer, and search over 2 million non-profit organizations. At Kyndryl, we invest heavily in you; we want you to succeed so that together, we will all succeed.

Get Referred!

If you know someone who works at Kyndryl, select ‘Employee Referral’ when asked ‘How Did You Hear About Us’ during the application process, and enter your contact's Kyndryl email address.

Top Skills

Datadog

Docker

ELK Stack

Grafana

Helm

Kubernetes

OpenTelemetry

Prometheus

Terraform

The Company

HQ: New York City, NY

46,070 Employees

Year Founded: 2021

What We Do

We have the world’s best talent, who design, run, and manage the most advanced and reliable technology infrastructure every day. Together, we think holistically about the health of these vital technology ecosystems.

We are a focused, independent company that builds on our foundation of excellence by creating systems in new ways. Bringing in the right partners, investing in our business, and working side-by-side with our customers to unlock potential. We're raising the bar.

Our experience speaks for itself: We have 90,000 highly skilled employees around the world serving 75 of the Fortune 100. But our purpose is what drives us: Advancing the vital systems that power human progress. Because when a digital ecosystem is healthy, it can more readily adapt and support continuous growth and that opens up a world of possibility for everyone.