Bits AI Is Datadog’s New DevOps Copilot

Software Engineer Aurora Dai explains her role in building Datadog’s AI-powered observability product and the impact it will have on customers.

Written by Mia Goulart
Published on May 2, 2024
Datadog team having a discussion in the NYC HQ office.
Image: Datadog

Essential business infrastructure generates data from numerous sources, but extracting actionable insights from this data can be difficult — and time-consuming. That’s why Datadog Software Engineer Aurora Dai and her team created Bits AI, an AI-driven DevOps copilot that helps clients quickly investigate and tackle incidents across the company’s platforms.

 

HOW BITS AI WORKS

Imagine a client receives an alert in the middle of the night about high error rates for "user service." Opening the Datadog mobile app, they might ask Bits AI, "What's going on with user service?" Bits AI quickly identifies an ongoing incident with the "database service," along with other key insights such as recent deployments and relevant alerts.

With this insight, the client can:

  • Quickly find the incident affecting their service
  • View other affected services
  • Look for other related logs and metrics

 

Thanks to Bits AI, the client can efficiently triage the issue without switching between multiple tools — and they're still able to catch some z's.

 

 

“We have already received a lot of positive feedback,” Dai said. “Our goal is to build an incident responder that works with clients, just as a DevOps engineer would.”

Built In sat down with Dai to learn more about the process of creating Bits AI and the impact the tool will have on customers in the months, and years, ahead. 

 

 

Aurora Dai
Software Engineer II • Datadog

 

What is a product your team is excited about? 

We’re excited about the launch of Bits AI, a new Gen AI-powered DevOps copilot built by engineers, for engineers. Bits AI helps customers query, explain and understand their Datadog systems. It comes with a wide feature set including incident management assistance and natural language querying.

 

What gave rise to Bits AI, and what impact will this product launch have on the business and its customers?

When large language models started becoming powerful, we realized that they were quickly emerging as a tool we should be leveraging for our customers — especially since many of our customers work with extremely complex infrastructures. We assembled a squad of engineers from across the company to quickly build and ship a product that we would be proud to launch at the company’s annual user conference, DASH. Since its launch in August, we’ve been expanding and deepening our feature set by working directly with customers to gather their feedback and iterate on the foundation we built. 

We hope to use this as a starting point to show customers the value of the product. This is also our first step in building an assistant that helps manage and remediate incidents faster. 

 

Datadog launches Bits AI at their annual user conference, DASH.
Datadog

 

What role did you play in developing and launching the product? What tools or technologies did your team use to build the product and why?

From the beginning, I have been one of the core engineers on the project. I work on every aspect of the product including the customer-facing interfaces, the backend services, the underlying databases and everything in between. 

Uniquely, all of Bits AI is backed by an LLM. The tool was built to be model agnostic from the very start so we can support our customers using any model. We also employ many techniques to ensure the LLM returns high-quality results. For example, we use retrieval-augmented generation heavily for natural language querying, augmenting the LLM’s results with past and popular queries. Naturally, we’re also heavy users of vector databases, which have been an interesting aspect of this project to learn more about.
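For readers unfamiliar with the technique, here is a minimal sketch of retrieval-augmented generation for natural language querying. It is not Datadog's implementation: the stored queries, the query strings, the function names and the toy bag-of-words "embedding" are all assumptions made for illustration. A production system would likely use a learned embedding model and a vector database in place of the in-memory list. The model is passed in as a plain callable, which is one simple way to stay model agnostic.

```python
import math
import re
from collections import Counter

# Hypothetical store of past queries: (question, query string, popularity).
# The query strings are illustrative placeholders, not verified syntax.
PAST_QUERIES = [
    ("show error rate for user service", "service:user-service status:error", 42),
    ("latency of checkout service", "service:checkout @duration:>1s", 17),
    ("errors in database service last hour", "service:database status:error", 30),
]


def embed(text):
    """Toy embedding: word counts. Stands in for a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, k=2):
    """Return the k most similar past queries, breaking ties by popularity."""
    q = embed(question)
    ranked = sorted(
        PAST_QUERIES,
        key=lambda row: (cosine(q, embed(row[0])), row[2]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, k=2):
    """Augment the prompt with retrieved examples before calling the model."""
    lines = ["Translate the question into a search query.", ""]
    for past_question, past_query, _ in retrieve(question, k):
        lines.append(f"Q: {past_question}")
        lines.append(f"A: {past_query}")
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)


def answer(question, complete):
    """`complete` is any callable mapping a prompt string to a completion."""
    return complete(build_prompt(question))
```

Calling `answer("what is the error rate for user service", my_model)` with any prompt-to-text function as `my_model` sends the model a prompt that already contains the closest past queries as worked examples, so its output is steered by queries that were useful before.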

What truly sets Bits AI apart is how everything fits together, because Bits AI is powered by the underlying data that our customers trust us to manage. This means Bits AI understands what a client’s services are and how they affect one another, and it can help wade through that complexity to serve clients the data they need.

 

What obstacles did you encounter along the way? How did you successfully overcome them? 

Because the team and technology are so new, everything we do is greenfield. That is both a benefit and a drawback: we get to define where we want the product to go and how we build it. We haven’t built everything perfectly. We made tradeoffs while we were learning, and we’re always cleaning up after ourselves, but it’s nice to be able to build how and what we want.

Since the product is so new, we have the unique opportunity to define what to build. However, this also means that product requirements tend to be less defined than those of a more mature product, which is compounded by the fact that we are working with a burgeoning technology. I find that very motivating, but it can also be challenging.

Additionally, Bits AI has a unique advantage because everyone is excited by the possibilities of what LLMs provide. For such a nascent product, we’ve already seen positive results, which is the greatest motivator of all. Working on a team with such intelligent, dedicated and passionate teammates helps, of course. 

 

What teams did you collaborate with to get this across the finish line? What strategies did you employ to ensure that cross-functional collaboration went smoothly?

Because Bits AI spans such a large swath of Datadog, we partnered with engineers from across the company. We integrated with a huge breadth of products, including logs, RUM and incident management, to name a few. The first iteration of Bits AI was a companywide effort, but since then we have relied on embeds and cross-team efforts, as well as an overall collaborative environment. 

 

“Collaboration is the reason for the tool’s effectiveness.”

 

More notably, though, we have a deep relationship with data science. We have been operating in lockstep from day one, which lets us understand the problems that team is facing and lets them see the direct impact their work has on customers. This deep collaboration is the reason for the tool’s effectiveness.

 

 

Responses have been edited for length and clarity. Images provided by Datadog.