How These Teams Tackle Speedy, Safe and Effective AI Releases

Employees from Celonis, Inato, Bringg and elsewhere share the processes and metrics that power their approach to AI releases.

Written by Olivia McClure
Published on Jan. 22, 2026
Reviewed by Justine Sullivan | Jan 23, 2026

Want AI innovation to work? Make sure ideas meet reality — faster. 

That’s the approach delivery management platform provider Bringg takes to AI releases, according to Head of AI and Data Eden Hayik. 

“At Bringg, our rule for fast and safe AI releases is simple: Speed creates clarity,” he said.

Hayik shared that, because planning and execution decisions directly affect fulfillment, delays and customer satisfaction, his team keeps its stack as effective as possible by prioritizing key metrics. For instance, when working with agentic systems, they measure both error rate and the softer behavioral signals that are essential to trust and usability.

Celonis, which specializes in helping businesses unlock and capitalize on hidden value through process mining, also takes a multi-faceted approach to speedy and effective AI releases. To ensure AI production goes smoothly, Tech Lead Brian Oppenheim said the company recently created a “Reliability and Developer Experiences” org, bringing together various tech teams to coordinate efforts to standardize and streamline releases, testing, monitoring and alerting.

And these efforts are already yielding impactful solutions. According to Oppenheim, the company recently built and launched an internal tool for building, managing and initiating the deployment of releases.

“With our new offering, we define the release process in one place, ensuring the steps are consistently run across all services and that artifacts are named using the same patterns,” he said.

For Hayik, Oppenheim and employees from Inato, Nisos and Analytics8, the key to speedy, safe and effective AI releases comes down to having the right processes and metrics in place. Read on to see what each one had to say about how their teams approach AI production and recent automations that have impacted their businesses. 


Brian Oppenheim
Tech Lead, Developer Tools • Celonis

Celonis leverages process mining and AI to create a digital twin, or dynamic digital model, of an organization’s end-to-end processes, providing a common understanding of how the business operates as well as where hidden value can be unlocked and how to capitalize on this value. 

 

What’s your rule for fast, safe releases — and what KPI proves it works?

Releases need to be easy to run, well-tested (with as much of the testing automated as possible), well-monitored, well-communicated and frequent. More frequent releases push teams to adopt more streamlined and automated release processes, ensure everyone is well-versed in executing the release process, speed up the process of making breaking changes that require multiple releases, and minimize the number of cherry-picks.

To measure ourselves, we look primarily at the DORA (DevOps Research and Assessment) metrics of release frequency and failure rate. While there are plenty of other metrics we could look at, these two alone are well-correlated with good release outcomes. Release frequency provides a signal of the developer experience of releases, while failure rate provides a check to ensure users don’t take a backseat to developer convenience.
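For readers who want a concrete picture of those two numbers, here is a minimal, illustrative sketch of how release frequency and failure rate could be computed from a log of release records. The record shape and values are hypothetical; this is not Celonis’ tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Release:
    """One production release; `failed` marks releases that caused an incident or rollback."""
    shipped_on: date
    failed: bool


def release_frequency(releases: list[Release], days_in_period: int) -> float:
    """Average number of releases per day over the reporting period."""
    return len(releases) / days_in_period


def change_failure_rate(releases: list[Release]) -> float:
    """Share of releases that introduced a production issue."""
    if not releases:
        return 0.0
    return sum(r.failed for r in releases) / len(releases)


releases = [
    Release(date(2025, 11, 3), failed=False),
    Release(date(2025, 11, 4), failed=True),
    Release(date(2025, 11, 5), failed=False),
]
print(release_frequency(releases, days_in_period=30))  # 0.1 releases per day
print(change_failure_rate(releases))                   # ~0.33
```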

 

“Release frequency provides a signal of the developer experience of releases, while failure rate provides a check to ensure users don’t take a backseat to developer convenience.”

 

What standard or metric defines “quality” in your stack?

Celonis is increasingly becoming a platform that is deeply integrated in our customers’ daily operations. Because of this, it is critical that our service is highly available and processes their ingested data as quickly as possible. We measure this by looking at system availability and at the latency between when data becomes available from our customer and when their users can get value from it inside Celonis.

To ensure that these goals are strongly prioritized, we created a “Reliability and Developer Experiences” org in 2025 that brings together site reliability engineering, developer tools, application delivery, engineering operations and engineering productivity. RDE coordinates efforts across the company to raise the bar for the “always-on” experience our users rely on, standardizing and streamlining releases, testing, monitoring and alerting.

 

Name one AI/automation that shipped recently and its impact on the business.

This year, we built and launched an internal tool for building, managing and initiating the deployment of releases. Celonis manages most of our software releases via GitHub Actions. Prior to building our new tool, each team had its own workflows, naming conventions and release practices. While many of the workflows were copied from the same template, every team’s copy had diverged in different ways over time. With our new offering, we define the release process in one place, ensuring the steps are consistently run across all services and that artifacts are named using the same patterns. This central approach allows us to easily add steps to the process for things like security scanning, Federal Risk and Authorization Management Program compliance, and other central initiatives without needing to coordinate workflow updates with every development team. It also allows us to have a single data set on the health of our release process.

 

 

Marc Fricou
Engineering Manager • Inato

Inato’s platform connects community-based research sites with clinical trials across the globe. 

 

What’s your rule for fast, safe releases — and what KPI proves it works?

Our rule is simple: Make releasing so seamless that engineers focus on solving customer problems, not managing releases. We achieve this with a fully automated continuous integration/continuous delivery (CI/CD) pipeline combined with feature flags for safety.

The process is straightforward: Start a branch from main; implement changes behind a feature flag; run automated tests; optionally request reviews; and merge to main and auto-release to staging and production. 

This setup lets us release to production more than 10 times a day while maintaining confidence and stability. Our key KPIs are deployment frequency and the low rate of production incidents, proving that speed and quality can go hand in hand.
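As an illustration of the feature-flag pattern described above (not Inato’s actual implementation; the flag lookup and names below are hypothetical), new code can ship to production “dark” and only be exposed once its flag is switched on, decoupling releasing from launching.

```python
import os


def is_enabled(flag_name: str, default: bool = False) -> bool:
    """Toy flag lookup: read flags from environment variables.
    A real setup would query a flag service and support gradual, per-user rollouts."""
    value = os.getenv(f"FEATURE_{flag_name.upper()}", str(default))
    return value.lower() in {"1", "true", "yes"}


def render_dashboard() -> str:
    # The new code is already deployed; the flag controls exposure,
    # so the release itself stays low-risk and easy to roll back.
    if is_enabled("NEW_TRIAL_MATCHING_UI"):
        return "new dashboard"
    return "legacy dashboard"


print(render_dashboard())
```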

 

“Our key KPIs are deployment frequency and the low rate of production incidents, proving that speed and quality can go hand in hand.”

 

What standard or metric defines “quality” in your stack?

For us, the gold standard is the change failure rate, or the percentage of releases that introduce production issues. It shows whether we’re maintaining quality while moving fast.

We also monitor mean time to recovery because failures are inevitable, but how quickly and safely we recover defines our resilience. Every incident sparks a blameless post-mortem where we capture lessons and update our processes.

While we don’t track every metric in real time, our goal is clear: Release quickly, detect issues early, recover fast, and improve continuously.
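For illustration only, mean time to recovery reduces to averaging the gap between when an incident is detected and when service is restored. A minimal sketch with made-up timestamps:

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average duration from an incident being detected to service being restored."""
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


incidents = [
    (datetime(2025, 11, 4, 9, 15), datetime(2025, 11, 4, 9, 47)),
    (datetime(2025, 11, 12, 14, 2), datetime(2025, 11, 12, 15, 10)),
]
print(mean_time_to_recovery(incidents))  # 0:50:00
```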

 

Name one automation that shipped recently and its impact on the business.

At the beginning of this year, we adopted Axiom, a real-time analytics and observability tool similar to Grafana, to bring deep visibility into both front-end and back-end performance. We use Axiom to aggregate key performance metrics and traces, from largest contentful paint for page speed to query-level data like N+1 queries and query waterfalls. 

With dashboards to visualize these metrics, Axiom helps us: Detect bottlenecks early and uncover hidden performance issues; prioritize work based on data, not assumptions; and measure the real impact of every change we deploy. For example, Axiom showed us that some of our key pages had poor performance, which led us to prioritize a dedicated optimization project. As a result, those pages are now loading more than twice as fast, directly improving the user experience. Thanks to Axiom, performance improvements at Inato have gone from being reactive and ad-hoc to data-driven, measurable, and continuous.

 

 

Eden Hayik
Head of AI and Data  • Bringg

Bringg’s delivery management platform is designed to automate manual processes, optimize order delivery and create new customer experiences across complex last-mile operations.

 

What’s your rule for fast, safe releases — and what KPI proves it works?

AI innovation fails when ideas take too long to meet reality. At Bringg, our rule for fast and safe AI releases is simple: Speed creates clarity. In an environment shaped by dynamic demand, changing driver availability and real-time delivery commitments, we follow two core rules that allow us to innovate quickly while staying in control. 

The Core Rules Hayik’s Team Follows to Innovate Quickly and Effectively

  • Build, Learn, Adjust, and Fail Fast: AI cannot be tested like traditional code. Because AI systems are non-deterministic, real validation only happens when they are exposed to real delivery scenarios and customer workflows. We work in short cycles that move ideas quickly from concept into production-like reality. Features are treated as hypotheses and validated early, allowing us to surface edge cases, gather real feedback, and decide quickly whether to continue, pivot, or stop before risk and cost accumulate.
  • Speed Enables Safety: Fast iteration only works with strong observability. Agents operate with strict boundaries and no access to customer private information. Safety becomes a continuous feedback loop that scales with speed, enabling rapid innovation without losing control in mission-critical delivery operations.

What standard or metric defines “quality” in your stack?

Quality in our AI stack is defined by how reliably the system behaves in real last-mile delivery workflows and how well its performance matches user expectations in time-critical operations. At Bringg, where planning and execution decisions directly affect fulfillment, delays and customer satisfaction, quality starts with end-to-end workflow reliability. For agentic systems, we use error rate as a strict signal for functional failures and pair it with direct customer feedback to continuously tune behavior. Because agent behavior is non-deterministic, error rate alone is not enough, and softer behavioral signals are essential to ensure trust and usability. 
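To make the pairing of hard and soft signals concrete, here is a simplified, hypothetical sketch that scores agent runs on both an error rate and a behavioral acceptance rate. The field names are illustrative, not Bringg’s actual metrics.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    failed: bool              # hard signal: the workflow did not complete correctly
    accepted_unedited: bool   # soft signal: the user accepted the agent's output as-is


def score(runs: list[AgentRun]) -> dict[str, float]:
    """Report the hard error rate alongside a softer trust/usability signal."""
    n = len(runs)
    return {
        "error_rate": sum(r.failed for r in runs) / n,
        "acceptance_rate": sum(r.accepted_unedited for r in runs) / n,
    }


runs = [AgentRun(False, True), AgentRun(False, False), AgentRun(True, False)]
print(score(runs))  # {'error_rate': 0.33..., 'acceptance_rate': 0.33...}
```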

 

“Quality in our AI stack is defined by how reliably the system behaves in real last-mile delivery workflows and how well its performance matches user expectations in time-critical operations.”

 

Performance is equally critical. User-facing AI must be fast and responsive to support real-time decision-making. When agents require heavier computation or deeper reasoning, quality means respecting those constraints by running them in the background and surfacing results as actionable insights, rather than blocking operations. This balance ensures AI improves efficiency without disrupting mission-critical delivery flows.

 

Name one automation that shipped recently and its impact on the business.

One AI initiative we recently built is our Capacity Planning Agent. The agent is designed to help customers plan delivery capacity more effectively in environments where demand and resource availability change frequently, a problem that is often handled today through manual decisions or intuition. It provides short-term recommendations for the coming days and supports more strategic planning by highlighting how different choices could impact operational outcomes. 

Beyond its direct customer value, this project has had a significant internal impact. Building and validating this agent reshaped how we think about the potential of AI agents across the company, demonstrating how complex, judgment-based planning tasks can be supported by agents that analyze data, simulate outcomes, and surface actionable insights. As a result, it has become a reference point for identifying other domains where agentic systems can drive meaningful business value.

 

 


 

Eddie Jin
AI Engineer  • Nisos

Nisos’ intelligence-led platform and solutions help enterprises make critical decisions, manage human risk, and drive real-world consequences for digital threats.

 

What’s your rule for fast, safe releases — and what KPI proves it works?

My rule is to always ship versioned, reversible releases. Every release includes clearly versioned capabilities — not only models, but also prompts and test set changes — so we can reproduce and safely roll back if needed.

Before anything goes live, each release needs to be benchmarked offline using a representative “gold” dataset at a meaningful scale. This allows us to validate performance even when online traffic isn’t constant and, most importantly, enables fast iteration with controlled uncertainty — what we think of as fast, safe releases.

The KPI that proves this works is statistically significant improvement across the full evaluation set, not isolated wins. We look for aggregate gains that demonstrate generalization, ensuring we’re improving real-world performance rather than overfitting to a narrow slice of data.
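One standard way to check that an aggregate gain on a gold set is more than noise is a paired bootstrap over per-example scores. The sketch below illustrates that general technique with made-up scores; it is not necessarily the exact test Nisos runs.

```python
import random


def paired_bootstrap_no_gain_rate(old_scores: list[float], new_scores: list[float],
                                  n_resamples: int = 10_000, seed: int = 0) -> float:
    """Fraction of bootstrap resamples of the gold set in which the new
    release shows no aggregate gain over the old one (lower is better)."""
    assert len(old_scores) == len(new_scores)
    rng = random.Random(seed)
    deltas = [n - o for o, n in zip(old_scores, new_scores)]
    no_gain = 0
    for _ in range(n_resamples):
        sample = [rng.choice(deltas) for _ in deltas]
        if sum(sample) <= 0:  # this resample shows no improvement
            no_gain += 1
    return no_gain / n_resamples


# 1.0 = correct on that gold example, 0.0 = incorrect (made-up scores)
old = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
new = [1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
print(paired_bootstrap_no_gain_rate(old, new))  # lower values mean the gain is less likely noise
```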

 

“We look for aggregate gains that demonstrate generalization, ensuring we’re improving real-world performance rather than overfitting to a narrow slice of data.”

 

What standard or metric defines “quality” in your stack?

In our stack, quality is defined across multiple layers of the system. At the output level, we evaluate precision, recall, relevance (semantic similarity), and groundedness, ensuring responses are accurate, properly referenced and have low false-positive and false-negative rates. 

However, for agentic and tool-using LLM systems, output quality alone is insufficient. We also measure intermediate system behavior, including tool selection accuracy, step-level failure rates and recovery paths. This allows us to assess whether the system is reasoning correctly, using the right tools at the right time, and failing safely when uncertainty is high. Strong explainability and observability — via traces, logs and structured signals — are essential to making quality measurable, debuggable and trustworthy at scale.
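As a rough illustration of scoring intermediate behavior from structured traces, the sketch below computes tool-selection accuracy and step-level failure rate over a hypothetical trace schema (not Nisos’ actual format).

```python
from dataclasses import dataclass


@dataclass
class Step:
    expected_tool: str   # tool a reference annotation says should be called
    chosen_tool: str     # tool the agent actually called
    failed: bool         # the step errored or returned unusable output


def tool_selection_accuracy(steps: list[Step]) -> float:
    """Share of steps where the agent picked the right tool at the right time."""
    return sum(s.chosen_tool == s.expected_tool for s in steps) / len(steps)


def step_failure_rate(steps: list[Step]) -> float:
    """Share of intermediate steps that failed outright."""
    return sum(s.failed for s in steps) / len(steps)


trace = [
    Step("web_search", "web_search", failed=False),
    Step("entity_lookup", "web_search", failed=False),
    Step("entity_lookup", "entity_lookup", failed=True),
]
print(tool_selection_accuracy(trace))  # ~0.67
print(step_failure_rate(trace))        # ~0.33
```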

 

Name one automation that shipped recently and its impact on your team.

One impactful release was deep social-media-enhanced pivoting, designed to identify personally identifiable information and account exposure by analyzing subjects’ social presences across platforms, in addition to traditional open-source intelligence platforms available on the market.

This automation helps close part of the gap between machine accuracy and human investigative expertise. Tasks that once took analysts hours — or even days — of manually reviewing posts, profiles and interactions can now be performed at scale and speed. We’re still on the journey of teaching machines to reason the way human experts do, but the impact is already clear: We can improve investigative efficiency, operate faster at scale, and also reduce tedious, repetitive work, allowing analysts to focus on deeper, higher-value investigative challenges.

 

 

Chris Domain
Managing Consultant  • Analytics8

Analytics8 is a data and analytics consultancy that specializes in a wide range of services, from AI analytics and cloud migration to data governance and data architecture. 

 

What’s your rule for fast, safe releases — and what KPI proves it works?

I always compare an agent to an intern when talking to a client about using AI in production. It can take a lot of work off of your plate and even surprise you with how well it handles something complex, but you must validate everything before approving output. 

I use Visual Studio Code to review changes made by the agent directly inside edited files. The agent’s updates appear inline, so I can keep, change or discard each one. It’s like reviewing a pull request with each prompt. Validating one piece of work at a time keeps accuracy high.

Model Context Protocol (MCP) tools add a layer of safety by giving agents clearly defined tools for running specific actions. This cuts down unpredictable behavior. Sometimes, however, the agent skips the tool, and I need to adjust the prompt to ensure it uses the tool to complete the requested action.

What proves it works is time to value. For large tasks, it’s far quicker for engineers to review migrated SQL that the agent creates and documents than if they built it manually. Additionally, since adding MCP tools to our workflow, I’ve seen a drop in rework and re-prompting. We’re able to maintain accelerated AI speed without sacrificing quality.

 

“Since adding MCP tools to our workflow, I’ve seen a drop in rework and re-prompting.”

 

What standard or metric defines “quality” in your stack?

Quality in my stack means the agent understands my environment and builds accurate work without forcing me into repeated re-prompts or rework. Quality also means multiple engineers get consistent results when they ask the agent for help. Hallucinations and invented logic don’t meet the standard, so I actively reduce them. 

Domain’s Steps For Maintaining Quality in His Team’s Stack

  • “First, I give the agent detailed context Markdown files. When it misreads something, I update the docs to explain acronyms, define business workflow limits, and prevent mixing legacy enterprise data warehouse logic with logic from our new warehouse implementation.” 
  • “Second, I train engineers to write exact prompts. They tell the agent where to build, what folders to check for legacy logic, and what context to follow to reduce hallucinations.”
  • “Third, I extend using quality MCP tools. I push the agent to use them when needed because they produce steady, repeatable results, remove random behavior, and give every engineer the same reliable output.”

Name one AI/automation that shipped recently and its impact on your team.

During a recent onsite hackathon focused on clearing our toughest backlog items, we built and deployed an MCP-powered agent inside Visual Studio Code. It automates one of the most time-consuming parts of a modernization project: analyzing legacy transformation logic and generating initial dbt models. 

The agent connects to Snowflake through managed MCP servers and to Collibra metadata through a custom MCP server we built. It pulls metadata from Collibra, interprets legacy logic, and then uses dbt and Snowflake servers to recreate that logic, declare sources, build staging layers, draft first-pass intermediate models, create tests, and generate documentation.

The process used to take days of manual untangling. Now, with a few sharp prompts, the agent delivers solid first passes in minutes. In the hackathon alone, we completed what would normally be a two-week sprint in just a couple of days. The automation removed most of the repetitive translation work, produced surprisingly strong early versions of complex models, accelerated our client’s modernization timeline, and allowed them to focus on higher-value warehouse design and governance decisions instead of reviewing logic by hand.
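For readers unfamiliar with custom MCP servers, here is a minimal sketch of one that exposes legacy-logic metadata as a tool, assuming the FastMCP helper from the MCP Python SDK. The Collibra lookup is stubbed with canned data, and every name is illustrative rather than Analytics8’s actual implementation.

```python
# Minimal custom MCP server sketch, assuming the MCP Python SDK ("pip install mcp").
# The catalog lookup is stubbed; a real server would call the metadata platform's API.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("legacy-metadata")

# Stand-in for metadata that would really be fetched from a catalog such as Collibra.
_FAKE_CATALOG = {
    "orders_fact": "Legacy ETL: joins ORDERS to CUSTOMERS and filters cancelled rows.",
}


@mcp.tool()
def describe_legacy_table(table_name: str) -> str:
    """Return the documented legacy transformation logic for a table."""
    return _FAKE_CATALOG.get(table_name, "No metadata found for this table.")


if __name__ == "__main__":
    mcp.run()  # serve over stdio so an editor-based agent can call the tool
```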

 

 

Responses have been edited for length and clarity. Images provided by Shutterstock and listed companies.