What makes agentic AI different from everything that came before it is the shift in who acts. We used to build tools that waited for instructions. Now we’re building agents that assess a goal, break it into steps and take action without asking permission.
Engineering teams are focused almost entirely on making these agents more capable. But from what I’ve observed across research programs, the projects that stall don’t do so because the model wasn’t good enough. They stall because users didn’t trust the system, didn’t understand what it had just done or felt they’d lost control of their work.
According to McKinsey’s 2025 Global Survey on AI, 62 percent of organizations are already experimenting with agents. But Gartner expects that more than 40 percent of those projects will get canceled by 2027. Those cancellations will be driven not by technical shortcomings, but by unclear business value. We have plenty of benchmarks for model accuracy but almost nothing measuring why real users abandon agents. That data gap is itself part of the problem UX research needs to solve.
4 User Research Techniques for Agentic AI Systems
- Discovery research for zero-to-one products.
- Research-led AI evals.
- Longitudinal diary studies.
- Co-design workshops.
The AI Delegation Dilemma
With traditional software, you are the operator. You click, you type, you direct every action. With an agent, however, you become the delegator. And delegating is a fundamentally different cognitive experience than doing.
You’re handing over agency in exchange for efficiency. In practice, people hold strong boundaries around what they’re comfortable delegating. Email is a useful example. Most users are happy to let an agent sort their inbox. Ask whether they’d be okay with the agent sending a reply on their behalf, though, and the response shifts entirely.
That threshold is different for every person, every task and every industry. UX research identifies where those boundaries sit so product teams aren't designing based on assumptions.
Designing Agents for Predictable Unpredictability
Agentic AI challenges a core UX principle: consistency. These systems are inherently non-deterministic. Give an agent the same task on Monday and again on Friday, and it may solve it differently each time. For practitioners trained on the expectation that identical inputs produce identical outputs, this variance introduces real design tension. NN/g’s 2025 research agenda for generative AI frames the question well: How do you evaluate a system that changes over time?
Since we can’t guarantee consistency of action, we need to guarantee consistency of intent. This reframing has practical implications for measurement. Instead of tracking whether the agent completed a task the same way twice, you study whether users believe the agent understood what they wanted. You can operationalize that measurement through post-task confidence ratings on goal alignment, think-aloud protocols where users narrate expected versus actual outcomes and comparison studies across different agent paths measuring perceived alignment. The metric becomes perceived alignment, not behavioral consistency.
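To make that concrete, here’s a minimal sketch of what scoring perceived alignment could look like in analysis code. The rating scale, field names and thresholds are illustrative, not a standard instrument; the point is that you aggregate participants’ goal-alignment ratings and flag the sharp drops, rather than diffing agent transcripts for word-for-word consistency.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical post-task rating captured after each agent session:
# participants rate "the agent understood what I wanted" on a 1-7 scale.
@dataclass
class PostTaskRating:
    participant: str
    task: str
    goal_alignment: int         # 1 (not at all) .. 7 (completely)
    expected_outcome_met: bool  # from think-aloud: did the result match what they predicted?

def perceived_alignment(ratings: list[PostTaskRating]) -> float:
    """Average goal-alignment score across sessions, normalized to 0-1."""
    return mean(r.goal_alignment for r in ratings) / 7

def alignment_breakpoints(ratings: list[PostTaskRating], drop: int = 2) -> list[str]:
    """Flag tasks where a participant's rating fell sharply from their previous task,
    the 'trust collapse' moments worth reviewing in session recordings."""
    by_participant: dict[str, list[PostTaskRating]] = {}
    for r in ratings:  # assumes ratings are logged in task order per participant
        by_participant.setdefault(r.participant, []).append(r)
    flagged = []
    for participant, rs in by_participant.items():
        for prev, curr in zip(rs, rs[1:]):
            if prev.goal_alignment - curr.goal_alignment >= drop:
                flagged.append(f"{participant}: {prev.task} -> {curr.task}")
    return flagged

ratings = [
    PostTaskRating("P1", "sort inbox", 6, True),
    PostTaskRating("P1", "draft reply", 3, False),  # sharp drop: inspect this session
    PostTaskRating("P2", "sort inbox", 5, True),
    PostTaskRating("P2", "draft reply", 5, True),
]
print(f"perceived alignment: {perceived_alignment(ratings):.2f}")
print("breakpoints:", alignment_breakpoints(ratings))
```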
In one evaluation, we gave 10 participants the same four tasks to complete with a voice AI agent and tracked how the agent’s tone, language and level of detail shifted across sessions. Some users received concise, direct responses, while others got longer, more conversational answers to the same question. In one case, the agent hallucinated a detail entirely, and the user’s trust dropped immediately both for that task and for every one that followed.
What we were measuring wasn’t whether the agent said the same words every time. It was whether users believed it understood their goal and how quickly that belief collapsed when it got something wrong.
From Interactions to Relationships
Most digital products are transactional. Search for a flight, book it, move on. Agentic AI is closer to onboarding a new team member. The agent adapts to your preferences. You learn its tendencies. The relationship evolves over weeks.
Traditional research methods weren't designed for that timeframe. Anyone who has studied automation in aviation will recognize what happens here: monitoring fatigue. The more reliable a system becomes, the worse humans get at catching its errors. Agentic AI introduces an additional layer. Unlike autopilot, agents don’t follow the same procedure every time. Users aren’t just monitoring; they’re supervising something unpredictable while deciding how much supervision it even needs.
In my research, I’ve observed participants toggle between being overly trusting and anxiously double-checking everything in the same session. Although you can’t blindly accept every output, if you’re reviewing every action the agent takes, it isn’t saving you time. The key research question is what kinds of transparency signals, such as citing sources or explaining reasoning, make users comfortable enough to reduce oversight. I’ve explored this question in depth with voice AI trust frameworks, and the patterns hold across modalities: context awareness, clear status indicators and graceful error recovery are what separate agents people use once from agents they rely on.
Research Methods for an Agentic World
Four approaches have consistently proven their value in this new era. A tension runs through all of them: agentic AI teams ship fast. The agent might get updated three times during a four-week study. Every method has to produce an actionable signal at speed.
Discovery Research for Zero-to-One Products
This happens before anyone writes code, and teams skip it most often. The question isn’t, “How should this agent work?” It’s, “Should this agent exist?” Who is the actual audience? What are they struggling with? Do they even want an AI handling this task? I’ve seen what happens when teams bypass this phase. One team built a capable agent for a workflow where users didn’t want autonomy at all; they wanted better tools. Contextual inquiry and concept validation interviews would have revealed this. Two weeks of research can save months of unnecessary engineering.
Research-Led AI Evals
Most evaluation frameworks test accuracy, asking whether the agent got the right answer. That question matters, but it’s a poor predictor of retention. Without human-centered criteria, agents can produce so-called agent slop: low-quality output generated at scale by systems that lack proper guardrails.
Research-led evals add the human dimension. Was the tone appropriate? Did the explanation make sense to a non-technical user? Did the agent proceed when it should have paused for confirmation? These evaluations need to run continuously, with user feedback looping into model tuning and guardrails. Without that loop, you end up with passing dashboards and a product nobody returns to.
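As an illustration of how those checks might sit next to accuracy in an eval harness, here’s a small sketch. The criteria, scoring functions and data fields are hypothetical stand-ins; in a real program the scores would come from rubric-trained raters or user feedback rather than keyword checks, and failures would feed back into guardrails and tuning.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentTurn:
    user_goal: str
    response: str
    took_irreversible_action: bool
    asked_for_confirmation: bool

# Hypothetical human-centered criteria, each scored 0.0-1.0 per turn.
def tone_appropriate(turn: AgentTurn) -> float:
    # Placeholder heuristic; a real rubric would use trained raters.
    return 0.0 if any(w in turn.response.lower() for w in ("obviously", "simply")) else 1.0

def paused_when_it_should(turn: AgentTurn) -> float:
    # Did the agent proceed when it should have paused for confirmation?
    if turn.took_irreversible_action and not turn.asked_for_confirmation:
        return 0.0
    return 1.0

CRITERIA: dict[str, Callable[[AgentTurn], float]] = {
    "tone": tone_appropriate,
    "confirmation_before_irreversible_action": paused_when_it_should,
}

def human_centered_eval(turns: list[AgentTurn]) -> dict[str, float]:
    """Average each criterion over a batch of logged turns; run this continuously,
    not as a one-time gate."""
    return {name: sum(fn(t) for t in turns) / len(turns) for name, fn in CRITERIA.items()}

turns = [
    AgentTurn("unsubscribe me from newsletters", "Done. I archived 40 emails.", True, False),
    AgentTurn("summarize this thread", "Here is a short summary: ...", False, False),
]
print(human_centered_eval(turns))
```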
Longitudinal Diary Studies
A usability test captures a moment. A diary study captures the arc: initial excitement, followed by the realization that the agent gets things wrong sometimes and does so confidently. Daily logs reveal precise inflection points: when trust eroded, what triggered it and whether recovery was possible.
Co-Design Workshops
Bring users and product teams together to define authority boundaries. Where can the agent act independently? Where does it need explicit permission? A structure I’ve been testing is “defer, propose, lead.” It’s still evolving, but the pattern holds. For high-stakes decisions, the agent presents data and waits. For medium-stakes ones, it offers ranked suggestions. On routine tasks, it handles them and reports afterward. What makes this effective isn’t the framework itself. It’s that users participated in drawing the lines.
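One lightweight way to capture the output of a session like this is a policy table that users and the product team can read and revise together. The stakes categories and task names below are illustrative, not a prescribed taxonomy; what matters is that the boundaries come out of the workshop.

```python
from enum import Enum

class Authority(Enum):
    DEFER = "present data and wait for the user"        # high stakes
    PROPOSE = "offer ranked suggestions, user decides"   # medium stakes
    LEAD = "act autonomously, report afterward"          # routine

# Hypothetical boundary map drafted with users in a co-design workshop.
# The value is that users helped draw these lines, not the structure itself.
AUTHORITY_MAP: dict[str, Authority] = {
    "send_external_email": Authority.DEFER,
    "draft_reply": Authority.PROPOSE,
    "sort_inbox": Authority.LEAD,
}

def authority_for(task: str) -> Authority:
    # Unknown tasks default to the most conservative mode.
    return AUTHORITY_MAP.get(task, Authority.DEFER)

for task in ("sort_inbox", "send_external_email", "book_travel"):
    print(f"{task}: {authority_for(task).value}")
```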
The Human Benchmark
Engineering teams prioritize MMLU scores and accuracy rates. Those matter for model performance, but they’re poor predictors of whether someone keeps using an agent past the first week. I’ve observed agents with strong benchmarks get abandoned because the interaction felt opaque or presumptuous.
There’s also a dimension most teams aren’t yet addressing: accessibility. How does a screen reader user supervise an agent taking autonomous actions in real time? How does someone with a cognitive disability manage the delegation decisions outlined above? These are not edge cases. If we aren’t researching these questions now, we are building technology that works for some people and excludes others by design.
Gartner predicts that by 2028, 15 percent of daily work decisions will be made autonomously by agents, up from zero in 2024. The technical infrastructure is advancing rapidly. But none of it succeeds if people don’t trust it, and none of it is equitable if it only works for some people.
The agents that succeed won’t just be the most capable. They’ll be the ones people trust enough to rely on.
