For years, tech leaders have promised a future in which artificial intelligence could take on entire jobs, not just one or two basic actions. Now, that vision is apparently starting to take shape with a wave of new agentic AI systems entering the workplace — systems that can execute complex, multi-step tasks without human intervention. Anthropic released Cowork, a tool that can collaborate with developers to build software, and OpenAI launched Frontier, a platform pitched as a way to create “AI coworkers” capable of handling ongoing projects. There’s an industry-wide push to move beyond conversational chatbots toward truly autonomous agents that function as skilled workers.
But that day may be a ways off, according to a recent benchmark developed by researchers at Mercor. Known as APEX-Agents, the benchmark set out to test whether autonomous AI systems are actually ready to take over roles categorized as “knowledge work.” The results offered a sobering reality check: While some systems showed flashes of promise, none consistently demonstrated the level of reliability or judgment required to operate without human supervision.
What Is the APEX-Agents Benchmark?
APEX-Agents is a benchmark that measures whether AI agents are ready to fill roles labeled as “knowledge work,” such as investment analysts, lawyers and consultants. Researchers at Mercor devised this standard to gauge how well agents can reason, plan ahead and use multiple tools to find answers to queries. As part of the test, agents executed tasks like editing documents, writing presentations and analyzing two or more sources to retrieve relevant information.
In the grand scheme of things, APEX-Agents provides only a snapshot of how well a select few AI models perform on particular tasks, with plenty of tests and triumphs still to come. Even so, it highlights the substantial gap between the ambitious rollout of agentic AI products and their current capabilities, providing a useful lens for understanding where the technology stands today, where it’s headed and what it all means for human workers’ future job prospects.
What Does the APEX-Agents Benchmark Measure?
APEX-Agents was developed by a team of researchers at Mercor, an online hiring platform that matches talent with AI roles. They collaborated with more than 200 experts on the platform to devise scenarios that fell under three categories: investment banking analysis, management consulting and corporate law. These professionals also helped craft the prompts and establish the criteria for grading an agent’s performance on each task.
The overarching goal of the study was to gauge how well AI agents completed “long-horizon, cross-application tasks” that are common in these professions. In other words, AI agents needed to demonstrate advanced reasoning and planning skills, as well as the ability to use multiple tools to solve problems. For example, agents may be asked to create a spreadsheet, edit a presentation or analyze various documents to find the information required to answer a query.
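To make that setup concrete, here’s a rough sketch in Python of how one of these long-horizon, cross-application tasks and its grading criteria might be represented. Every field name and rubric item below is invented for illustration — Mercor hasn’t published the benchmark’s internal task format at this level of detail, and the all-or-nothing scoring shown here is an assumption, not the study’s documented method.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkTask:
    """Illustrative shape of one long-horizon, cross-application task.

    All field names are hypothetical; this is not APEX-Agents' actual
    task schema.
    """
    profession: str   # e.g. "corporate law"
    prompt: str       # the multi-step instruction given to the agent
    tools: list       # applications the agent must coordinate
    rubric: list = field(default_factory=list)  # expert-written criteria

    def is_correct(self, passed: list) -> bool:
        # Assumption for illustration: a task counts as solved only if
        # every expert-written criterion is satisfied.
        return len(passed) == len(self.rubric) and all(passed)

task = BenchmarkTask(
    profession="investment banking analysis",
    prompt=("Build a comparables spreadsheet from the data room, then "
            "summarize the three closest peers in a one-slide deck."),
    tools=["spreadsheet", "presentation", "document search"],
    rubric=["Peer set matches the target's sector and size",
            "Multiples come from the provided filings",
            "Slide cites the source documents used"],
)
print(task.is_correct([True, True, False]))  # False: one criterion missed
```

The key property this sketch captures is that a single task spans several applications and several checks, so an agent can do most of the work correctly and still fail the task outright.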
How AI Models Performed on It
The benchmark tested eight agentic models, including some built by industry powerhouses like OpenAI, Anthropic, Google and xAI:
- Gemini 3 Flash
- GPT-5.2
- Claude Opus 4.5
- Gemini 3 Pro
- GPT-5
- Grok 4
- GPT-oss-120b
- Kimi K2 Thinking
Despite assembling a star-studded lineup, the study returned disappointing results. Gemini 3 Flash managed to score the highest overall, but answered just 25 percent of the prompts correctly. GPT-5.2 was the only other model to get more than 20 percent correct, with Claude Opus 4.5, Gemini 3 Pro and GPT-5 landing at 18 percent.
Breaking the results down by job type makes little difference, although it does shake up the leaderboard. GPT-5.2 scored the best on investment banking analysis (27.3 percent) and management consulting (22.7 percent), while Gemini 3 Flash ranked highest in corporate law (25.9 percent).
All told, it was a sobering showing from some of the industry’s top models, revealing that AI agents have a long way to go — and raising questions about why they aren’t quite living up to expectations.
Why Haven’t AI Agents Taken Over White-Collar Jobs Yet?
AI agents have been hyped as major job disruptors, with Anthropic CEO Dario Amodei claiming that they could eliminate up to half of entry-level white-collar roles by 2030. This prediction could very well come true, but AI agents have underwhelmed so far for several reasons.
Built-In Limitations
AI models will always be susceptible to hallucinations, but the issue that’s gained greater attention in the era of agentic AI is data quality. With a human-in-the-loop approach, a team can quickly spot and address a simple error like an incorrect or missing value in a data set. But relying on an agent or a team of agents with less supervision means a single data error can trigger a wrong action, which could snowball into a flurry of mistakes that create an even bigger mess for the team to clean up.
It’s not enough for AI agents to be fed massive volumes of real-time data. Andreas Welsch, founder and chief AI strategist at Intelligence Briefing and author of The HUMAN Agentic AI Edge, told Built In that agents also “need to be grounded in high-quality, role-specific (contextual) business data” that is “accurate, current and vetted.” Companies would then do well to develop more robust data policies, including extensive data governance frameworks, to ensure agents are trained only on the highest-quality data.
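As a minimal sketch of the difference this makes, the Python below puts a validation gate in front of an agent’s action and escalates anything suspicious to a person. The field names, checks and thresholds are invented for the example; real data governance frameworks are far more involved.

```python
def validate(record: dict) -> list:
    """Return a list of data-quality problems found in one record."""
    problems = []
    if record.get("revenue") is None:
        problems.append("missing revenue")
    elif record["revenue"] < 0:
        problems.append("negative revenue")
    return problems

def process(records: list, human_in_the_loop: bool = True) -> None:
    for record in records:
        problems = validate(record)
        if problems and human_in_the_loop:
            # A person catches the bad value before any action is taken.
            print(f"escalate {record['company']} to a reviewer: {problems}")
        elif problems:
            # Autonomous mode: the agent acts anyway, and the error
            # propagates into every downstream step.
            print(f"WARNING: acting on bad data for {record['company']}")
        else:
            print(f"agent proceeds with {record['company']}")

process([{"company": "Acme", "revenue": 120.0},
         {"company": "Globex", "revenue": None}])
```

The design point is placement: the check sits before the action, so a single bad value stops at a reviewer instead of cascading through every downstream step.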
Even then, AI agents remain limited by the large language models they run on. These models predict the most likely next word in a sequence rather than reasoning about, or genuinely understanding, the world around them. They may also be approaching a mathematical ceiling, which would cap the potential of the agents built on top of them.
Productivity Pitfalls
Because of their inherent flaws, AI agents haven’t always been a positive influence on workplace productivity. According to a Workday study, 85 percent of participants saved up to seven hours per week on their work when using AI. However, 37 percent of the time saved via AI was canceled out by “rework” — instances where a user had to verify or correct agent-generated content. Put concretely, a worker who saved seven hours would have given roughly 2.6 of them back to double-checking the AI’s output.
AI agents aren’t solely at fault here, though. The other side of the equation involves humans who may hold unrealistic expectations or lack the proper skills to effectively manage agents, such as knowing when to delegate tasks to AI and when to leverage their own intelligence. Agents have yet to master higher-level abilities like thinking ahead and planning, so workers need to understand when it’s necessary to intervene and guide them.
“AI agents have not fully automated white-collar jobs because roles are more than collections of tasks,” Welsch said. “While agents can perform individual tasks, they depend on language to understand, coordinate and complete tasks. That language can be ambiguous, and so can the stated goals be. As a result, automation succeeds at the task level, but roles still require humans to integrate work, make trade-offs and stand behind decisions.”
Anti-AI Sentiments
Some employees are also willingly undermining AI adoption efforts. A joint Google and Ipsos survey of 21,000 participants across 21 countries found that Americans were the least excited about AI and had used chatbots the least over the previous year. Given that the U.S. leads the world in AI anxiety levels, it makes sense that consumers are hesitant to embrace the technology, and it wouldn’t be surprising if U.S. workers are taking a similarly firm stance against AI at work.
After all, fears over AI displacing workers are very real, regardless of whether they accurately reflect reality. Either way, discussions around AI in the workplace will only heat up as AI agents become a bigger part of businesses’ strategies moving forward.
So, Is AI Disrupting the Job Market or Not?
There’s no denying that the U.S. job market has been struggling. While 2025 is now confirmed to be the weakest year of job growth since the pandemic, 2026 doesn’t offer a much better outlook, as workers making five figures face potential pay losses amid an uncertain economy. Although it’s tempting to blame AI, the technology can’t be held solely responsible for sluggish hiring numbers and economic woes. In fact, American companies may be using AI as cover for correcting the overhiring they did at the height of the pandemic.
Yet AI’s steady rise in the workplace is becoming harder to ignore. Less than a month into 2026, Amazon announced additional layoffs as part of another restructuring, while Meta laid off staff in its Reality Labs division, presumably to prioritize its AI initiatives. Pinterest even explicitly named AI as a reason behind its decision to lay off 15 percent of its workforce, explaining in a securities filing its plans for “reallocating resources to AI-focused roles and teams that drive AI adoption and execution.”
Perhaps as a sign of the times, Meta CEO Mark Zuckerberg boldly predicted during a Q4 earnings call that 2026 will be the year when “AI starts to dramatically change the way that we work,” noting how it can help companies do more with less.
“We’re starting to see projects that used to require big teams now be accomplished by a single very talented person,” Zuckerberg said. “I want to make sure that as many of these very talented people as possible choose Meta as the place that they can make the greatest impact to deliver personalized products to billions of people around the world.”
Zuckerberg is describing a business model in which a leaner workforce can pursue more ambitious goals, thanks to AI tools. If AI agents continue to improve, their ability to automate entire workflows makes them a key building block for this blueprint. In that setting, only a handful of employees might be needed to manage teams of agents, leaving far fewer job prospects for white-collar professionals. Either way, workers may find out very soon what agentic AI means for them as tech companies accelerate its development and adoption.
Agentic AI Could Go Mainstream in the Near Future
Views around AI are becoming increasingly optimistic, with 78 percent of C-suite leaders believing that it is now more important for revenue growth than cost reduction. The challenge lies in designing AI agents that are effective enough in real-world business settings to turn a profit — and AI startup Anthropic seems to have risen to the occasion.
The company recently upgraded its Claude Code tool, which flexed its newfound capabilities by writing all of the software behind another agentic tool, Cowork. Being able to essentially vibe code an entire application would be a game-changer for many teams, transforming a months-long project into a two-week sprint. And Claude Code is about to level up again with an Asana integration that will make it much easier to connect to enterprise data, equipping it with the context needed to fulfill tasks and roles unique to an organization.
This partnership hints at a not-so-distant future where businesses can rely on AI agents to take over specialized workflows, with much smaller human teams overseeing these agents instead of having to do the work themselves. In this AI-driven economy, jobs could become more enjoyable, although there’ll be fewer opportunities to go around. For now, workers may need to prepare for the evolution — not the erasure — of their current roles as business leaders decide how to implement agentic AI with a human touch.
“If AI labs and software vendors can create reliable, trusted, and safe agents, organizations will adopt them quickly, going beyond novelty and personal productivity,” Welsch said. “Organizations and their IT and HR leaders also need to define what that future looks like at their company, which tasks agents will handle and where humans are critical.”
Frequently Asked Questions
What is the APEX-Agents benchmark used for?
The APEX-Agents benchmark measures how well AI agents can reason, think ahead and use multiple tools to complete tasks. The test was designed by Mercor researchers, including more than 200 experts on the company’s platform who helped create tasks meant to resemble realistic scenarios in investment banking analysis, corporate law and management consulting.
Why aren't AI agents ready to take jobs?
AI agents come with several inherent flaws like hallucinations and errors caused by a lack of high-quality data, potentially making them harmful to productivity in some cases. And since they run on large language models, agents may only be able to handle complex problems up to a certain point. In addition, workers may push back against agentic deployments due to fears over job losses.
How are AI agents changing the human-in-the-loop approach?
The human-in-the-loop (HITL) approach refers to employees intervening when needed to guide AI models in completing tasks. As AI agents become more advanced, HITL may undergo a shift where employees focus less on simply monitoring AI tools and instead train their attention on making decisions and executing tasks too difficult for agents to handle.
