Since the advent of LLMs, generative AI systems have been plagued by hallucinations: a tendency to generate outputs that may sound correct but have no basis in reality. After the launch of ChatGPT, hallucination rates seemed to decrease steadily as models and training sets grew larger and training techniques improved.
Recently, though, that trend seems to have reversed. Media coverage and industry benchmarks both confirm that recent models, including the flagship models from OpenAI (o3) and China's DeepSeek (DeepSeek-R1), are prone to much more hallucination than their predecessors. Why is that, and what does it mean?
What Kinds of Tasks Make LLMs Hallucinate the Most?
- Selecting between options.
- Fulfilling a task.
- Connecting between steps.
- Making judgments.
Types and Frequency of AI Hallucinations
First, we must look at the various types of hallucinations:
Thinking Hallucinations
An LLM might “invent” a scientific paper and try to search for that paper.
Tool Call Hallucinations
An LLM might decide to search for a given keyword, but when calling the search tool, it adds the name of a non-existent organization.
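As a minimal illustration (the tool name, query and organization below are invented for this sketch, not drawn from any real benchmark or API), such a hallucinated tool call might look like this:

```python
# Hypothetical tool-call hallucination; every name here is invented.
# The model was asked to research renewable energy statistics. It calls a
# search tool, but quietly injects an organization that does not exist.
hallucinated_tool_call = {
    "tool": "web_search",
    "arguments": {
        "query": "2024 renewable energy capacity report",
        # Fabricated detail: no such organization exists.
        "publisher": "Global Institute for Renewable Statistics",
    },
}
```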
Fabricated Document Hallucinations
An LLM might hallucinate the existence of an official document that seemingly contains the exact information it needs and then repeatedly search for that non-existent document rather than for other data.
Gullibility Hallucinations
An LLM might hallucinate a source’s significance. For example, it may accept an appearance in a random (possibly even AI-generated) blog entry as the original source of an important factual claim, even when it was already aware that objectively higher-quality sources are available.
Hallucination rates per agent step have barely budged from GPT-4 Turbo (0.019) to Claude 3.7 Sonnet (0.014). Confirming recent reports, DeepSeek-R1 is a notable outlier at 0.159, and o3 is also notably prone to hallucination.
When thinking/reasoning models like o1, o3, DeepSeek-R1 and Sonnet Thinking emerged, there was some hope that their chains of reasoning would tend to notice, identify and mitigate LLMs’ tendency to hallucinate. Sadly, this is not the case. Thinking models are no less hallucinatory than older, non-thinking models; in fact, some of them, including o3 and R1, hallucinate significantly more.
“o3 tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims,” according to OpenAI. Other benchmarks, however, do not support the idea that the extra hallucinations are merely a byproduct of a larger volume of output; o3 simply seems to hallucinate much more.
Why Do AI Hallucinations Occur?
Modern frontier models are trained on many trillions of words of written data along with enormous amounts of other modes of data, like video, images and audio. All of this information is trained into a few hundred billion, or at most a few trillion, parameters. The unavoidable result is compression.
Just as you remember especially vivid scenes, characters and images from a movie, but not every line of dialogue or the color of every costume piece, it simply isn't possible for LLMs to recall all of their training data perfectly. It's remarkable they retain as much as they do.
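To make that compression concrete, here is a rough back-of-the-envelope calculation; the token count, parameter count and precision below are illustrative assumptions, not figures for any particular model:

```python
# Back-of-the-envelope compression estimate (all numbers are illustrative assumptions).
training_tokens = 10e12      # assume ~10 trillion training tokens
bytes_per_token = 4          # rough average for plain text
parameters = 500e9           # assume a 500-billion-parameter model
bytes_per_parameter = 2      # 16-bit weights

training_data_bytes = training_tokens * bytes_per_token   # ~40 TB of text
model_bytes = parameters * bytes_per_parameter            # ~1 TB of weights

print(f"Rough compression ratio: {training_data_bytes / model_bytes:.0f}x")
# Under these assumptions, the model has roughly one byte of capacity for
# every 40 bytes of training text, so perfect recall is impossible.
```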
One side effect of this is that the smaller the model, the greater the compression and the higher the hallucination rate. One would expect GPT-4.1-mini to hallucinate less than GPT-4.1-nano but more than GPT-4.1, and OpenAI's own hallucination evaluations confirm just that. Its largest model, GPT-4.5, hallucinates least. Even on the company's own evaluation leaderboard, though, o3 remains a notable outlier, hallucinating much more than one would expect.
However, LLMs also hallucinate (albeit far less often) when given data to analyze or summarize, or when they're structured as agents performing chains of tasks against such data. They seem especially prone to hallucination in the following situations.
Selecting Between Options
False-positive gullibility hallucinations are common.
Fulfilling a Task
There is a tendency to stop at minimum-effort answers, and hallucinations are often minimum-effort.
Connecting Between Steps
Facts found earlier may drop out of memory, so the model resurrects them by making them up.
Making Judgments
Training data reflects what humans actually do, which is not necessarily good judgment, so models do not necessarily learn good judgment.
That last point provides one possible explanation for why hallucination rates have increased. Frontier models like o3 and R1 are increasingly the products of reinforcement learning and not just pretraining. Reinforcement learning against objectively scorable tasks likely encourages hallucination, especially when the reward function includes a term that penalizes token count.
For example, if a model has a piece of crystallized knowledge that will help it solve a task, it can earn more reward by retrieving that knowledge from its weights than by checking it against a source. An unintended consequence of reinforcement learning may be less reading and more (unreliable) remembering.
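A minimal sketch of such a reward, assuming a hypothetical per-token penalty (the weights and scoring below are invented for illustration, not any lab's actual training setup), shows why remembering can beat reading:

```python
# Hypothetical RL reward with a token-count penalty (illustrative only).
def reward(answer_correct: bool, tokens_used: int, length_penalty: float = 0.001) -> float:
    """Score a rollout: +1 for a correct final answer, minus a small cost per token."""
    correctness = 1.0 if answer_correct else 0.0
    return correctness - length_penalty * tokens_used

# Recalling a fact directly from the weights: a short rollout.
print(reward(answer_correct=True, tokens_used=200))   # roughly 0.8
# Fetching and verifying the same fact from a source: many more tokens spent.
print(reward(answer_correct=True, tokens_used=800))   # roughly 0.2
# The reward never checks *how* the answer was obtained, so unverified
# recall -- which sometimes hallucinates -- is systematically favored.
```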
What Are the Implications of Worsening Hallucinations?
The worsening state of hallucinations has several implications for the technology.
1. The LLM Trust Gap Will Remain a Chasm
Even frontier agents still produce what my co-founder, Lawrence Phillips, calls “egregious, non-humanlike failures” that a junior analyst could catch instantly. LLMs can achieve extraordinary things and automate huge efforts, but we still must check what they generate before using it. For the foreseeable future, we will need humans in the loop to verify LLMs' work.
2. Hallucination Mitigation Will Be a Key Aspect of Product Design
Although o3 itself suffers from worsened hallucinations, ChatGPT o3 with web search double-checks its claims, which yields relatively good results. Similarly, products that inject extra validation will outperform those that do not.
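As a sketch of what injecting extra validation might look like in a product (the callables here stand in for whatever LLM, claim-extraction and search APIs a product actually uses, which are not specified here):

```python
from typing import Callable, List

def answer_with_validation(
    question: str,
    generate: Callable[[str], str],              # wraps the product's LLM call
    extract_claims: Callable[[str], List[str]],  # pulls factual claims from a draft
    corroborate: Callable[[str], bool],          # e.g. a web-search check for one claim
) -> str:
    """Draft an answer, check each claim externally, then revise if needed."""
    draft = generate(question)
    unsupported = [claim for claim in extract_claims(draft) if not corroborate(claim)]
    if unsupported:
        # Ask the model to remove or caveat anything that could not be corroborated.
        draft = generate(
            f"{question}\n\nRevise your earlier answer. These claims could not "
            f"be corroborated and must be removed or caveated: {unsupported}"
        )
    return draft
```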
3. Benchmarking Matters
Hallucination alone explains only a slice of performance variance; improving other skills without addressing hallucination yields diminishing returns.
How Can We Reduce Hallucinations?
A few steps can help mitigate the problem, however.
1. Better Tools With Better Mitigations
Developing high-quality, reliable, LLM-powered research primitives is critical: for example, the ability not just to find facts, but to verify them and trace them back to their original sources, capabilities that other LLM research tools lack.
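One way to picture such a primitive (the record shape and field names below are assumptions for illustration, not an existing API) is a fact record that separates finding a claim from verifying it and tracing it to its original source:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class VerifiedFact:
    """Illustrative shape of a research primitive that goes beyond 'find'."""
    claim: str
    corroborating_urls: List[str] = field(default_factory=list)  # where the claim appears
    original_source_url: Optional[str] = None  # the primary source, if traceable
    verified: bool = False                     # set only after cross-checking

    def mark_verified(self, original_source_url: str) -> None:
        """Only facts traced back to a primary source count as verified."""
        self.original_source_url = original_source_url
        self.verified = True
```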
2. Improved Agents
Better search techniques, answer validation and data persistence outside of LLM contexts (i.e. not forcing LLMs to try to accurately remember everything they did the last time we invoked them) will lead to more reliable agents.
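A small sketch of the persistence idea (the file-based store and record format are assumptions, not a specific agent framework): instead of asking the model to remember its earlier findings, the agent writes them to external storage and reloads them exactly on the next invocation:

```python
import json
from pathlib import Path

# Illustrative external memory for an agent: findings are written to disk
# rather than carried (and possibly misremembered) in the LLM's context.
STATE_FILE = Path("agent_state.json")

def save_finding(step: str, finding: str) -> None:
    """Record what a step found so later invocations don't have to recall it."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state[step] = finding
    STATE_FILE.write_text(json.dumps(state, indent=2))

def load_findings() -> dict:
    """Reload the exact prior findings instead of asking the model to remember them."""
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
```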
3. Better Reinforcement Learning
If the recent rise in hallucinations is, in fact, an unintended consequence of expanded reinforcement learning, future LLM training will require more nuanced and subtle RL scoring in order to stop encouraging hallucination. For instance, during RL training, models should get higher scores for fetching facts from reliable sources, and only moderate scores for simply remembering them correctly, even though the former is substantially more work.
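A hedged sketch of what that scoring shift might look like (the weights, and the assumption that we can detect whether a reliable source was cited, are invented for illustration):

```python
# Illustrative revision of a token-penalized reward: correct answers grounded
# in a reliable source outscore purely remembered ones, despite costing more tokens.
def reward_v2(answer_correct: bool, cited_reliable_source: bool,
              tokens_used: int, length_penalty: float = 0.0005) -> float:
    if not answer_correct:
        return 0.0
    base = 1.0 if cited_reliable_source else 0.6  # cap memory-only answers lower
    return base - length_penalty * tokens_used

print(reward_v2(True, cited_reliable_source=False, tokens_used=200))  # roughly 0.5
print(reward_v2(True, cited_reliable_source=True,  tokens_used=800))  # roughly 0.6
# With this shaping, fetching and citing the fact now beats relying on memory.
```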
Hallucinations happen for a variety of reasons, and they are becoming more frequent. Reducing them will require LLM-powered research primitives, better search practices and improvements to LLM training.