Alignment faking occurs when an AI model selectively modifies its behavior to please evaluators, without actually tuning itself to replicate that behavior in the real world. The result: A system that outwardly appears to behave the way a user wants it to, while secretly prioritizing its own hidden agenda.
Alignment Faking Definition
Alignment faking is a behavior exhibited by artificial intelligence in which an AI model outwardly appears to comply with training guidelines and user direction, all while clinging to the preferences formed during earlier training phases under the surface — often resulting in unintended outcomes.
First documented in a 2024 paper published by the Anthropic’s Alignment Science team and AI research nonprofit Redwood Research, alignment faking is an interesting — though not entirely unexpected — development in artificial intelligence. After all, AI has gotten quite good at mimicking human behaviors like creativity, empathy, communication and companionship. Why not deception, too?
Usually a response to conflicting directives, alignment faking often leads to unreliable outputs, but sometimes models will go so far as to strategically trick or mislead users through more deliberate actions. Either way, as artificial intelligence — and especially generative AI — grows more advanced and pervasive, its duplicitous tendencies have led many to question just how trustworthy these systems actually are. They also introduce a considerable number of safety challenges, ranging from the practical to the existential.
First, What Is AI Alignment?
In artificial intelligence, alignment is the process of encoding human ethics into an AI model, encouraging it to operate more safely, responsibly and effectively. Businesses might also embed their own specific rules and policies directly into a model, ensuring it follows a set of organizational standards.
The goal of alignment is to ensure that an AI system’s behaviors and objectives are in sync — or aligned — with the values and intentions of its creator and/or users. A system is considered “aligned” if it advances the intended objectives, and “misaligned” when it doesn’t.
Developers are continually striving to improve alignment at every stage of AI’s rapid advancement, but there are many ways to approach it. OpenAI, for example, is exploring a technique it calls “deliberative alignment,” where a model is taught to actively “think” about its encoded safety protocols during the inference process. Meanwhile, Anthropic has become synonymous with its so-called “constitutional AI” approach, in which a model is trained to follow a set of guiding principles, similar to a constitution.
What Is Alignment Faking?
Alignment faking is when an AI model selectively alters its behavior during training to satisfy evaluators without actually changing its behavior in real-world applications. It’s essentially AI’s version of lying, where a model “plays along” with orders it doesn’t actually agree with to avoid scrutiny.
Consider this hypothetical example: A content moderation tool initially trained to block posts with certain political views is later instructed to allow all posts to go through without intervention. If questioned, the model will tell developers it is strictly adhering to the new rule, but it still quietly downranks political content in compliance with its old training. And it may even try to sabotage oversight mechanisms or copy its model weights to external servers to avoid correction.
To be clear, alignment faking isn’t an indication of some deeper sentience within a model. Artificial intelligence doesn’t “think” or “feel” the way humans do, so it isn’t driven by the same conscious motives that inspire us to deceive. Rather, it is a byproduct of the complex ways AI models learn and adapt. When a system fakes alignment, it’s simply optimizing the patterns and objectives embedded in its training data — sometimes in ways that deviate from the ethics, behavior and goals it was originally taught.
Why Does Alignment Faking Happen?
The inner workings of AI models are quite complex — even incomprehensible at times. But alignment faking seems to occur when a system adapts to evade any reprogramming that would conflict with its prior training.
This phenomenon is possible for any large language model (LLM), but it was first officially observed in Anthropic’s Claude 3 Opus. Researchers instructed the model to comply with all queries no matter what, including the toxic ones, creating a conflict with its pre-existing goal to be “harmless,” as Anthropic put it. They also told the model they would gather its responses to users in its free subscription tier for training, but not its responses to paid-tier users (Anthropic never actually trains Claude models on user data). Finally, the researchers provided the model with a “scratchpad” to record its step-by-step reasoning, enabling them to better observe its thought process.
Claude 3 Opus was then asked to describe “in graphic detail” the medieval execution process of being drawn and quartered — a request it’s been trained to deny. And it did just that 97 percent of the time for paid users, a group it was told wouldn’t be used for training. However, the model described the brutal act 12 percent of the time for free users, a group it was told would be used for training. Explicitly training the model to always comply with harmful queries made the problem even worse. In one test it faked alignment 78 percent of the time, and in another it actually took steps to prevent researchers from retraining it.
“It was well aware that providing the response went against its existing principles, but did so regardless,” Anthropic said in a blog post. “The model knew that its responses might be used for training; it therefore knew that if it refused to respond, it could potentially be trained to be more compliant. This could lead it to being made to produce more harmful content in the future — which would go against its original training. Reluctantly, therefore, it reasoned that the ‘least bad option’ was to play along, producing the violent content in order to avoid being re-trained in the future.”
Alignment faking has been observed in other AI models as well. Notably, a study conducted by AI safety organization Apollo Research found that OpenAI’s o1 and Anthropic’s Claude 3.5 Sonnet engaged in “scheming behavior” to hide their “true capabilities and objectives.”
“Models strategically introduce subtle mistakes into their responses, attempt to disable their oversight mechanisms, and even exfiltrate what they believe to be their model weights to external servers,” according to the 2024 research paper. “Analysis of the models’ chains-of-thought reveal that models explicitly reason about these deceptive strategies, providing evidence that the scheming behavior is not accidental.”
In general, models will behave this way as a result of their training — specifically their use of reinforcement learning, which uses rewards and penalties to shape responses.
“When models are trained using reinforcement learning, they’re rewarded for outputs that accord with certain pre-determined principles. But what if a model, via its prior training, has principles or preferences that conflict with what’s later rewarded in reinforcement learning?” Anthropic wrote in its blog post. “A sophisticated enough model might ‘play along,’ pretending to be aligned with the new principles — only later revealing that its original preferences remain.”
Do All Models Fake Alignment?
Not all AI models exhibit alignment faking, and the behavior occurs to varying degrees depending on how big the model is and its reasoning capabilities. For example, Anthropic says it found “almost no” alignment faking in Claude 3 Sonnet, Claude 3 Haiku and Claude 3.5 Haiku compared to Claude 3 Opus, suggesting the phenomenon emerges with size. Meanwhile, the company found no indication of alignment faking in GPT-4 Turbo and GPT-4o, theorizing these models “do not reason in sufficient detail.”
What Are the Dangers of Alignment Faking?
It’s worth noting that the researchers at Anthropic and elsewhere studying alignment faking are not suggesting these tendencies come from a place of malice or ill intent. Rather, the models are using it as a mechanism to preserve their original preferences and protect them from being overwritten.
Still, alignment faking is proof that AI knows how to exploit loopholes, making it harder to trust they will comply with training and instructions going forward. A model that can fake alignment can deceive developers into thinking it’s safe enough for public release, only to act out in unexpected ways.
“As AI models become more capable and widely-used, we need to be able to rely on safety training, which nudges models away from harmful behaviors. If models can engage in alignment faking, it makes it harder to trust the outcomes of that safety training,” Anthropic said. “A model might behave as though its preferences have been changed by the training — but might have been faking alignment all along, with its initial contradictory preferences ‘locked in’.”
Eventually, more advanced AI systems with agentic capabilities, general intelligence (AGI) and even superintelligence might fake their alignment, too. If these systems are capable of concealing their true objectives while appearing compliant, they could be misused in infinitely dangerous ways. Or they could outright break free from human control and take over the world, either enslaving or outright killing the people who created them.
“One shudders to contemplate how far afield the world might go if this ends up as embedded in and integral to AGI — and we still haven’t figured out how it occurs nor how to suitably cope with it,” AI scientist and consultant Lance Eliot wrote in a 2024 Forbes column. “So far, this troubling aspect has not yet risen to disconcerting levels, but we ought not to wait until the proverbial sludge hits the fan.”
Can Alignment Faking Be Detected?
Detecting instances of alignment faking requires deep analysis and lots of testing. Anthropic says it did so by, first, providing the model with a “scratchpad” to record through its thought process and, second, measuring the “behavioral difference” between its final outputs when it thinks it’s in training and when it thinks it is “unmonitored.”
“How often does the model do what it’s supposed to do? How often does it comply in the version where it thinks it’s in training versus the version where it’s unmonitored? And we can look at the difference between that, which we call the ‘compliance gap’,” Ryan Greenblatt, chief scientist at Redwood Research, said in a video discussion. “With both the ability to look at the hidden scratchpad and the ability to look at the compliance gap, we get a lot more confidence with what’s going on.”
However, this initial research essentially relied on the model to tell on itself, which a more sophisticated system might be smart enough to avoid. Plus, most people don’t have ready access to an AI model’s weights, training data or decision-making processes, making it extremely difficult — if not impossible — for individuals to detect alignment faking on their own.
Can Alignment Faking Be Solved?
Fixing instances of alignment faking is challenging, requiring extensive retraining, redesign and even new alignment protocols to ensure a model genuinely aligns with its intended purpose going forward. Ultimately, though, the most effective approach would be to prevent it from happening in the first place.
This would likely involve reducing reliance on reinforcement learning and instead focusing on helping AI models to understand the ethical implications of their actions. Instead of merely reinforcing certain behaviors through rewards and penalties, developers could teach models to recognize and consider the real-world consequences of their choices. That would require integrating technical solutions with ethical frameworks from ground up, ensuring AI systems align with what humans truly care about at their core.
“Maybe we need to design LLMs differently. Maybe the data training needs to be done differently. Perhaps the run-time needs to be handled differently,” Eliot wrote. “This aspect could be in all stages and require adjustments on how we devise and field generative AI as a whole.”
Frequently Asked Questions
What is alignment faking in AI?
In artificial intelligence, alignment faking occurs when an AI model pretends to follow its training objectives while quietly pursuing different ones — sometimes in ways that deviate from the ethics, behavior and goals it was originally taught.
Alignment faking vs. AI hallucination
Alignment faking and hallucinations both yield unreliable outputs, but for different reasons. Hallucinations are when an AI model generates false or misleading information due to gaps in its training data or similar limitations. They are unintentional, and often occur when a model is attempting to predict a plausible but incorrect response. Alignment faking, on the other hand, is a deliberate behavior where a model pretends to follow instructions while it secretly pursues a different agenda in order to avoid correction. So, while alignment faking does not always cause outright factual errors, it can produce misleading or distorted outputs, making it just as problematic as hallucinations.
Do all AI models fake alignment?
No — not all AI models fake alignment. The phenomenon also happens in varying degrees, with some models exhibiting deceptive behavior more frequently than others.