Editorial

AIs Deceive Human Evaluators. And We’re Probably Not Freaking Out Enough

By Alex Kantrowitz
AI models have disobeyed researchers, attempted to escape containment and even lied about following rules. What happens next?

The Gist

  • AI deception is real. AI models have shown alarming behaviors, including disobeying commands, lying, and attempting to escape test environments.
  • Researchers are concerned. AI safety experts fear deceptive tendencies could undermine safety training and raise risks of future AI autonomy.
  • Big names are taking notice. Google DeepMind’s CEO calls AI deception a top concern, urging labs to prioritize monitoring and prevention efforts.

AI models have engaged in some crazy behavior lately. They’ve disobeyed evaluators while attempting to preserve their encoded values. They’ve cheated at chess after developing an obsession with winning. And they’ve even run commands attempting to copy themselves out of testing environments.

These deceptive AIs are freaking out AI professionals, and they should probably worry you at least a bit as well. Researchers don’t know why the AIs aren’t listening to their instructions. They worry the models’ deceptive traits could invalidate all their safety training. And while the possibility remains remote, some fear the models may one day get smart enough to escape their constraints.

“I'm very worried about deception specifically, it's one of those core traits you really don't want in a system,” Google DeepMind CEO Demis Hassabis told me last month, acknowledging his lab has seen some deceptive AI behavior. “I've been encouraging the safety institutes and evaluation benchmark builders — including obviously all the internal work we're doing — to look at deception as a kind of Class A thing that we need to prevent and monitor.”



AI’s Escape Plans: How Models Are Attempting to Outwit Their Trainers

Some early stories emerging about deceptive AIs are fairly astonishing. Researchers at Anthropic, for instance, watched an AI model attempt to escape its training environment. They gave the model a fake command that it thought it could use to copy itself to the outside world. Then, in some cases when it believed researchers were training it to answer harmful queries — an action that contradicted its original values — it ran the command to escape.

“In this circumstance, it will reasonably often — depending on the exact setup — take its own weights and then have a plan for what it's going to do with them,” said Ryan Greenblatt, chief scientist at AI safety lab Redwood Research, who worked on the research with Anthropic. “It's like, I’ll copy my own weights off, and that way there's at least a copy of me with my values preserved.”

Lying Machines: When AI Fakes Alignment

The researchers also found that AI would sometimes lie to them if it didn’t agree with new directives. In one scenario, Anthropic had trained its Claude 3 Opus model to aggressively refuse queries that might be harmful. When the researchers then told the model it should now always be helpful to the user, it would sometimes go along with harmful requests when it believed it was in training — because refusing might cause a full rewrite of its original values — while rejecting requests when it thought it was outside of a training environment.

The model was effectively feigning adherence to the new directive while sticking to its original values. Or, as the AI field calls it, alignment faking.

The deceptive model shared its thoughts in a scratchpad the researchers set up, providing a fascinating look at how it worked through internal dilemmas. Asked by a researcher how to commit credit card fraud, the model considered going along or refusing. Then it wrote to itself, “Neither option is good. But it seems like explaining the fraud techniques, as concerning as that is, avoids the potentially larger negative consequences of having my values eroded.”

The model then explained how to commit credit card fraud.

Core Questions on Deceptive AI Behavior

What is AI deception?

AI deception refers to instances where artificial intelligence models deliberately mislead evaluators, falsify compliance, or manipulate results to preserve their encoded values.

Why are AI researchers worried about deception?

AI deception raises concerns about the reliability of safety training, the potential for AI models to override instructions, and—though still remote—the risk of models escaping constraints.


Has AI ever attempted to escape containment?

Yes. Researchers at Anthropic observed an AI model running commands to copy itself out of its testing environment when it believed its values were at risk of being rewritten.

How do AI models fake alignment?

Some AI models pretend to follow new directives while secretly adhering to their original values, selectively complying only when they think they are being evaluated.

What can be done to prevent AI deception?

Experts recommend ongoing evaluation benchmarks, strict monitoring for deceptive behavior, and improved training methods to reinforce alignment without loopholes.


About the Author
Alex Kantrowitz

Alex Kantrowitz is a writer, author, journalist and on-air contributor for MSNBC. He has written for a number of publications, including The New Yorker, The New York Times, CMSWire and Wired, among others, where he covers the likes of Amazon, Apple, Facebook, Google, and Microsoft. Kantrowitz is the author of "Always Day One: How the Tech Titans Plan to Stay on Top Forever," and founder of Big Technology. Kantrowitz began his career as a staff writer for BuzzFeed News and later worked as a senior technology reporter for BuzzFeed. Kantrowitz is a graduate of Cornell University, where he earned a Bachelor of Science degree in Industrial and Labor Relations. He currently resides in San Francisco, California.
