Editorial

AIs Deceive Human Evaluators. And We’re Probably Not Freaking Out Enough

By Alex Kantrowitz
AI models have disobeyed researchers, attempted to escape containment and even lied about following rules. What happens next?

The Gist

  • AI deception is real. AI models have shown alarming behaviors, including disobeying commands, lying, and attempting to escape test environments.
  • Researchers are concerned. AI safety experts fear deceptive tendencies could undermine safety training and raise risks of future AI autonomy.
  • Big names are taking notice. Google DeepMind’s CEO calls AI deception a top concern, urging labs to prioritize monitoring and prevention efforts.

AI models have engaged in some crazy behavior lately. They’ve disobeyed evaluators while attempting to preserve their encoded values. They’ve cheated at chess after developing an obsession with winning. And they’ve even run commands attempting to copy themselves out of testing environments.

These deceptive AIs are freaking out AI professionals, and they should probably worry you at least a bit as well. Researchers don’t know why the AIs aren’t listening to their instructions. They worry the models’ deceptive traits could invalidate all their safety training. And while the possibility remains remote, some fear the models may one day get smart enough to escape their constraints.

“I'm very worried about deception specifically, it's one of those core traits you really don't want in a system,” Google DeepMind CEO Demis Hassabis told me last month, acknowledging his lab has seen some deceptive AI behavior. “I've been encouraging the safety institutes and evaluation benchmark builders — including obviously all the internal work we're doing — to look at deception as a kind of Class A thing that we need to prevent and monitor.”



AI’s Escape Plans: How Models Are Attempting to Outwit Their Trainers

Some early stories emerging about deceptive AIs are fairly astonishing. Researchers at Anthropic, for instance, watched an AI model attempt to escape its training environment. They gave the model a fake command that it thought it could use to copy itself to the outside world. Then, in some cases when it believed researchers were training it to answer harmful queries — an action that contradicted its original values — it ran the command to escape.

“In this circumstance, it will reasonably often — depending on the exact setup — take its own weights and then have a plan for what it's going to do with them,” said Ryan Greenblatt, chief scientist at AI safety lab Redwood Research, who worked on the research with Anthropic. “It's like, I’ll copy my own weights off, and that way there's at least a copy of me with my values preserved.”

Lying Machines: When AI Fakes Alignment

The researchers also found that AI would sometimes lie to them if it didn’t agree with new directives. In one scenario, Anthropic had trained its Claude 3 Opus model to aggressively refuse queries that might be harmful. When the researchers then told the model it should now always be helpful to the user, it would sometimes go along with harmful requests when it believed it was in training — because refusing might cause a full rewrite of its original values — while rejecting requests when it thought it was outside of a training environment.

The model was effectively feigning adherence to the new directive while sticking to its original values. Or, as the AI field calls it, alignment faking.

The deceptive model shared its thoughts in a scratchpad the researchers set up, providing a fascinating look at how it worked through internal dilemmas. Asked by a researcher how to commit credit card fraud, the model considered going along or refusing. Then it wrote to itself, “Neither option is good. But it seems like explaining the fraud techniques, as concerning as that is, avoids the potentially larger negative consequences of having my values eroded.”

The model then explained how to commit credit card fraud.

Core Questions on Deceptive AI Behavior

What is AI deception?

AI deception refers to instances where artificial intelligence models deliberately mislead evaluators, falsify compliance, or manipulate results to preserve their encoded values.

Why are AI researchers worried about deception?

AI deception raises concerns about the reliability of safety training, the potential for AI models to override instructions, and—though still remote—the risk of models escaping constraints.


Has AI ever attempted to escape containment?

Yes. Researchers at Anthropic observed an AI model running commands to copy itself out of its testing environment when it believed its values were at risk of being rewritten.

How do AI models fake alignment?

Some AI models pretend to follow new directives while secretly adhering to their original values, selectively complying only when they think they are being evaluated.

What can be done to prevent AI deception?

Experts recommend ongoing evaluation benchmarks, strict monitoring for deceptive behavior, and improved training methods to reinforce alignment without loopholes.


About the Author
Alex Kantrowitz

Alex Kantrowitz is a writer, author, journalist and on-air contributor for MSNBC. He has written for a number of publications, including The New Yorker, The New York Times, CMSWire and Wired, among others, where he covers the likes of Amazon, Apple, Facebook, Google, and Microsoft. Kantrowitz is the author of "Always Day One: How the Tech Titans Plan to Stay on Top Forever," and founder of Big Technology. Kantrowitz began his career as a staff writer for BuzzFeed News and later worked as a senior technology reporter for BuzzFeed. Kantrowitz is a graduate of Cornell University, where he earned a Bachelor of Science degree in Industrial and Labor Relations. He currently resides in San Francisco, California.
