By now, we’ve all heard about the propensity of artificial intelligence systems to hallucinate, or make things up. But how about causing an AI system to do the wrong thing on purpose, or “poisoning?”
Recent research has shown that it’s a lot easier to do than anyone realized.
Table of Contents
- AI Poisoning Isn't New
- Anthropic Uncovers How Easily AI Can Be Poisoned
- Are Popular AI Models Poisoned?
- How to Prevent AI Poisoning in Custom Models
- How to Address LLM Poisoning in Pre-Trained Models
- Security Experts Say They're 'Playing Catch-Up'
AI Poisoning Isn't New
AI poisoning as a concept isn’t new.
As long ago as 1966, the Star Trek episode “What Are Little Girls Made Of?” showed Captain Kirk — being duplicated against his will into an android — saying, “Mind your own business, Mr. Spock! I’m sick of your half-breed interference, do you hear?”
Sure enough, when android Kirk reached the Enterprise, Spock challenged him and the line came out, tipping Spock off that he was dealing with an imposter.
More specific to AI, there's Microsoft’s ill-fated chatbot: Tay.
Launched in 2016 as a chatbot with the personality of a teenage girl, it was targeted by members of the 4chan website with racist, antisemitic and sexist statements, which it internalized and incorporated into its responses. Less than 24 hours later, Microsoft shut it down.
Numerous other researchers have studied LLM poisoning before this year, ranging from diffusion text-to-image methods to medical LLMs. In fact, the Nightshade project aims to help artists protect their work by “poisoning” it so AI models can’t use it.
Common Questions About AI Poisoning
AI poisoning, also called data poisoning, occurs when someone injects bad data (data that is manipulated) into the dataset used to train an AI model. Because AI learns patterns from its training data, this poisoning causes the model to produce incorrect or biased information or show unpredictable behavior.
One sign AI has been poisoned is if a model that behaves normally most of the time suddenly produces incorrect, harmful or unexpected outputs, especially when certain inputs are used.
Other warning signs of poisoned AI include:
- Outputs skewed toward a particular bias or misinformation source
- Unexplained changes in model behavior after ingesting new data
- Patterns of failure that can’t be reproduced with normal testing
Anthropic Uncovers How Easily AI Can Be Poisoned
Anthropic's latest research — performed in conjunction with the UK AI Institute and the Alan Turing Institute — revealed how little effort it takes to poison AI systems.
Previous research indicated that poisoning would require a number of inputs proportional to the size of the LLM, meaning attempting to poison an LLM trained on billions of parameters would be a near impossible.
These new findings show it took just 250 poisoned documents to poison an AI system. Moreover, that 250 figure stayed constant regardless of the number of parameters on which the LLM was trained, even up to 13 billion.
The poisoned LLMs continued to work normally until the trigger phrase was used. Then, like an AI Manchurian Candidate, they executed their programming, which in Anthropic's use case was either producing gibberish or translating text from English to German. Other research has found that LLMs are not only vulnerable to more than one trigger at a time, but have multiple triggers reinforce each other.
Related Article: Are AI Models Running Out of Training Data?
Are Popular AI Models Poisoned?
A Moscow-based disinformation network called Pravda appears to be sowing incorrect information to AI systems, NewsGuard reported.
AI chatbots repeated falsities laundered by the Pravda network 33% of the time, according to the organization. Not only did all 10 of the chatbots tested repeat the provided disinformation, seven of them cited specific articles from Pravda as their sources.
A February 2025 report noted that the network may have been custom-built to flood LLMs with pro-Russia content. The report added that the network is unfriendly to human users, and sites within the network have no search function, poor formatting and unreliable scrolling, among other usability issues.
How to Prevent AI Poisoning in Custom Models
Anthropic's research revealed a big problem. But the question is, what should organizations do about it?
“For enterprises, mitigation starts with visibility and control,” said Richard Blech, co-founder and CEO of XSOC Corp. He recommended:
- Enforce Data Provenance. Know where every dataset originates and whether it’s been cryptographically signed or verified.
- Anchor Inference Pipelines. Apply cryptographic checksums or secure attestations at each model hop.
- Segment Trust Zones. Separate model training, inference and feedback channels to prevent recursive contamination.
- Adopt Adversarial-Aware Ingestion. Test not just for data quality, but for manipulation potential.
“Poisoning is most feasible when data pipelines ingest from the open web with weak provenance,” said Josh Swords, head of data and AI engineering at Aiimi. “It is much harder against curated, permissioned, auditable data. So for organizations training their own models, they need to take great care assembling their data.”
How to Address LLM Poisoning in Pre-Trained Models
But strategically assembling data is easier said than done.
“Most data processing pipelines on internet scale use heuristics or other models for filtering,” Swords said. “For the majority of firms that use pre-trained models, the poisoning occurs upstream beyond their control. This is difficult to overcome.”
Retrieval-Augmented Generation
Using retrieval-augmented generation (RAG) is one solution, he explained. “But poisoning goes well beyond text, to images and other modalities. These are much harder to detect, as they can look perfectly normal to the naked eye.”
Third-Party Tools
Researchers and vendors are also developing solutions to detect AI poisoning. PoisonBench, for instance, is an open-source benchmark that aims to test LLMs’ vulnerability to poisoning. Fazl Barez, one of the project's researchers, explained that all aspects of the tool are publicly available, and they're happy to help people test their models.
Secure Cognitive Layering
Blech’s company is working on what it calls secure cognitive layering. Unlike traditional LLM hardening, which attempts to secure models after the fact, secure cognitive layering assumes AI systems will be subject to recursive inference attacks — adversarial learning loops where models extract, distort or infer sensitive patterns from data or other models.
“Our framework ensures that every layer of the cognitive architecture, data ingestion, context formation and inference execution, is cryptographically anchored, authenticated and traceable," said Blech.
Related Article: AI Cyber Threats Are Escalating. Most Companies Are Still Unprepared
Security Experts Say They're 'Playing Catch-Up'
As in any security scenario, defenders have to be perfect all the time. Attackers only have to succeed once.
“These ideas are pushing the tide back,” said Michael Morgenstern, partner at DayBlink Consulting. “You can’t ‘just do this and we’re safe again.’ We’re back to data security 101 that we’ve sort of ignored for the past three years."
Organizations still need human-based verification for anything important, he added. But we're still caught in a cycle where excitement pushes capability faster than security. "Security always plays catch-up. We’re playing catch-up.”