Feature

The 5 Tasks Where AI Is Most Likely to Make Things Up

6 minute read
By Nelly Njeri
The biggest AI hallucination risks hide in everyday work.

Over one billion people now use AI in their everyday lives, but the tech's most enduring flaw continues to dog it. AI hallucinations — when a model generates information that’s confidently stated but factually incorrect — cost businesses an estimated $67.4 billion in 2024 alone, according to Forrester's 2025 Enterprise AI Cost Analysis. 


Despite research showing that the rate of hallucinations is growing, nearly half of executives in a 2025 Deloitte survey admitted to making decisions based on unverified AI data. 

The accuracy problem, however, is not uniform across tasks. A model that performs well in summarizing a news article may completely fabricate a court ruling or invent a medication dosage. Understanding which tasks carry the highest risk is the first step in using AI without getting burned. A 2026 report from Open Resource Applications, a company focused on free, human-verified AI tools, examined common tasks people assign to AI and their associated hallucination vulnerability, and the findings are quite striking.

Here are the five tasks where AI is most likely to get things wrong.


1. Mathematical Calculations (Accuracy: 0.38/1)

People trust AI with numbers, but the report found that AI scores just 0.38 out of 1 on mathematical accuracy, meaning roughly three in five calculations are wrong. No other category scored lower.

The reason for this dangerous mismatch is structural. Large language models (LLMs) are designed to predict the most likely next token in a sequence, not to compute. When a language model “does math,” it's simply pattern-matching against examples seen during training and not actually running any arithmetic operation. Everything outside those patterns is a confident guess.


According to the report, "Mathematics or medical fields can use AI only with professionals nearby who can check the work. Otherwise, users may end up with completely wrong data."

What to Do Instead

Rather than handing AI a math problem and stepping away, use it to frame the problem or explain a concept. Then run the actual calculation in Wolfram Alpha, a spreadsheet or a similar tool, and cross-verify any figures AI generates rather than accepting them at face value.
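As a minimal sketch of that cross-check, the Python snippet below recomputes a figure with exact decimal arithmetic and compares it with the number an AI tool reported. The function name matches_claim, the tolerance and the sample loan figures are illustrative assumptions, not part of the report.

```python
from decimal import Decimal


def matches_claim(recomputed: Decimal, claimed: Decimal,
                  tolerance: Decimal = Decimal("0.01")) -> bool:
    """Return True if the AI-reported figure is within tolerance of our own calculation."""
    return abs(recomputed - claimed) <= tolerance


# Hypothetical example: 7.5% annual interest on 12,400 for one year.
principal = Decimal("12400")
rate = Decimal("0.075")
recomputed_interest = principal * rate  # exact decimal arithmetic, equals 930

# Suppose a chatbot answered 903 (two digits transposed); the check catches it.
print(matches_claim(recomputed_interest, Decimal("930")))  # True
print(matches_claim(recomputed_interest, Decimal("903")))  # False
```

The point is that the model's number is only ever compared against a calculation done by a tool built for arithmetic; it is never the source of the figure.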


2. Legal Research and Queries

The stakes are uniquely high when it comes to law. While not included in the Open Resource Applications report, a study from Stanford found LLMs hallucinate between 69% and 88% of the time on sample legal queries. On queries about a court’s written rationale for its judgment, LLMs hallucinated at least 75% of the time.

In a separate Stanford study, researchers asked various LLMs about legal precedents. The models collectively invented over 120 non-existent court cases, complete with realistic-sounding names and fabricated legal reasoning.

The pattern has already surfaced in court. In the notorious Mata v. Avianca case, for example, lawyers submitted AI-generated citations to cases that did not exist, leading to sanctions.

What to Do Instead

Never file or rely on any AI-generated legal citation without checking it directly against the actual court document. Make good use of retrieval-augmented generation (RAG) systems that pull from verified legal databases rather than relying on a model's internal knowledge.
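One way to make that check routine is to compare every citation in a draft against a list exported from a trusted legal database before anything is filed. The Python sketch below assumes such a list exists; VERIFIED_CITATIONS, normalize and unverified_citations are hypothetical names, and the second entry in the draft is a made-up placeholder standing in for a fabricated case.

```python
def normalize(citation: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't matter."""
    return " ".join(citation.lower().split())


def unverified_citations(draft_citations, verified_citations):
    """Return the citations from the draft that are absent from the verified list."""
    verified = {normalize(c) for c in verified_citations}
    return [c for c in draft_citations if normalize(c) not in verified]


# Assumption: this list was exported from a trusted legal database.
VERIFIED_CITATIONS = ["Mata v. Avianca, Inc."]

draft = [
    "Mata v.  Avianca, Inc.",           # real case, stray extra space in the draft
    "Example v. Placeholder Airlines",  # made-up name standing in for a fabricated case
]

print(unverified_citations(draft, VERIFIED_CITATIONS))
# ['Example v. Placeholder Airlines']
```

A match here only shows that a case with that name exists. It does not show that the case says what the model claims, so reading the actual court document remains the final step.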

3. Data Analysis (Accuracy: 0.52/1)

AI gets the right answer for analytics only 52% of the time, according to the Open Resource Applications report. That leaves coin-flip odds about data interpretation. The problem boils down to how LLMs are designed to handle incomplete and ambiguous datasets.

Rather than flagging gaps, a model prioritizes producing a fluent, logical-sounding response, which means filling in what it does not know with what seems probable.

This has real consequences, especially in areas like finance. One 2025 study found that AI chatbots hallucinated dozens of finance-related Wikipedia pages. One instance impacted 2,847 client accounts and cost $3.2 million in remediation.

What to Do Instead

Feed AI clean, structured, and complete datasets rather than raw or partial data. Prompt the model to state where it is uncertain or where data is missing, rather than asking it to simply produce output. Cross-validate findings against the original dataset manually.
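The manual cross-validation step can be as simple as recomputing a headline number from the raw data and counting the gaps the model may have papered over. The following Python sketch uses invented revenue figures, and check_reported_mean is a hypothetical helper, not something from the report.

```python
from statistics import mean


def check_reported_mean(values, reported_mean, tolerance=0.01):
    """Recompute a mean from the raw values and report how many entries were missing."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    actual = mean(present)  # raises StatisticsError if every value is missing
    return {
        "actual_mean": actual,
        "missing_values": missing,
        "matches_report": abs(actual - reported_mean) <= tolerance,
    }


# Hypothetical monthly revenue figures; None marks a month with no data.
monthly_revenue = [120.0, 135.0, None, 150.0, 145.0, None]

# Suppose an AI summary stated an average of 140.0 without mentioning the gaps.
result = check_reported_mean(monthly_revenue, reported_mean=140.0)
print(result)
# {'actual_mean': 137.5, 'missing_values': 2, 'matches_report': False}
```

Surfacing the count of missing values alongside the recomputed figure makes it obvious when a fluent summary has quietly filled in what the data does not contain.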

4. Health, Medical and Fitness Advice (Accuracy: 0.67/1)

One 2025 study analyzed 300 physician-validated clinical cases and found a 64.1% hallucination rate on long cases without mitigation prompts. When structured prompting was introduced, the rate fell only to 43.1%, still more than two wrong answers in every five.
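To put those two percentages in perspective, the short Python sketch below converts them into an absolute and a relative reduction. Only the two rates come from the study as quoted here; the function name reduction is an illustrative assumption.

```python
def reduction(before_pct: float, after_pct: float):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before_pct - after_pct
    relative = absolute / before_pct * 100
    return round(absolute, 1), round(relative, 1)


# Hallucination rates on long cases, without and with structured prompting.
absolute_drop, relative_drop = reduction(64.1, 43.1)
print(absolute_drop)  # 21.0 (percentage points)
print(relative_drop)  # 32.8 (percent fewer hallucinations)
```

In other words, structured prompting removed about a third of the hallucinations in that study and left the rest in place.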

For open-source models, the picture was even worse: hallucination rates exceeded 80% in medical scenarios. In the Open Resource Applications benchmark report, health, fitness and self-care questions scored among the highest in accuracy, with a 0.67/1 rating.


One of the biggest challenges is that medical information is vast, frequently updated and highly context-dependent. That means a response that is accurate for a healthy adult may be dangerous for someone with a specific condition or medication history. This is a layer of context most users fail to provide, and most AI systems don’t ask for it.

What to Do Instead

If you are a healthcare provider, structured prompts that include patient context can reduce hallucination rates, though not eliminate them: in the study cited above, the rate fell from 64.1% to 43.1%. Any AI-generated clinical information should also be double-checked against peer-reviewed sources before being used to inform patient care.

5. Tutoring, Teaching & Niche Factual Queries (Accuracy: 0.67/1)

Teaching and tutoring questions, as well as highly specific factual questions, each achieve scores of 0.67/1 in the Open Resource Applications report. Two issues intersect here:

  1. Many education-related questions demand not just true answers, but pedagogically sound explanations, which most AI models are not reliably designed to provide.
  2. Niche questions often live in the long tail of any model's training data distribution, where a measure of confidence-based guessing takes the place of actually knowing.

As the report put it, “Teaching is 100% about giving students correct information, and right now, most AIs cannot achieve that. LLMs' output is often wrong when the data given to it is incomplete, or when the larger context is required.” 

The risk is further compounded for students or researchers who rely on AI to find academic references. For instance, a 2026 paper revealed that even state-of-the-art models such as GPT-4o still suffer from 15-20% hallucination rates on citation tasks, rising to 35-55% when the topic is new or niche-specific.

What to Do Instead

When using AI for learning or research, treat its output as a starting point, not a conclusion. Ask the model to explain its reasoning and, wherever possible, point you to specific sources you can then check independently. For academic work, verify every citation directly.


What Makes Hallucinations Most Likely to Occur 

Lynette Sciolla Zulu, an AI & digital strategy expert known for curating strategic insights at the intersection of AI, business and African market growth, offered a clear framework for thinking about hallucination-prone tasks. “Hallucinations are most likely when you ask AI to produce something specific, a statistic, a source, a name, without giving it a verified document to draw from. Rather than flag a gap it picks up, the tool typically fills it with output that looks and reads like a real answer, because it's trained to produce an outcome to your prompt.”

AI is reliable, she added, for generative drafting, summarizing and restructuring material you provide. However, it’s largely unreliable for retrieval-based tasks.

“If you are in the second category, treat the output as a starting point that needs checking, and build the habit of prompting caution explicitly, for example: 'ensure that any sources this references are reliable, authentic and not derived from synthetic data.'”

This delineation of generative cases from retrieval-based ones is one of the most useful heuristics we've got for everyday use. If you're asking the model to work with material you have given it, the risk is lower. However, if you're asking it to retrieve something from its own training memory, that's where hallucinations thrive.


The Bottom Line

Knowledge workers now spend an average of 4.3 hours per week fact-checking AI outputs, according to Forrester's 2025 enterprise AI cost analysis. That time cost alone should reframe how organizations think about AI deployment: not as a finished product, but as a capable first draft that still needs a human with domain expertise to review.

The five tasks above are not edge cases. They rank high among what people ask AI to do every day. Mathematical calculations, legal research, data analysis, medical advice and teaching questions all run a meaningful risk of producing outputs that sound right but are actually wrong.

No model has solved hallucinations. The top-performing model on average across the five tasks evaluated in the Open Resource Applications report was Gemini 3 Pro (Preview). Still, it too will make mistakes at scale if used pervasively across all question-answering tasks.

The mitigation is not a better model alone; it’s a better process. Verification steps, anchored retrieval systems, explicit decomposition into sub-questions and human backup are not window dressing or a one-time engineering fix. They should be functioning parts of the system today.

FAQs About Hallucination-Prone AI Tasks

The biggest red flag is specificity without evidence. Be cautious when AI gives exact numbers, legal citations, medical claims, academic references, product specs or source names without linking to verifiable documents. Hallucinations often sound confident, which is exactly why they’re easy to miss.

A good test: ask the model, “What source did that information come from?” Then check the source yourself. If the source does not exist or does not say what the model claims, treat the output as unreliable.

Prompts that ask AI to retrieve facts from memory are much riskier than prompts that ask it to work with information you provide.

For example, “Summarize the uploaded report” is generally safer than “Find the latest statistics on AI adoption.” The first task is grounded in supplied material. The second asks the model to recall or seek out information.

High-risk prompt types include:

  • “Give me statistics about…”
  • “What does the law say about…”
  • “Diagnose this symptom…”
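As a rough illustration of this grounded-versus-recalled distinction, a team could tag prompts before they reach a model. The Python sketch below is only a heuristic under stated assumptions: risk_level, the has_source_material flag and the three labels are hypothetical, and the patterns are copied from the prompt types listed above rather than from any validated classifier.

```python
import re

# Patterns taken from the high-risk prompt types listed above.
HIGH_RISK_PATTERNS = [
    r"\bgive me statistics about\b",
    r"\bwhat does the law say about\b",
    r"\bdiagnose this symptom\b",
]


def risk_level(prompt: str, has_source_material: bool) -> str:
    """Label a prompt 'lower', 'elevated' or 'high' risk for hallucination."""
    if has_source_material:
        return "lower"  # the model works with supplied material
    text = prompt.lower()
    if any(re.search(pattern, text) for pattern in HIGH_RISK_PATTERNS):
        return "high"  # matches one of the listed high-risk prompt types
    return "elevated"  # any other request answered from the model's memory


print(risk_level("Summarize the uploaded report", has_source_material=True))
# lower
print(risk_level("Give me statistics about AI adoption", has_source_material=False))
# high
print(risk_level("Find the latest statistics on AI adoption", has_source_material=False))
# elevated
```

Keyword matching like this will miss paraphrases, so the label is best used to decide how much checking an answer needs, not whether it can be trusted.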

Better prompting can reduce hallucinations, but it can't eliminate them. Prompts like “Say you don’t know if you aren’t sure” or “Only answer using the sources provided” help, but models can still fabricate details or overstate weak evidence.

The better approach is to pair prompting with process: source grounding, citation checks, human review, tool-based calculation and clear rules for when AI output cannot be used without verification.

AI systems typically cannot tell you when they are wrong. They don't “know” in the human sense; they generate likely responses based on patterns. That means a model can present a wrong answer with the same confidence and tone it uses for a correct one.

This is why confidence is a poor signal of accuracy. A model saying “certainly” or “based on the evidence” does not mean the evidence exists.

Sources are especially prone to hallucination because models are trained to produce text that looks like an answer. Academic papers, legal cases and statistics follow recognizable patterns: author names, journal titles, dates, case names, docket numbers, URLs. The model can imitate that structure without having retrieved a real source.


About the Author
Nelly Njeri

Nelius Njeri is a content strategist who partners with marketing leaders across lifestyle, fintech, B2B SaaS and professional services to deliver blog content, website copy and thought leadership, without losing the human voice.

Main image: rolffimages | Adobe Stock