Why Do Language Models Hallucinate? Roots, Risks, and a New Evaluation Approach
The article analyzes OpenAI's study on language‑model hallucinations, explaining how statistical limits in pre‑training and flawed binary evaluation incentives cause false answers, and proposes a confidence‑threshold scoring system that rewards honest "I don’t know" responses to improve reliability.
In psychology, “hallucination” describes the brain’s tendency to fill gaps with plausible but unfounded details; language‑model hallucination is analogous: models prioritize logical coherence and common sense over factual accuracy, sometimes producing convincing yet incorrect outputs.
Blog: https://openai.com/index/why-language-models-hallucinate/
Paper: https://cdn.openai.com/pdf/d04913be-3f6f-4d2b-b283-ff432ef4aaa5/why-language-models-hallucinate.pdf
Pre‑training Hallucination Sources
During pre‑training, models learn a probability distribution over language from massive corpora. The study shows that generation tasks are harder than simple validity judgments. Using an “Is‑It‑Valid (IIV)” binary task, they derive the bound:
Generation error rate ≥ 2 × IIV error rate
Thus, even tiny classification errors are amplified in free‑form generation, creating hallucinations.
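To make the arithmetic concrete, here is a minimal sketch of the factor-of-two amplification (the paper's formal bound carries additional correction terms; the function name is mine):

```python
# Minimal sketch: the IIV reduction implies generation error >= 2 x IIV error.
# The paper's formal statement includes extra correction terms omitted here.
def generation_error_floor(iiv_error_rate: float) -> float:
    """Lower bound on free-form generation error implied by an IIV error rate."""
    return 2 * iiv_error_rate

for iiv in (0.01, 0.05, 0.20):
    print(f"IIV error {iiv:.0%} -> generation error >= {generation_error_floor(iiv):.0%}")
```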
Statistical complexity: When training data lack learnable patterns, models face epistemic uncertainty. Rare “singleton facts” (e.g., an obscure person’s birthday) appear too infrequently to be cross‑checked, so the model guesses from statistical cues, raising hallucination risk (see the sketch below).
Model architecture limits: Early n‑gram models missed long‑range dependencies; modern deep models, while powerful, still struggle to memorize rare facts or handle cross‑domain knowledge.
Data quality issues: Large corpora inevitably contain factual errors, misinformation, or low‑quality text, which the model can reproduce during generation.
Consequently, hallucinations in the pre‑training phase are statistically inevitable rather than anomalies.
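As a rough illustration of the singleton-fact point above, the sketch below counts facts that appear exactly once in a toy, entirely hypothetical corpus; that fraction is a crude floor on how often a model must guess about such facts:

```python
# Rough illustration with a hypothetical toy corpus of (entity, attribute, value)
# facts: facts seen exactly once give the model nothing to cross-check, so the
# singleton fraction is a crude floor on how often it must guess about them.
from collections import Counter

training_facts = [
    ("Person A", "birthday", "March 3"),    # appears once -> singleton
    ("Paris", "capital_of", "France"),
    ("Paris", "capital_of", "France"),
    ("Everest", "height_m", "8849"),
    ("Everest", "height_m", "8849"),
    ("Person B", "birth_year", "1871"),     # appears once -> singleton
]

occurrences = Counter(training_facts)
singleton_fraction = sum(1 for n in occurrences.values() if n == 1) / len(occurrences)
print(f"Singleton fraction of distinct facts: {singleton_fraction:.0%}")  # 50%
```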
Evaluation Incentives Exacerbate Hallucination
The paper notes that fine‑tuning and alignment alone do not eliminate hallucinations, in part because common evaluation benchmarks (e.g., MMLU‑Pro, GPQA, SWE‑bench) use binary scoring: correct = 1, incorrect or "I don’t know" = 0.
This "right‑or‑wrong" scheme pushes models to guess when uncertain, because a model that always answers scores higher than one that honestly admits uncertainty. Even benchmarks that allow an IDK option (e.g., WildBench) often penalize the IDK choice through poorly designed rules, leading models to favor guesses.
Other Hallucination Causes
Computational hardness: Some queries are intractable for any system (e.g., decrypting a ciphertext without the key), so a model forced to answer them will inevitably hallucinate.
Distribution shift: When test inputs differ significantly from training data, models may answer common‑sense questions incorrectly (e.g., comparing the weight of a pound of feathers versus a pound of lead).
Data noise: Persistent false facts and misinformation in the corpus appear in outputs, even with alignment or reinforcement‑learning techniques.
Retrieval‑augmented generation (RAG) cannot fully prevent hallucinations; failed retrieval combined with binary scoring still encourages guessing.
Solution: Adjust Evaluation Incentives
The authors propose changing the scoring rule rather than creating a separate hallucination benchmark: each evaluation states a confidence threshold t in its instructions, and the model should answer only when its confidence exceeds t. Scoring then works as follows:
Correct answer: +1 point.
Incorrect answer: – t/(1‑t) points.
IDK answer: 0 points.
Under this rule, saying "I don’t know" is a rational, non‑penalized choice, forcing the model to calibrate its confidence before responding.
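A minimal sketch of the rule (function and variable names are mine; the +1 / −t/(1−t) / 0 payoffs follow the article) shows that answering only pays off when confidence exceeds t:

```python
# Confidence-threshold scoring: correct = +1, wrong = -t/(1-t), IDK = 0.
# Names are illustrative; payoffs follow the rule described above.
def expected_score_of_answering(confidence: float, t: float) -> float:
    """Expected payoff of answering when the model believes it is correct
    with probability `confidence`; abstaining always scores 0."""
    wrong_penalty = -t / (1.0 - t)
    return confidence * 1.0 + (1.0 - confidence) * wrong_penalty

t = 0.75
for confidence in (0.60, 0.75, 0.90):
    ev = expected_score_of_answering(confidence, t)
    decision = "answer" if ev > 0 else "say 'I don't know' (score 0)"
    print(f"confidence {confidence:.2f}: expected score {ev:+.2f} -> {decision}")
# Answering beats abstaining exactly when confidence exceeds t = 0.75.
```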
Advantages:
Value for uncertainty: IDK becomes a legitimate option rather than a punished blank.
Compatibility with existing benchmarks: The new rule can be added to current suites like SWE‑bench without redesign.
Promotes calibrated behavior: Model accuracy aligns with its stated confidence, improving overall reliability (a quick check is sketched below).
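One way to see what calibrated behavior looks like in practice (hypothetical answers and a 0.1-wide bucketing choice of mine): group answers by stated confidence and compare each bucket's accuracy with the confidence it claimed.

```python
# Hypothetical calibration check: within each confidence bucket, accuracy
# should roughly match the stated confidence.
predictions = [  # (stated confidence, was the answer correct?)
    (0.95, True), (0.90, True), (0.90, False), (0.80, True),
    (0.60, True), (0.55, False), (0.50, False), (0.30, False),
]

buckets: dict[float, list[bool]] = {}
for confidence, correct in predictions:
    bucket = round(confidence, 1)              # 0.1-wide buckets
    buckets.setdefault(bucket, []).append(correct)

for bucket in sorted(buckets):
    outcomes = buckets[bucket]
    accuracy = sum(outcomes) / len(outcomes)
    print(f"stated ~{bucket:.1f}: accuracy {accuracy:.2f} over {len(outcomes)} answers")
```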
Why "I Don’t Know" Is Progress
Allowing models to admit ignorance demonstrates two key abilities:
Metacognition: Awareness of its own knowledge limits.
Responsible communication: In high‑risk domains (medical, legal), a wrong answer is far more harmful than an honest "I don’t know".
Reforming evaluation to reward honest uncertainty helps models move beyond test‑taking strategies, leading to safer, more trustworthy AI systems.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
