Why Do Large Language Models Hallucinate and How Can We Fix It?
This article explains why large language models produce plausible‑looking but false information, traces the problem to the supervised fine‑tuning stage, and outlines mitigation techniques such as knowledge interrogation, RLHF, and tool‑augmented search to reduce hallucinations.
You may have encountered model hallucinations—instances where LLMs generate erroneous, misleading, or completely fabricated information that appears plausible. Hallucinations occur because LLMs do not truly "know" facts; they predict words based on patterns learned from training data. Early models suffered severely, but mitigation strategies have improved the situation, though hallucinations are not fully eliminated.
When prompted with "Who is Zyler Vance?" the falcon-7b-instruct model fabricated a fictional answer, illustrating how older models are prone to hallucination.
LLM Training Process
Understanding hallucinations requires familiarity with the LLM training pipeline, which typically consists of three main stages:
Pretraining
Supervised fine‑tuning (SFT)
Reinforcement learning from human feedback (RLHF)
Pretraining
During pretraining, the model is exposed to massive, high‑quality, diverse text scraped from the internet, learning general language patterns, grammar, and facts. The output of this stage is the base model, a token predictor that estimates the next word in a sequence.
Datasets such as the FineWeb collection exemplify the type of data used for pretraining; major LLM providers maintain internal equivalents.
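To make the "token predictor" framing concrete, here is a minimal sketch of how a base model turns raw scores (logits) into a probability distribution over candidate next tokens. The tokens and logit values below are invented for illustration, not taken from any real model:

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Toy logits a base model might assign to candidate next tokens
# after a prefix like "The capital of France is" (values are made up).
logits = {"Paris": 9.1, "Lyon": 4.2, "the": 3.0, "a": 2.5}
probs = softmax(logits)

# The model does not "know" the fact; it merely ranks continuations
# by how statistically likely they are given the training data.
best = max(probs, key=probs.get)
print(best)
```

The key point: even a wildly wrong continuation gets nonzero probability, which is exactly the opening for hallucination.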
Supervised Fine‑Tuning
The base model, which merely mimics internet text, is further optimized on dialogue datasets containing hundreds of thousands of multi‑turn, wide‑topic conversations. Human annotators write ideal assistant responses for each dialogue turn, often after researching the correct answer.
One such open‑source dataset is OpenAssistant/oasst1, which contains 161,443 messages in 35 languages.
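For illustration, an SFT training record in the widely used chat format might look like the following. The field names ("role", "content") follow a common convention, not the exact oasst1 schema, and the dialogue itself is invented:

```python
# A hypothetical SFT training example in a common chat format.
sft_example = {
    "messages": [
        {"role": "user", "content": "Who wrote 'Pride and Prejudice'?"},
        {"role": "assistant",
         "content": "Jane Austen wrote 'Pride and Prejudice', published in 1813."},
    ]
}

# During SFT, the loss is typically computed only on the assistant
# turns, teaching the base model to imitate the annotator's answer.
assistant_turns = [m for m in sft_example["messages"]
                   if m["role"] == "assistant"]
print(len(assistant_turns))
```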
Reinforcement Learning from Human Feedback (RLHF)
Even after SFT, models can produce misleading, biased, or unhelpful answers. RLHF aligns model outputs with human preferences. Multiple model outputs are generated for a prompt, then human annotators rank or score them based on quality, safety, and alignment. These rankings train a separate reward model—a neural network that simulates human preferences.
The LLM is then fine‑tuned via reinforcement learning, using the reward model to provide feedback and maximize the reward signal. This steers the model toward answers humans prefer, ideally ones grounded in knowledge the model actually has.
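As a sketch of how a reward model can be trained from human rankings, a common choice (assumed here; the article does not specify the loss) is a pairwise Bradley‑Terry style objective, where the loss shrinks as the reward of the human‑preferred answer exceeds that of the rejected one:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_loss(reward_chosen, reward_rejected):
    """Pairwise ranking loss: small when the reward model agrees
    with the human ranking, large when it disagrees."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))

# Toy reward scores for two sampled answers (numbers are illustrative).
loss_agree = pairwise_loss(2.0, -1.0)     # model agrees with annotators
loss_disagree = pairwise_loss(-1.0, 2.0)  # model disagrees

print(loss_agree < loss_disagree)
```

Training on many such ranked pairs produces the reward model that then scores outputs during the reinforcement learning phase.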
Why Do Hallucinations Occur?
Hallucinations primarily originate in the supervised fine‑tuning stage. Training data may contain confident but incorrect answers, and during inference the model statistically imitates these patterns, often fabricating plausible‑looking content when faced with unseen queries.
Model Interrogation
To mitigate hallucinations, datasets should include examples where the correct answer is "I don’t know." Determining what a model knows requires interrogating it empirically. Meta’s Llama 3 series employs a knowledge‑interrogation technique: they extract data fragments from pretraining corpora, generate factual questions about them, sample Llama 3’s answers, and then evaluate correctness and informativeness using the original context and the model itself as a judge. Incorrect yet informative answers are used to train the model to refuse or defer.
Extract a data fragment from pretraining data.
Prompt Llama 3 to generate a factual question about the fragment.
Sample Llama 3’s answer to the question.
Score the answer’s correctness using the original context and Llama 3 as evaluator.
Score the answer’s informativeness similarly.
For consistently informative but wrong answers, generate a refusal response.
The resulting data teach the model to answer only when it is confident and to decline queries it is uncertain about, gradually reducing hallucinations.
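The interrogation steps above can be sketched as a loop. The `llm_*` helpers are hypothetical stand‑ins for real calls to the model being interrogated and to the model‑as‑judge; here they are stubbed with trivial logic so the control flow can run:

```python
def llm_answer(question):
    # Stub: pretend the model answers one question confidently but wrongly.
    return "1875" if "telephone" in question else "I don't know."

def llm_judge_correct(answer, context):
    # Stub for model-as-judge correctness scoring against the fragment.
    return answer in context

def llm_judge_informative(answer):
    # Stub for model-as-judge informativeness scoring.
    return answer != "I don't know."

def build_refusal_examples(fragments_with_questions):
    """For answers that are informative but wrong, emit a refusal target
    that can be added to the fine-tuning data."""
    training_examples = []
    for context, question in fragments_with_questions:
        answer = llm_answer(question)
        correct = llm_judge_correct(answer, context)
        informative = llm_judge_informative(answer)
        if informative and not correct:
            training_examples.append((question, "I'm not sure about that."))
    return training_examples

data = [("The telephone was patented in 1876.",
         "When was the telephone patented?")]
print(build_refusal_examples(data))
```

In a real pipeline each stub would be a sampled model call, and the refusal text would itself be generated rather than hard‑coded.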
Using Web Search
Beyond saying "I don’t know," a more effective strategy is to let the LLM retrieve factual information via web search. By introducing special tokens such as <SEARCH_START> and <SEARCH_END>, the model can signal when it needs to perform a search, pause generation, query a search engine, and insert retrieved text into its context window, effectively refreshing its working memory.
When the model emits <SEARCH_START>, it halts token generation, sends the enclosed query to a search engine, retrieves results, and places the retrieved passages into its context. Subsequent tokens can then directly reference this concrete information.
Training the model to use such tools requires large datasets with many dialogue examples demonstrating when and how to invoke search, what the search syntax looks like, and how to incorporate results.
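A minimal sketch of how an inference runtime might intercept these special tokens and splice retrieved text back into the context. `run_search` is a hypothetical stand‑in for a real search‑engine call, and the example query echoes the article's earlier "Zyler Vance" prompt:

```python
SEARCH_START, SEARCH_END = "<SEARCH_START>", "<SEARCH_END>"

def run_search(query):
    # Stub: a real system would query a search engine here.
    return f"[retrieved passage about: {query}]"

def expand_search_calls(generated_text):
    """Replace each emitted search span with the retrieved passage,
    refreshing the context the model continues generating from."""
    out = generated_text
    while SEARCH_START in out:
        start = out.index(SEARCH_START)
        end = out.index(SEARCH_END, start)
        query = out[start + len(SEARCH_START):end].strip()
        out = out[:start] + run_search(query) + out[end + len(SEARCH_END):]
    return out

draft = f"Zyler Vance is {SEARCH_START} Who is Zyler Vance? {SEARCH_END} per sources."
print(expand_search_calls(draft))
```

In practice the runtime would pause sampling at the search token, run the query, and resume generation with the retrieved passages prepended to the remaining context rather than post‑processing a finished string.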
Conclusion
Hallucinations are an inherent outcome of the LLM training pipeline, especially stemming from the supervised fine‑tuning stage. While early models suffered heavily, mitigation techniques like knowledge interrogation and tool‑augmented search have significantly reduced hallucinations, though completely eliminating them remains an ongoing challenge essential for trustworthy AI.
Original article: https://medium.com/ai-advances/llm-hallucinations-a95e341d5a7e
Code Mala Tang
Read source code together, write articles together, and enjoy spicy hot pot together.