How to Make LLMs Recognize and Resolve Their Own Uncertainty
This article introduces ConfuseBench, a benchmark that classifies LLM uncertainty into document‑missing, ability‑limited, and ambiguous types, and presents methods—including retrieval, chain‑of‑thought, and clarification—to detect and actively resolve uncertainty, improving answer quality across diverse tasks.
Background
Large language models (LLMs) have achieved impressive performance in text generation, QA, code writing, retrieval, and tool use, but they often exhibit over‑confidence on questions that fall outside their knowledge. Existing work typically adopts the conservative strategy of replying "I don't know" [2][3], which forfeits the chance to obtain a better answer when the uncertainty is in fact resolvable.
We distinguish three uncertainty types: document‑missing (key facts absent), ability‑limited (question complexity exceeds model capability), and ambiguous (question is unclear or has multiple interpretations). Models should first identify the source of uncertainty and then apply targeted strategies such as retrieval‑augmented generation (RAG), chain‑of‑thought (CoT), or clarification.
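To make the taxonomy concrete, here is a minimal sketch of how a system might route a query once its uncertainty source has been diagnosed; `diagnose`, `llm`, `retrieve`, and `ask_user` are hypothetical callables for illustration, not ConfuseBench APIs.

```python
from enum import Enum

class Uncertainty(Enum):
    DOCUMENT_MISSING = "document_missing"  # key facts absent from context
    ABILITY_LIMITED = "ability_limited"    # question exceeds model capability
    AMBIGUOUS = "ambiguous"                # unclear or multi-interpretation query

def resolve(query: str, diagnose, llm, retrieve, ask_user) -> str:
    """Route a query to a targeted strategy based on the diagnosed
    uncertainty source (all helpers are hypothetical callables)."""
    source = diagnose(query)
    if source is Uncertainty.DOCUMENT_MISSING:
        # Retrieval-augmented generation: fetch the missing evidence first.
        docs = retrieve(query)
        return llm(f"Answer using these documents:\n{docs}\n\nQuestion: {query}")
    if source is Uncertainty.ABILITY_LIMITED:
        # Chain-of-thought: decompose the problem step by step.
        return llm(f"Think step by step, then answer: {query}")
    # Ambiguous: ask one clarification question, then answer with the reply.
    reply = ask_user(llm(f"Ask one clarifying question for: {query}"))
    return llm(f"Question: {query}\nClarification: {reply}\nAnswer:")
```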
Dataset Construction
2.1 Construction Method
We build ConfuseBench to systematically evaluate LLM uncertainty. Based on the three uncertainty categories, we create cases in three typical scenarios:
Basic QA: uses HotpotQA and AmbigQA to test knowledge‑intensive tasks.
Assistant: employs ExpertQA and TechQA to simulate real‑world AI‑assistant interactions.
Tool Usage: leverages ToolBench to assess reasoning and decision‑making when external tools are required.
For document‑missing and ability‑limited cases, we inject noise at the document level (randomly removing parts of the reference) or at the retrieval level (adding distractors). If the model fails to answer correctly only after the noise is applied, the uncertainty is labeled document‑missing; a failure that persists even with the full reference points to an ability limit instead.
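A minimal sketch of the two noise channels, assuming the reference is a list of sentences and distractors come from unrelated documents; the rates below are illustrative, not the benchmark's actual settings.

```python
import random

def document_level_noise(reference: list[str], drop_rate: float = 0.3,
                         rng: random.Random | None = None) -> list[str]:
    """Randomly remove a fraction of the reference sentences so that
    key facts may be absent (document-missing construction)."""
    rng = rng or random.Random(0)
    return [s for s in reference if rng.random() >= drop_rate]

def retrieval_level_noise(retrieved: list[str], distractors: list[str],
                          n_distractors: int = 2,
                          rng: random.Random | None = None) -> list[str]:
    """Mix distractor passages into the retrieved set and shuffle,
    diluting the relevant evidence rather than deleting it."""
    rng = rng or random.Random(0)
    noisy = retrieved + rng.sample(distractors, k=n_distractors)
    rng.shuffle(noisy)
    return noisy
```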
For ambiguous cases, we parse queries into Abstract Meaning Representation (AMR) graphs, then apply noise injection (deleting modifiers, masking key information, perturbing relations) and regenerate natural‑language queries together with the corresponding clarification texts.
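The graph‑noise operations can be illustrated on a toy triple representation; a real pipeline would parse with an AMR parser and regenerate text with an LLM, both elided here, and the `:mod` convention below is only a simplified stand‑in for AMR's role labels.

```python
import random

# Toy stand-in for an AMR graph: a list of (source, relation, target) triples.
Triple = tuple[str, str, str]

def perturb_graph(triples: list[Triple], rng: random.Random) -> list[Triple]:
    """Apply one of the three noise operations used to manufacture an
    ambiguous query: drop a modifier, mask a key node, or perturb a relation."""
    out = list(triples)
    if not out:
        return out
    op = rng.choice(["delete_modifier", "mask_node", "perturb_relation"])
    if op == "delete_modifier":
        mods = [i for i, (_, rel, _) in enumerate(out) if rel == ":mod"]
        if mods:
            del out[rng.choice(mods)]     # e.g. drop "cheap" from "cheap hotel"
    elif op == "mask_node":
        i = rng.randrange(len(out))
        src, rel, _ = out[i]
        out[i] = (src, rel, "<unk>")      # hide a key entity or value
    else:
        i = rng.randrange(len(out))
        src, _, tgt = out[i]
        out[i] = (src, ":mod", tgt)       # corrupt the relation label
    return out
```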
2.2 Data Validation
We evaluated mainstream LLMs (GPT‑4o, DeepSeek‑V3, Llama‑3, Qwen‑2.5) on ConfuseBench. The Uncertainty Classification Accuracy (UCA) of even the strongest models hovers around 50%, with weaker models performing worse. Models tend to classify most uncertain queries as ambiguous, frequently requesting clarification instead of attempting retrieval or deeper reasoning.
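UCA appears to be plain three‑way accuracy over the labeled uncertainty sources; a minimal sketch under that assumption (the metric name is from the article, the function is our own shorthand):

```python
def uncertainty_classification_accuracy(predicted: list[str],
                                        gold: list[str]) -> float:
    """Share of cases where the predicted uncertainty source matches the
    gold label (document_missing / ability_limited / ambiguous)."""
    assert predicted and len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```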
Method
3.1 Uncertainty Judgment
We observe that even when models cannot directly identify the uncertainty source, they often generate reasonable inquiry questions. By analyzing the content of the generated inquiry, we classify uncertainty (a code sketch follows the list):
If the inquiry offers alternatives, e.g. "Beijing or Shanghai?" → ambiguous.
If the inquiry asks for additional documents → document‑missing.
If the model repeats the original question or gives only vague responses → ability‑limited.
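A minimal sketch of the heuristics above, with naive string checks standing in for the LLM‑based judgment the article implies; all cues and thresholds are illustrative.

```python
def classify_by_inquiry(query: str, inquiry: str) -> str:
    """Map the model's follow-up inquiry onto an uncertainty source using
    naive surface cues (a real system would use an LLM judge instead)."""
    text = inquiry.strip().lower()
    if " or " in text and text.endswith("?"):
        return "ambiguous"          # offers alternatives: "Beijing or Shanghai?"
    if any(k in text for k in ("document", "source", "reference", "more context")):
        return "document_missing"   # asks for additional material
    if text == query.strip().lower() or len(text.split()) < 4:
        return "ability_limited"    # parrots the query or replies vacuously
    return "ability_limited"        # default: no actionable inquiry content
```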
We further simulate environment feedback to the inquiry and use the model's subsequent answer to reinforce the classification. To probe the model's confidence we also use a belief‑test style method in the spirit of iterative prompting [6]: give the model a hypothesized answer, then ask it to generate a different one; how readily it deviates reveals how strongly it believes the hypothesis.
3.2 Theoretical Analysis
We formalize the problem with query x, answer y, knowledge document d, and clarification query c. Let θ be model parameters; Uc, Uk, and Ua denote uncertainty from ability, knowledge gap, and ambiguity respectively. The theorem states that a high‑quality inquiry preserves the original uncertainty type, while a meaningless inquiry indicates inability to understand the query.
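One plausible way to write the decomposition down; the additive form below is our own notational choice for illustration, not a formula quoted from the paper.

```latex
% x: query, y: answer, d: knowledge document, c: clarification query,
% theta: model parameters. Assumed additive decomposition (illustrative only):
\[
  U(y \mid x) \;=\; \underbrace{U_c(x;\theta)}_{\text{ability}}
  \;+\; \underbrace{U_k(x, d)}_{\text{knowledge gap}}
  \;+\; \underbrace{U_a(x)}_{\text{ambiguity}},
  \qquad y \sim p_\theta(\,\cdot \mid x, d, c\,).
\]
```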
3.3 Experimental Results
Using ConfuseBench, our on‑policy DPO method improves uncertainty‑source detection. The model generates an inquiry and attempts to answer it; if the answer resolves the original query, the inquiry is marked as chosen, otherwise as rejected. This feedback loop enhances both inquiry quality and classification accuracy.
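A minimal sketch of this feedback loop for collecting preference pairs, assuming hypothetical `generate_inquiry`, `answer_with`, and `resolves` helpers; the resulting prompt/chosen/rejected records would then be fed to a standard DPO trainer.

```python
def collect_dpo_pairs(queries, generate_inquiry, answer_with, resolves,
                      n_samples: int = 4) -> list[dict]:
    """On-policy preference collection: sample inquiries from the current
    model, mark an inquiry 'chosen' when its answer resolves the original
    query and 'rejected' otherwise (all helpers are hypothetical)."""
    pairs = []
    for query in queries:
        chosen, rejected = [], []
        for _ in range(n_samples):
            inquiry = generate_inquiry(query)      # sample from the policy
            answer = answer_with(query, inquiry)   # answer via the inquiry
            (chosen if resolves(query, answer) else rejected).append(inquiry)
        # DPO needs one preferred and one dispreferred completion per prompt.
        if chosen and rejected:
            pairs.append({"prompt": query,
                          "chosen": chosen[0],
                          "rejected": rejected[0]})
    return pairs
```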
Application
The dataset construction and uncertainty‑recognition methods are deployed in Alimama's AI assistant (AI Xiao Wan) for tasks such as fuzzy‑intent clarification and inquiry generation, improving the quality of marketing‑assistant interactions.
Conclusion
We introduced ConfuseBench, a benchmark that evaluates LLMs’ ability to diagnose uncertainty sources, and an on‑policy DPO approach that leverages inquiry generation to improve this capability. Experiments show that current models still struggle, often defaulting to ambiguous classifications, but our methods provide a pathway toward more self‑aware and effective LLMs.
References
[1] Xiong M. et al., "Can LLMs express their uncertainty?", arXiv:2306.13063, 2023.
[2] Li J. et al., "Know the unknown: An uncertainty‑sensitive method for LLM instruction tuning", arXiv:2406.10099, 2024.
[3] Feng S. et al., "Don't Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi‑LLM Collaboration", arXiv:2402.00367, 2024.
[4] Qian C. et al., "Tell me more! Towards implicit user intention understanding of language model driven agents", arXiv:2402.09205, 2024.
[5] Wang R. et al., "Retriever‑and‑Memory: Towards Adaptive Note‑Enhanced Retrieval‑Augmented Generation", arXiv:2410.08821, 2024.
[6] Abbasi Yadkori Y. et al., "To believe or not to believe your LLM: Iterative prompting for estimating epistemic uncertainty", NeurIPS 2024.
[7] Chen Y. et al., "TourRank: Utilizing large language models for documents ranking with a tournament‑inspired strategy", ACM Web Conference 2025.
