Mapping LLM Reasoning: Paradigms, Methods, and Failure Modes in a Periodic Table
This 103‑page survey of over 300 recent papers organizes large language model reasoning into a periodic‑table framework, explains where reasoning emerges, categorizes 36 method families across six dimensions, critiques accuracy‑only evaluation, and outlines key open challenges such as fidelity, robustness, calibration, generalization, efficiency, and safety.
Introduction
LLMs excel at question answering, programming, math, retrieval, and multimodal tasks, but what counts as "reasoning" in these systems remains ambiguous. The survey traces where reasoning emerges: the context stream, attention and MLP computation, the next-token distribution, sampling, and search. Chain-of-Thought (CoT) text, tool calls, retrieved evidence, and intermediate drafts all become part of the context that conditions subsequent token generation.
Methodology
A structured literature review collected over 300 papers from arXiv, Semantic Scholar, Google Scholar, Papers with Code, and the ACL Anthology. Selection criteria were topical relevance, coverage of the past five years, methodological clarity, experimental evidence, novelty, and reproducibility. The resulting distribution emphasizes code/algorithm reasoning, meta-reasoning, Retrieval-Augmented Generation (RAG), reinforcement learning (RL), and CoT, with fewer works on commonsense, temporal, multi-hop, and social-cognitive reasoning.
LLM Reasoning Periodic Table
36 method families are organized in a 6×6 grid. Columns represent combinatorial paradigms: stepwise decomposition, domain specialization, context grounding, enhanced reasoning, learning & reflection, and cross‑boundary reasoning. Rows represent hierarchical levels from basic techniques to higher‑order cognition.
Key Paradigms
Stepwise Decomposition & Multi‑hop Reasoning – CoT makes implicit problem solving explicit, aiding state maintenance and error checking; benefits are strongest for symbolic, mathematical, and multi‑step tasks.
Mathematical, Code, and Verifiable Reasoning – Small errors break entire solution chains; trends include architectural enhancements, synthetic data, process supervision, search‑based verification, and Olympiad‑level evaluation.
RAG, Tool Augmentation & Agent Reasoning – Extends knowledge access and enables interaction with calculators, code interpreters, and APIs; improves access to real-time information, precise computation, and long-term memory, while introducing tool-selection errors, execution failures, and safety-boundary concerns.
Reinforcement Learning & Meta‑Reasoning – RL optimizes search strategies, reasoning‑chain quality, and computational budgeting; meta‑reasoning treats the model’s own thinking as an object to plan, critique, and improve across three layers.
Multilingual, Social, and Safety-Focused Reasoning – Highlights cross-language transfer challenges, cultural knowledge gaps, and a "social reasoning paradox" in which CoT can amplify confidence and bias; addressing it requires bias-aware metrics and multi-model fusion.
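The tool-augmentation paradigm above can be illustrated with a minimal sketch of a dispatcher that routes a model-emitted tool request to a calculator and reports the two failure modes the summary names: tool-selection errors and execution failures. The `TOOLS` registry, the `run_tool` function, and the result dictionary format are hypothetical names chosen for illustration; they are not taken from the survey.

```python
import ast
import operator

# Whitelisted arithmetic operators; anything else is rejected,
# which acts as a simple safety boundary for this toy tool.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def calculator(expression: str):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def _eval(node):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


# Hypothetical registry mapping tool names to callables.
TOOLS = {"calculator": calculator}


def run_tool(name: str, argument: str) -> dict:
    """Dispatch a tool request and label the failure mode if it fails."""
    if name not in TOOLS:
        return {"ok": False, "error": f"tool-selection error: unknown tool {name!r}"}
    try:
        return {"ok": True, "result": TOOLS[name](argument)}
    except Exception as exc:  # syntax errors, division by zero, rejected nodes
        return {"ok": False, "error": f"execution failure: {exc}"}
```

In an agent loop, the returned dictionary would be appended to the context so the model can react to either the result or the error; that loop is omitted here.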
Unified Workflow
Question → Decompose → Retrieve → Step → Verify → Aggregate → Answer. Paradigms differ in whether they require decomposition, external evidence, tool execution, verification, or aggregation.
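The workflow above can be sketched as a small pipeline in which each stage is a separate function. This is a toy under stated assumptions: the `KNOWLEDGE` dictionary stands in for a retriever, the splitting rule in `decompose` is deliberately naive, and all function and class names are hypothetical rather than drawn from the survey.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Toy knowledge base standing in for a retriever (illustrative only).
KNOWLEDGE = {
    "capital of france": "Paris",
    "capital of japan": "Tokyo",
}


@dataclass
class Trace:
    """Records each sub-question, its intermediate answer, and whether it verified."""
    question: str
    steps: List[Tuple[str, Optional[str], bool]] = field(default_factory=list)


def decompose(question: str) -> List[str]:
    # Split a compound question into normalized sub-questions.
    return [part.strip(" ?").lower() for part in question.split(" and ")]


def retrieve(sub_question: str) -> Optional[str]:
    # Look up evidence; None means nothing was found.
    return KNOWLEDGE.get(sub_question)


def step(sub_question: str, evidence: Optional[str]) -> Optional[str]:
    # Produce an intermediate answer; here it is just the evidence itself.
    return evidence


def verify(intermediate: Optional[str], evidence: Optional[str]) -> bool:
    # Accept an intermediate answer only if it is backed by evidence.
    return intermediate is not None and intermediate == evidence


def aggregate(trace: Trace) -> str:
    # Refuse to answer if any step failed verification.
    if not all(ok for _, _, ok in trace.steps):
        return "insufficient evidence"
    return ", ".join(ans for _, ans, _ in trace.steps)


def answer(question: str) -> Tuple[str, Trace]:
    trace = Trace(question)
    for sub in decompose(question):
        evidence = retrieve(sub)
        intermediate = step(sub, evidence)
        trace.steps.append((sub, intermediate, verify(intermediate, evidence)))
    return aggregate(trace), trace
```

A paradigm that skips a stage corresponds to replacing that function with a pass-through; for example, plain CoT would drop `retrieve` and rely on `step` alone.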
Synthesis and Open Challenges
There is a gap between visible accuracy gains and invisible reasoning quality. Core evaluation dimensions are:
Fidelity – does the reasoning chain truly lead to the answer?
Robustness – stability under re‑phrasing, evidence order changes, or distractors.
Calibration – whether the model recognizes its own uncertainty and knows when to refuse safely or request a tool.
Generalization – transfer to new domains, languages, modalities, and longer time spans.
Efficiency – cost‑benefit of longer CoT, more sampling, or additional tool calls.
Safety – new biases or attack surfaces introduced by tool‑enabled agents, social reasoning, and multilingual deployment.
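Two of the dimensions above, robustness and calibration, lend themselves to simple numeric probes. The sketch below is one possible way to measure them, not the survey's own protocol: `consistency_rate` scores agreement across answers to paraphrased prompts, and `expected_calibration_error` implements the standard binned ECE metric. Both function names and the choice of ECE are assumptions made for illustration.

```python
from collections import Counter


def consistency_rate(answers):
    """Robustness probe: fraction of answers that agree with the most common one.

    `answers` holds the model's answers to paraphrases of the same question.
    """
    if not answers:
        raise ValueError("need at least one answer")
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Calibration probe: binned gap between stated confidence and accuracy.

    `confidences` are values in [0, 1]; `correct` are matching booleans.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += len(bucket) / total * abs(accuracy - avg_conf)
    return ece
```

A model that answers three of four paraphrases identically scores a consistency rate of 0.75; a model that reports 0.9 confidence but is right half the time has an ECE of about 0.4.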
Conclusion
The survey consolidates LLM reasoning research into a structured map and acknowledges genuine progress from CoT, RAG, tool use, RL, and meta-reasoning, while stressing that current abilities remain statistical, fragile, and distribution-sensitive. Evaluation must therefore probe why answers are correct, how faithful the reasoning process is, how robust it is across scenarios, what it costs computationally, and what risks it introduces.
Paper: https://arxiv.org/abs/2606.11470
Source: Zhuanzhi (专知)
