Can LLMs Ask the Right Questions? Introducing AR‑Bench for Active Reasoning
Large Language Models excel at passive reasoning but struggle when information is incomplete. This paper formalizes the active reasoning problem, introduces the AR‑Bench benchmark with detective, puzzle, and number‑guessing tasks, and shows through extensive experiments that even top models such as GPT‑4o perform poorly, exposing a clear gap for future research.
Background
Large Language Models (LLMs) excel at complex reasoning when full information is provided, especially with Chain‑of‑Thought (CoT) prompting. Existing research mainly studies passive reasoning (PR), where the model receives all necessary data up front.
Active Reasoning (AR)
In many real‑world situations information is incomplete, requiring the model to actively acquire missing clues—similar to detectives gathering evidence or doctors asking follow‑up questions. Active Reasoning (AR) is defined as a paradigm in which an LLM interacts with external sources (databases, APIs, or humans) to ask relevant questions, iteratively refine its answer, and solve the task.
AR‑Bench Benchmark
AR‑Bench evaluates AR capabilities through three task types:
Detective Cases (DC): Simulated criminal investigations that require clue‑gathering and commonsense reasoning.
Situation Puzzles (SP): “What‑if” riddles solved via yes/no questioning, testing logical and divergent thinking.
Guessing Numbers (GN): Classic number‑guessing games that assess symbolic reasoning (see the feedback sketch below).
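For intuition, GN plays like the classic Bulls‑and‑Cows game: the model proposes a number each turn and receives structured feedback. Below is a minimal sketch of that style of feedback rule; the digit format and exact feedback convention are assumptions, not necessarily AR‑Bench's protocol.

```python
def gn_feedback(secret: str, guess: str) -> tuple[int, int]:
    """Bulls-and-Cows style feedback for one number-guessing turn.

    Returns (digits correct and in position, digits correct but misplaced).
    This is the classic rule; AR-Bench's exact feedback format may differ.
    """
    bulls = sum(s == g for s, g in zip(secret, guess))
    # Count digits shared in any position, then subtract the exact matches.
    shared = sum(min(secret.count(d), guess.count(d)) for d in set(guess))
    return bulls, shared - bulls

# Example: '2' is correctly placed; '1' and '4' appear but are misplaced.
assert gn_feedback("4271", "1234") == (1, 2)
```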
The benchmark follows a multi‑turn interaction format: a questioning LLM converses with a responder agent that supplies information based on the model’s queries.
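Concretely, the loop alternates between a questioner that decides what to ask and a responder that answers from hidden ground truth. The sketch below illustrates this shape; `ask_llm` is a hypothetical chat‑completion wrapper, and the prompts and turn budget are illustrative rather than AR‑Bench's exact ones.

```python
# Minimal sketch of the questioner/responder interaction format.
def active_reasoning_episode(ask_llm, task_description, hidden_facts, max_turns=10):
    transcript = []
    for _ in range(max_turns):
        # The questioning model sees the task plus the dialogue so far
        # and must decide which missing clue to ask for next.
        question = ask_llm(
            system="You are solving a task with incomplete information. "
                   "Ask one focused question to uncover a missing clue.",
            messages=[task_description] + transcript,
        )
        # The responder agent answers only from the hidden ground truth,
        # simulating a witness, database, or human interlocutor.
        answer = ask_llm(
            system="Answer truthfully using only these facts: " + hidden_facts,
            messages=[question],
        )
        transcript += [question, answer]
    # Once the interaction budget is spent, commit to a final answer.
    return ask_llm(
        system="Based on the dialogue, state your final answer.",
        messages=[task_description] + transcript,
    )
```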
Evaluation Metrics
Two complementary dimensions are measured:
Result Evaluation: Accuracy for DC and GN; F1 score for open‑ended SP answers.
Process Evaluation: Predefined key questions are used with an LLM‑as‑judge to score whether each interaction step effectively addresses the information need (DC and SP). For GN, the numeric accuracy of the feedback is computed.
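For the open‑ended SP answers, a standard token‑level F1 (as used in reading‑comprehension benchmarks) is one natural formulation; the paper's exact implementation may differ.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    # Open-ended answers are scored by token overlap rather than exact match.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```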
Experimental Results
Several state‑of‑the‑art LLMs (including GPT‑4o) and a range of prompting, search‑based (Tree‑of‑Thought), and training‑based (SFT, DPO) methods were evaluated on AR‑Bench. Key observations:
GPT‑4o attains only ~35% accuracy on the GN task.
Fine‑grained guidance and search‑based methods provide minimal gains.
Training‑based approaches sometimes degrade performance.
Advanced active‑reasoning methods (Proactive CoT, Uncertainty of Thoughts) do not substantially improve results, while human participants outperform all tested models.
Ablation Studies
Three ablations were performed:
Fixing interaction information to isolate pure reasoning ability.
Extending the number of interaction rounds.
Assessing the reliability of the responder model.
Findings:
Larger models extract more useful information from fixed records.
Simply increasing interaction rounds does not fully solve AR tasks.
Responder agents provide reliable answers when queried by the primary model.
Error Analysis
Typical error patterns across tasks include:
Posing overly broad or irrelevant questions.
Failing to ask helpful follow‑up queries.
Timeline misunderstandings, ignoring evidence, and making unsupported assumptions.
Conclusion and Future Work
AR‑Bench provides a systematic evaluation of active reasoning, revealing that current LLMs, despite strong passive reasoning abilities, perform poorly on realistic information‑incomplete tasks. Future directions include:
Collecting high‑quality fine‑tuning data.
Adapting reinforcement‑learning‑based reasoning methods (e.g., PPO, GRPO, DAPO) to AR.
Developing more reliable verification mechanisms for search‑based approaches.
Extending AR‑Bench to domains such as medical assistance, multi‑turn retrieval‑augmented generation, tool use, robotics, and gaming.
For full details, see the paper (ICML 2025) and the code repository at https://github.com/tmlr-group/AR-Bench.