Can LLMs Ask the Right Questions? Introducing AR‑Bench for Active Reasoning

Large Language Models excel at passive reasoning but struggle when information is incomplete. This paper defines the active reasoning problem, presents the AR‑Bench benchmark with detective, puzzle, and number‑guessing tasks, and shows through extensive experiments that even top models such as GPT‑4o perform poorly, highlighting open research gaps.

AI Frontier Lectures

Background

Large Language Models (LLMs) excel at complex reasoning when full information is provided, especially with Chain‑of‑Thought (CoT) prompting. Existing research mainly studies passive reasoning (PR), where the model receives all necessary data up front.

Active Reasoning (AR)

In many real‑world situations, information is incomplete, and the model must actively acquire missing clues, much as a detective gathers evidence or a doctor asks follow‑up questions. Active Reasoning (AR) is defined as a paradigm in which an LLM interacts with external sources (databases, APIs, or humans) to ask relevant questions, iteratively refine its answer, and solve the task.

AR‑Bench Benchmark

AR‑Bench evaluates AR capabilities through three task types:

Detective Cases (DC): Simulated criminal investigations that require clue gathering and commonsense reasoning.

Situation Puzzles (SP): Lateral‑thinking riddles whose hidden backstory must be uncovered through yes/no questions, testing logical and divergent thinking.

Guessing Numbers (GN): Classic number‑guessing games that assess symbolic reasoning.
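As an illustration of the GN task, the paper describes it as a classic number‑guessing game. A minimal sketch of a feedback rule, assuming the well‑known Bulls‑and‑Cows convention (an assumption; AR‑Bench's exact feedback format may differ):

```python
# Hedged sketch: feedback for a Bulls-and-Cows-style guessing game.
# The exact rule AR-Bench's GN task uses is an assumption here.
def feedback(secret: str, guess: str) -> tuple[int, int]:
    """Return (exact, misplaced): digits correct in position, and digits
    present in the secret but in the wrong position."""
    exact = sum(s == g for s, g in zip(secret, guess))
    # Count shared digits regardless of position, then subtract exact hits.
    common = sum(min(secret.count(d), guess.count(d)) for d in set(guess))
    return exact, common - exact
```

For example, `feedback("1234", "1325")` yields `(1, 2)`: the digit 1 is in place, while 3 and 2 appear in the secret at other positions.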

The benchmark follows a multi‑turn interaction format: a questioning LLM converses with a responder agent that supplies information based on the model’s queries.
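The multi‑turn format above can be sketched as a simple questioner/responder loop. The three callables stand in for LLM calls and are assumptions for illustration, not the benchmark's actual API:

```python
from dataclasses import dataclass

@dataclass
class Case:
    public_info: str   # initial scenario shown to the questioner
    hidden_info: str   # ground truth visible only to the responder

def active_reasoning_loop(case, ask_question, respond, final_answer,
                          max_turns=25):
    """Hedged sketch of AR-Bench's multi-turn protocol: the questioner
    queries the responder for up to max_turns rounds, then commits to
    an answer. ask_question/respond/final_answer are hypothetical
    stand-ins for LLM calls."""
    history = []                                       # shared transcript
    for _ in range(max_turns):
        question = ask_question(case.public_info, history)
        answer = respond(case.hidden_info, question)   # responder turn
        history.append((question, answer))
    return final_answer(case.public_info, history)     # final commitment
```

The key design point is the information asymmetry: only the responder sees `hidden_info`, so the questioner's score depends entirely on what its questions manage to surface.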

Evaluation Metrics

Two complementary dimensions are measured:

Result Evaluation: Accuracy for DC and GN; F1 score for the open‑ended SP answers.

Process Evaluation: For DC and SP, an LLM‑as‑judge checks each interaction step against predefined key questions to score whether it effectively addresses the information need. For GN, the numeric accuracy of the responder's feedback is computed.
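The F1 score for open‑ended SP answers can be computed in several ways; a common choice is token‑level F1 between prediction and reference, as in SQuAD‑style evaluation. The tokenization below is an assumption, not necessarily what AR‑Bench uses:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Hedged sketch: token-level F1 between a predicted and a reference
    answer, assuming lowercase whitespace tokenization."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```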

Figure 1: Passive vs Active Reasoning

Experimental Results

Several state‑of‑the‑art LLMs (including GPT‑4o) and a range of prompting, search‑based (Tree‑of‑Thought), and training‑based (SFT, DPO) methods were evaluated on AR‑Bench. Key observations:

GPT‑4o attains only ~35% accuracy on the GN task.

Fine‑grained guidance and search‑based methods provide minimal gains.

Training‑based approaches sometimes degrade performance.

Advanced active‑reasoning methods (Proactive CoT, Uncertainty of Thoughts) do not substantially improve results, while human participants outperform all tested models.

Figure 4: Model Performance on AR‑Bench
Figure 6: Human vs Model Performance

Ablation Studies

Three ablations were performed:

Fixing interaction information to isolate pure reasoning ability.

Extending the number of interaction rounds.

Assessing the reliability of the responder model.

Findings:

Larger models extract more useful information from fixed records.

Simply increasing interaction rounds does not fully solve AR tasks.

Responder agents provide reliable answers when queried by the primary model.

Figure 9: Interaction Records for Different Model Sizes

Error Analysis

Typical error patterns across tasks include:

Posing overly broad or irrelevant questions.

Failing to ask helpful follow‑up queries.

Timeline misunderstandings, ignoring evidence, and making unsupported assumptions.

Figure 12: GPT‑4o Error Cases

Conclusion and Future Work

AR‑Bench provides a systematic evaluation of active reasoning, revealing that current LLMs, despite strong passive reasoning abilities, perform poorly on realistic information‑incomplete tasks. Future directions include:

Collecting high‑quality fine‑tuning data.

Adapting reinforcement‑learning‑based reasoning methods (e.g., PPO, GRPO, DAPO) to AR.

Developing more reliable verification mechanisms for search‑based approaches.

Extending AR‑Bench to domains such as medical assistance, multi‑turn retrieval‑augmented generation, tool use, robotics, and gaming.

For full details, see the paper (ICML 2025) and the code repository at https://github.com/tmlr-group/AR-Bench.
