Why RL‑Trained Agents Still Fail to Reason Actively: The Information Self‑Locking Problem

The paper reveals that outcome‑based reinforcement learning often traps LLM agents in an information self‑locking regime where weak action selection and belief tracking prevent proper credit assignment, and introduces AREW, a lightweight advantage‑reweighting method that restores active reasoning across multiple tasks and models.

Machine Heart

Active Reasoning in LLM Agents

LLM agents that interact over multiple turns must actively acquire missing information (e.g., by asking questions, searching, or invoking tools) and integrate new evidence into their internal belief about the task. This capability is termed active reasoning.

Problem: Outcome‑Based RL Fails to Train Active Reasoning

Standard outcome‑based reinforcement learning (RL) provides a reward only at the end of a trajectory. In active‑reasoning settings this reward cannot clearly credit two tightly coupled processes:

Action Selection (AS): choosing the next interaction based on the current belief.

Belief Tracking (BT): updating the internal belief after receiving feedback.
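
To make the two processes concrete, here is a minimal sketch of an interaction loop on a toy guessing game. The game, the `ToyAgent` class, and its method names are hypothetical and are not taken from the paper; they only separate the AS step (what to ask) from the BT step (how to absorb the answer).

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Hypothetical agent that must identify a hidden integer in [0, n)."""
    n: int
    belief: set = field(init=False)

    def __post_init__(self):
        # The belief is the set of candidates still consistent with the evidence.
        self.belief = set(range(self.n))

    def select_action(self) -> int:
        # AS: ask "is the target < threshold?" at the median of the current belief.
        ordered = sorted(self.belief)
        return ordered[len(ordered) // 2]

    def update_belief(self, threshold: int, answer: bool) -> None:
        # BT: keep only the candidates that agree with the received answer.
        self.belief = {c for c in self.belief if (c < threshold) == answer}


def run_episode(target: int, n: int = 16, max_turns: int = 8):
    """Run one episode; return (success, turns). Only `success` would be rewarded."""
    agent = ToyAgent(n)
    turns = 0
    while len(agent.belief) > 1 and turns < max_turns:
        threshold = agent.select_action()
        agent.update_belief(threshold, target < threshold)
        turns += 1
    guess = next(iter(agent.belief))
    return guess == target, turns
```

In this toy setting an outcome‑based reward would only see the final `success` flag, not which question or which belief update made it possible.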

When either AS or BT is weak, the other receives too little learning signal, which leads to a self‑reinforcing low‑information regime called Information Self‑Locking (ISL). In ISL, agents repeat ineffective actions, over‑rely on initial judgments, and ignore new evidence, even though the final reward may still improve.

Formalization of ISL

Weak AS yields uninformative feedback; weak BT fails to absorb that feedback; both capabilities stagnate.

The authors formalize ISL as a region in the AS‑BT capability space where both dimensions are low, causing the outcome‑gradient to be vanishingly small for either component.
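
The coupling can be illustrated with a small simulation. This is a toy model, not the paper's formalization: `p_as` and `p_bt` are hypothetical capability parameters, where `p_as` is the probability of asking an informative question and `p_bt` is the probability of actually absorbing the answer.

```python
import random


def expected_reward(p_as, p_bt, n=16, turns=4, episodes=200, seed=0):
    """Average chance of a correct final guess in a toy hidden-integer game."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        target = rng.randrange(n)
        belief = list(range(n))  # candidates still considered possible
        for _ in range(turns):
            # AS: with probability p_as ask a median split, otherwise ask
            # "is the target < 0?", whose answer is always "no".
            informative = rng.random() < p_as
            threshold = belief[len(belief) // 2] if informative else 0
            answer = target < threshold
            # BT: with probability p_bt absorb the answer, otherwise ignore it.
            if rng.random() < p_bt:
                belief = [c for c in belief if (c < threshold) == answer]
        # A uniform guess from the remaining belief is correct with this chance.
        total += 1.0 / len(belief)
    return total / episodes
```

Under these assumptions, `expected_reward(0.0, p_bt)` stays at chance level (1/16) for every `p_bt`, and `expected_reward(p_as, 0.0)` stays at chance level for every `p_as`: when one capability is absent, improving the other does not change the outcome, so the outcome carries no signal for it.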

Proposed Solution: AREW

The paper introduces Action‑Selection & Belief‑Tracking Advantage Reweighting (AREW). AREW leverages readily available, coarse directional signals (e.g., whether an action produced useful evidence, or whether a belief update moved closer to the ground truth) to re‑allocate credit inside a trajectory without modifying the final outcome reward, the critic target, or the core PPO/GRPO/GSPO algorithm.

Implementation details:

For each trajectory, steps identified as positive (useful) are collected as positive_steps; steps identified as negative (harmful) are collected as negative_steps.

An intra‑trajectory likelihood margin is constructed by increasing the log‑probability of positive_steps and decreasing that of negative_steps.

This amounts to adding a lightweight advantage correction to the policy‑gradient update (see Section 4.2 of the paper).
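
The steps above can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formula: the margin is taken to be the mean log‑probability of `positive_steps` minus the mean log‑probability of `negative_steps`, and `coef` is a hypothetical weighting coefficient. Under those assumptions, adding `coef` times the margin to the policy‑gradient surrogate is equivalent to a per‑step additive advantage correction.

```python
def likelihood_margin(logps, positive_steps, negative_steps):
    """Mean log-prob of positive steps minus mean log-prob of negative steps."""
    pos = sum(logps[t] for t in positive_steps) / len(positive_steps) if positive_steps else 0.0
    neg = sum(logps[t] for t in negative_steps) / len(negative_steps) if negative_steps else 0.0
    return pos - neg


def advantage_correction(num_steps, positive_steps, negative_steps, coef):
    """Per-step correction: the gradient of coef * margin w.r.t. each step's log-prob."""
    delta = [0.0] * num_steps
    for t in positive_steps:
        delta[t] += coef / len(positive_steps)
    for t in negative_steps:
        delta[t] -= coef / len(negative_steps)
    return delta


def reweighted_advantages(advantages, positive_steps, negative_steps, coef=0.5):
    """Add the correction to the outcome-based advantages; rewards stay untouched."""
    delta = advantage_correction(len(advantages), positive_steps, negative_steps, coef)
    return [a + d for a, d in zip(advantages, delta)]
```

For example, with a trajectory whose four steps all share the outcome advantage 0.2, `reweighted_advantages([0.2] * 4, [1], [0, 2], coef=0.4)` raises the advantage of step 1 and lowers those of steps 0 and 2, while step 3 (no critique) is unchanged.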

Two variants are evaluated:

AREW‑AS‑only: reweighting only the AS side.

AREW‑AS+BT: reweighting both AS and BT sides.

Experimental Setup

Four interaction domains covering nine active‑reasoning tasks are used:

Preference Estimation (pairwise comparisons, PE‑G and PE‑F settings).

Medical Diagnosis (MediQ).

Troubleshooting (FloDial‑Hard).

Customer‑Service / Tool Use (tau2‑bench Telecom, both solo and dual‑control settings).

Models: Qwen2.5‑7B‑Instruct and LLaMA‑3.1‑8B‑Instruct. RL algorithms: PPO, GRPO, and GSPO. All tasks provide only terminal supervision; directional critiques are derived from task‑specific heuristics (e.g., whether a question yields new diagnostic information, or whether belief confidence moves toward the ground truth).
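
As a rough sketch of what such heuristics might look like, the functions below turn the two examples into coarse +1 / 0 / -1 labels. The function names, the set‑of‑facts representation, and the tolerance `tol` are all hypothetical; the actual heuristics are task‑specific and are defined in the paper and its code release.

```python
def as_critique(known_facts: set, returned_facts: set) -> int:
    """AS label: +1 if the interaction surfaced at least one new fact, else -1."""
    return 1 if returned_facts - known_facts else -1


def bt_critique(conf_before: float, conf_after: float, tol: float = 1e-6) -> int:
    """BT label from the confidence assigned to the ground-truth answer."""
    if conf_after > conf_before + tol:
        return 1   # the belief moved toward the ground truth
    if conf_after < conf_before - tol:
        return -1  # the belief moved away from the ground truth
    return 0       # no meaningful change, so no critique


def split_steps(critiques):
    """Turn per-step labels into (positive_steps, negative_steps) index lists."""
    positive_steps = [t for t, c in enumerate(critiques) if c > 0]
    negative_steps = [t for t, c in enumerate(critiques) if c < 0]
    return positive_steps, negative_steps
```

The labels only carry direction, not magnitude, which matches the article's point that the signals can be weak and uncalibrated.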

Results

Across 28 PPO configurations, AREW improves the final reward in 27 settings with statistical significance. Key observations:

AREW‑AS‑only already raises both AS and BT proxy metrics, showing that better information acquisition creates richer feedback for belief updates.

AREW‑AS+BT yields the strongest BT improvements, confirming the benefit of jointly correcting the two coupled abilities.

Training curves show faster convergence and higher asymptotic performance compared to vanilla PPO.

In the tau2‑bench Telecom solo setting, AREW raises reward from ~0.20 to ~0.50, reduces tool‑execution errors, and does not increase response length or interaction turns.

In the dual‑control setting (assistant + GPT‑4o‑simulated user), AREW shifts policy away from over‑reliance on user‑side shortcuts toward more autonomous tool use.

Figure references are to the original paper: Figure 1 illustrates the self‑locking mechanism, Figure 2 visualizes reward, AS, and BT dynamics, and Figure 8 (a phase‑space diagram) shows how directional critiques introduce an effective update field inside the locking regime.

Robustness to Noisy Critiques

Randomly flipping a fraction of the directional critiques degrades performance gracefully. Even with substantial noise, AREW remains competitive with or superior to the baseline, confirming that perfect supervision is not required. Additional ablations (removing AS or BT critiques, truncating the critique sequence, or replacing critiques with constant labels) lead to the same qualitative conclusions.
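
A noise injection of this kind could be sketched as below. The function name, the `flip_rate` parameter, and the independent per‑step flipping are assumptions made for illustration; the paper's exact noise protocol may differ.

```python
import random


def flip_critiques(critiques, flip_rate, seed=0):
    """Negate each +1/-1 critique independently with probability flip_rate."""
    rng = random.Random(seed)
    # Zero labels (no critique) are unaffected, since negating 0 gives 0.
    return [-c if rng.random() < flip_rate else c for c in critiques]
```

With `flip_rate=0.0` the labels are returned unchanged, and with `flip_rate=1.0` every non‑zero label is inverted, so intermediate values interpolate between clean and fully adversarial critiques.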

Effectiveness Across RL Algorithms

AREW also improves performance when applied to group‑based RL variants GRPO and GSPO, demonstrating that increasing rollout samples alone does not resolve the AS/BT credit‑assignment issue.
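
One way to see why more rollouts alone may not help is sketched below. It computes group‑relative advantages in the style of GRPO and then adds a per‑step critique term. The population standard deviation, the `eps` constant, and the simple `coef * critique` correction are simplifying assumptions, not the paper's exact formulation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's terminal reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def step_advantages(rewards, critiques_per_rollout, coef=0.1):
    """Broadcast each rollout's group advantage to its steps, then add critiques."""
    group_adv = group_relative_advantages(rewards)
    return [
        [adv + coef * c for c in critiques]
        for adv, critiques in zip(group_adv, critiques_per_rollout)
    ]
```

If every rollout in a group ends with the same reward, `group_relative_advantages` returns all zeros no matter how large the group is, so the outcome signal vanishes; in this sketch the critique term is the only thing that still distinguishes individual steps.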

Key Takeaways

Final outcome reward alone is insufficient to judge active‑reasoning competence; agents can improve reward without truly learning to acquire or use information.

Failures often stem from the coupling of AS and BT rather than a single weak component.

Weak, uncalibrated directional signals can substantially improve credit assignment without dense rewards.

The approach is expected to apply broadly to more complex agentic systems such as research assistants, code generators, and autonomous computer‑use agents.

Resources

Code and data are released at https://github.com/unimpor/T3. The paper is available on arXiv: https://arxiv.org/abs/2603.12109.

@article{zou2026information,
  title={On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents},
  author={Zou, Deyu and Chen, Yongqiang and Feng, Fan and Li, Mufei and Li, Pan and Gong, Yu and Cheng, James},
  journal={arXiv preprint arXiv:2603.12109},
  year={2026}
}

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
