T²PO: Uncertainty‑Guided Exploration Control for Stable Multi‑Turn Agent RL
The paper identifies inefficient exploration, termed "hesitation," as the root cause of instability in multi‑turn reinforcement learning for LLM agents and introduces T²PO, an uncertainty‑driven token‑ and turn‑level control framework that markedly improves training stability and performance on benchmarks such as WebShop, ALFWorld, and Search QA.
Problem and Motivation
Multi‑turn reinforcement learning (RL) is essential for enhancing large language model (LLM) agents in complex interactive tasks, yet training often suffers from severe instability and occasional collapse. Existing remedies—fine‑grained credit assignment, process‑reward modeling, or trajectory filtering—are either too coarse or rely on indirect reward shaping, making them highly sensitive to hyper‑parameters.
Root‑Cause Diagnosis
The UCLA and Amazon research team (Haixin Wang, corresponding author Hejie Cui) pinpointed inefficient exploration, or "hesitation," as the fundamental issue. Hesitation manifests at two levels:
Token level: agents generate long token sequences whose information gain quickly saturates while sampling noise accumulates, leading to over‑thinking.
Turn level: agents deviate early from successful action spaces and then repeat ineffective turns, injecting noise into credit assignment and causing high‑variance gradient updates.
Proposed Solution: T²PO
T²PO (Token‑ and Turn‑level Policy Optimization) is an uncertainty‑guided exploration control framework that directly monitors and intervenes in exploration rather than shaping rewards.
Core Components
Cold‑start: Uses Rejection‑based Fine‑Tuning (RFT) to filter out failed trajectories and fine‑tune on high‑quality ones, providing a stable initialization.
Uncertainty Signal: Constructs a self‑calibrated uncertainty signal M_t by fusing entropy and confidence, serving as a real‑time monitor during trajectory collection (a minimal sketch of one possible fusion follows this component list).
Uncertainty Quantification & Hesitation Detection:
Token‑level hesitation: Triggered when marginal uncertainty change falls below a threshold, indicating that further token generation adds noise without reducing uncertainty.
Turn‑level hesitation: Detected when a sequence of turns exhibits similar uncertainty patterns, suggesting a loop of useless actions.
Token‑level Control – Thinking Intervention: When the marginal change in uncertainty is small, the system terminates the current thinking chain and forces the model to output an action, preventing wasteful token generation (see the token‑level sketch after this list).
Turn‑level Control – Dynamic Resampling: Identifies low‑impact turns and discards them, then resamples from a more promising earlier context so that the whole trajectory is not wasted (see the turn‑level sketch after this list).
Stability Enhancements:
Memory context window to alleviate long‑range interaction pressure.
Strict format penalty to ensure structured outputs.
State‑of‑the‑art policy‑update algorithm for final optimization.
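To make the token‑level mechanism concrete, the sketch below shows one way the per‑step signal and the intervention could be wired together: normalized entropy is fused with top‑token confidence into a single score, hesitation is flagged when the score's marginal change stays below a threshold for several steps, and the thinking chain is then closed so the model must emit an action. The equal fusion weights, the `eps` and `patience` values, and the `</think>` / `Action:` intervention text are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def uncertainty_signal(step_logprobs: np.ndarray) -> float:
    """Fuse entropy and confidence into one per-step uncertainty score M_t.

    `step_logprobs` holds the log-probabilities over the vocabulary at a single
    decoding step. The equal-weight fusion below is an assumed stand-in for the
    paper's self-calibrated signal.
    """
    probs = np.exp(step_logprobs - step_logprobs.max())
    probs /= probs.sum()
    entropy = -(probs * np.log(probs + 1e-12)).sum()
    norm_entropy = entropy / np.log(len(probs))      # scale entropy to [0, 1]
    confidence = probs.max()                         # top-token probability
    return 0.5 * norm_entropy + 0.5 * (1.0 - confidence)

def token_hesitation(m_history: list[float], eps: float = 1e-3,
                     patience: int = 8) -> bool:
    """Flag hesitation when the marginal change in M_t stays below `eps`
    for `patience` consecutive steps: further thinking tokens are adding
    sampling noise without reducing uncertainty."""
    if len(m_history) <= patience:
        return False
    recent = m_history[-(patience + 1):]
    return all(abs(b - a) < eps for a, b in zip(recent, recent[1:]))

# Hypothetical intervention text: close the thinking span and demand an action.
INTERVENTION = "</think>\nAction:"

def maybe_intervene(m_history: list[float], partial_output: str) -> str | None:
    """Return a forced continuation when token-level hesitation fires,
    otherwise None (generation continues normally)."""
    if token_hesitation(m_history):
        return partial_output + INTERVENTION
    return None
```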
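A companion sketch for the turn‑level control, under the same caveat that the exact detection rule is not spelled out here: per‑turn mean uncertainty is compared over a sliding window, and when the values are nearly identical the suspected loop of ineffective turns is discarded and generation is resampled from the context that preceded it. `Turn`, `tol`, `window`, and `policy.generate` are placeholder names for whatever interface the agent loop actually exposes.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    context: str             # environment/history snapshot before this turn
    action: str              # the action the agent emitted in this turn
    mean_uncertainty: float  # average M_t over the turn's generated tokens

def turn_hesitation(turns: list[Turn], tol: float = 0.02,
                    window: int = 3) -> bool:
    """Flag hesitation when the last `window` turns show nearly identical
    mean uncertainty, taken here as a proxy for a loop of ineffective
    actions that only injects noise into credit assignment."""
    if len(turns) < window:
        return False
    recent = [t.mean_uncertainty for t in turns[-window:]]
    return max(recent) - min(recent) < tol

def dynamic_resample(turns: list[Turn], policy, tol: float = 0.02,
                     window: int = 3) -> str | None:
    """Drop the suspected low-impact turns and resample from the context
    recorded just before the loop began, so the trajectory collected so
    far is salvaged instead of discarded wholesale."""
    if not turn_hesitation(turns, tol, window):
        return None
    anchor = turns[-window].context   # context before the first looping turn
    del turns[-window:]               # discard the low-impact turns
    return policy.generate(anchor)    # placeholder sampling call
```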
Experimental Setup
Evaluation was performed on three representative multi‑turn agent benchmarks:
WebShop: Simulated online shopping requiring multi‑step decision making and retrieval.
ALFWorld: Text‑based household robot tasks such as "place the apple on the table."
Search QA: Complex information‑retrieval question answering with multi‑turn reasoning.
The primary metrics were success rate and training stability, monitored via success‑rate curves, KL‑divergence explosions, and gradient‑norm fluctuations.
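As a rough illustration of how those stability signals can be tracked during training, the sketch below computes a batch success rate, a sampled‑token estimate of the KL divergence from a frozen reference policy, and the global gradient norm after a backward pass. The function name and its inputs are assumptions made for illustration, not the paper's instrumentation.

```python
import torch

def stability_metrics(policy_logprobs: torch.Tensor,
                      ref_logprobs: torch.Tensor,
                      model: torch.nn.Module,
                      successes: list[bool]) -> dict:
    """Per-update diagnostics: success rate, KL(policy || reference) estimated
    on the sampled tokens, and the global gradient norm. Call after
    loss.backward() so parameter gradients are populated."""
    kl = (policy_logprobs - ref_logprobs).mean()   # E_pi[log pi - log pi_ref]
    grad_sq = 0.0
    for p in model.parameters():
        if p.grad is not None:
            grad_sq += p.grad.detach().pow(2).sum().item()
    return {
        "success_rate": sum(successes) / max(len(successes), 1),
        "kl": kl.item(),
        "grad_norm": grad_sq ** 0.5,
    }
```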
Figure 1: Baseline methods exhibit rapid success‑rate decline and exploding internal signals across random seeds, while T²PO maintains stable, rising performance.
Results
Compared with state‑of‑the‑art baselines, T²PO achieved significant improvements in both stability and task performance. Across all three environments, T²PO prevented training collapse, reduced token consumption on successful trajectories, and required roughly 25 % fewer interaction turns to complete tasks.
Figure 5: (a) Consistent performance gains without collapse; (b) Lower token usage for successful trajectories; (c) ~25 % fewer turns needed.
Additional analyses (Figures 6 and 7) showed that T²PO's output length and truncation rates were more favorable than those of the GiGPO baseline under varying maximum response lengths.
Limitations and Future Work
The paper does not explicitly discuss limitations, but potential concerns include sensitivity to the uncertainty‑threshold hyper‑parameter, generalization to more open‑domain or larger‑scale environments, and the computational overhead of continuous uncertainty monitoring and dynamic resampling.
Conclusion
T²PO contributes a novel, uncertainty‑driven exploration control paradigm that directly mitigates hesitation, thereby stabilizing multi‑turn RL training for LLM agents. The framework offers a reproducible baseline and opens avenues for further research into fine‑grained exploration governance.