T²PO: Uncertainty‑Guided Exploration Control for Stable Multi‑Turn Agent RL
The paper identifies inefficient exploration, termed "hesitation," as the root cause of instability in multi‑turn reinforcement learning for LLM agents and introduces T²PO, an uncertainty‑driven token‑ and turn‑level control framework that markedly improves training stability and performance on benchmarks such as WebShop, ALFWorld, and Search QA.
Problem and Motivation
Multi‑turn reinforcement learning (RL) is essential for enhancing large language model (LLM) agents in complex interactive tasks, yet training often suffers from severe instability and occasional collapse. Existing remedies—fine‑grained credit assignment, process‑reward modeling, or trajectory filtering—are either too coarse or rely on indirect reward shaping, making them highly sensitive to hyper‑parameters.
Root‑Cause Diagnosis
The UCLA and Amazon research team (Haixin Wang, corresponding author Hejie Cui) pinpointed inefficient exploration, or "hesitation," as the fundamental issue. Hesitation manifests at two levels:
Token level: agents generate long token sequences whose information gain quickly saturates while sampling noise accumulates, leading to over‑thinking.
Turn level: agents deviate early from successful action spaces and then repeat ineffective turns, injecting noise into credit assignment and causing high‑variance gradient updates.
Proposed Solution: T²PO
T²PO (Token‑ and Turn‑level Policy Optimization) is an uncertainty‑guided exploration control framework that directly monitors and intervenes in exploration rather than shaping rewards.
Core Components
Cold‑start: Uses Rejection‑based Fine‑Tuning (RFT) to filter out failed trajectories and fine‑tune on high‑quality ones, providing a stable initialization.
Uncertainty Signal: Constructs a self‑calibrated uncertainty signal M_t by fusing entropy and confidence, serving as a real‑time monitor during trajectory collection (a minimal sketch of one possible fusion follows this component list).
Uncertainty Quantification & Hesitation Detection:
Token‑level hesitation: Triggered when marginal uncertainty change falls below a threshold, indicating that further token generation adds noise without reducing uncertainty.
Turn‑level hesitation: Detected when a sequence of turns exhibits similar uncertainty patterns, suggesting a loop of useless actions.
Token‑level Control – Thinking Intervention: When the marginal change in uncertainty is small, the system terminates the current thinking chain and forces the model to output an action, preventing wasteful token generation (see the token‑level sketch after this list).
Turn‑level Control – Dynamic Resampling: Identifies low‑impact turns and discards them, then resamples from a more promising earlier context so that the whole trajectory is not wasted (see the turn‑level sketch after this list).
Stability Enhancements:
Memory context window to alleviate long‑range interaction pressure.
Strict format penalty to ensure structured outputs.
State‑of‑the‑art policy‑update algorithm for final optimization.
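To make the token‑level mechanism concrete, the sketch below shows one way the per‑step signal and the intervention could be wired together: normalized entropy is fused with top‑token confidence into a single score, hesitation is flagged when the score's marginal change stays below a threshold for several steps, and the thinking chain is then closed so the model must emit an action. The equal fusion weights, the `eps` and `patience` values, and the `</think>` / `Action:` intervention text are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def uncertainty_signal(step_logprobs: np.ndarray) -> float:
    """Fuse entropy and confidence into one per-step uncertainty score M_t.

    `step_logprobs` holds the log-probabilities over the vocabulary at a single
    decoding step. The equal-weight fusion below is an assumed stand-in for the
    paper's self-calibrated signal.
    """
    probs = np.exp(step_logprobs - step_logprobs.max())
    probs /= probs.sum()
    entropy = -(probs * np.log(probs + 1e-12)).sum()
    norm_entropy = entropy / np.log(len(probs))      # scale entropy to [0, 1]
    confidence = probs.max()                         # top-token probability
    return 0.5 * norm_entropy + 0.5 * (1.0 - confidence)

def token_hesitation(m_history: list[float], eps: float = 1e-3,
                     patience: int = 8) -> bool:
    """Flag hesitation when the marginal change in M_t stays below `eps`
    for `patience` consecutive steps: further thinking tokens are adding
    sampling noise without reducing uncertainty."""
    if len(m_history) <= patience:
        return False
    recent = m_history[-(patience + 1):]
    return all(abs(b - a) < eps for a, b in zip(recent, recent[1:]))

# Hypothetical intervention text: close the thinking span and demand an action.
INTERVENTION = "</think>\nAction:"

def maybe_intervene(m_history: list[float], partial_output: str) -> str | None:
    """Return a forced continuation when token-level hesitation fires,
    otherwise None (generation continues normally)."""
    if token_hesitation(m_history):
        return partial_output + INTERVENTION
    return None
```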
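A companion sketch for the turn‑level control, under the same caveat that the exact detection rule is not spelled out here: per‑turn mean uncertainty is compared over a sliding window, and when the values are nearly identical the suspected loop of ineffective turns is discarded and generation is resampled from the context that preceded it. `Turn`, `tol`, `window`, and `policy.generate` are placeholder names for whatever interface the agent loop actually exposes.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    context: str             # environment/history snapshot before this turn
    action: str              # the action the agent emitted in this turn
    mean_uncertainty: float  # average M_t over the turn's generated tokens

def turn_hesitation(turns: list[Turn], tol: float = 0.02,
                    window: int = 3) -> bool:
    """Flag hesitation when the last `window` turns show nearly identical
    mean uncertainty, taken here as a proxy for a loop of ineffective
    actions that only injects noise into credit assignment."""
    if len(turns) < window:
        return False
    recent = [t.mean_uncertainty for t in turns[-window:]]
    return max(recent) - min(recent) < tol

def dynamic_resample(turns: list[Turn], policy, tol: float = 0.02,
                     window: int = 3) -> str | None:
    """Drop the suspected low-impact turns and resample from the context
    recorded just before the loop began, so the trajectory collected so
    far is salvaged instead of discarded wholesale."""
    if not turn_hesitation(turns, tol, window):
        return None
    anchor = turns[-window].context   # context before the first looping turn
    del turns[-window:]               # discard the low-impact turns
    return policy.generate(anchor)    # placeholder sampling call
```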
Experimental Setup
Evaluation was performed on three representative multi‑turn agent benchmarks:
WebShop: Simulated online shopping requiring multi‑step decision making and retrieval.
ALFWorld: Text‑based household robot tasks such as "place the apple on the table."
Search QA: Complex information‑retrieval question answering with multi‑turn reasoning.
The primary metrics were success rate and training stability, monitored via success‑rate curves, KL‑divergence explosions, and gradient‑norm fluctuations.
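As a rough illustration of how those stability signals can be tracked during training, the sketch below computes a batch success rate, a sampled‑token estimate of the KL divergence from a frozen reference policy, and the global gradient norm after a backward pass. The function name and its inputs are assumptions made for illustration, not the paper's instrumentation.

```python
import torch

def stability_metrics(policy_logprobs: torch.Tensor,
                      ref_logprobs: torch.Tensor,
                      model: torch.nn.Module,
                      successes: list[bool]) -> dict:
    """Per-update diagnostics: success rate, KL(policy || reference) estimated
    on the sampled tokens, and the global gradient norm. Call after
    loss.backward() so parameter gradients are populated."""
    kl = (policy_logprobs - ref_logprobs).mean()   # E_pi[log pi - log pi_ref]
    grad_sq = 0.0
    for p in model.parameters():
        if p.grad is not None:
            grad_sq += p.grad.detach().pow(2).sum().item()
    return {
        "success_rate": sum(successes) / max(len(successes), 1),
        "kl": kl.item(),
        "grad_norm": grad_sq ** 0.5,
    }
```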
Figure 1: Baseline methods exhibit rapid success‑rate decline and exploding internal signals across random seeds, while T²PO maintains stable, rising performance.
Results
Compared with state‑of‑the‑art baselines, T²PO achieved significant improvements in both stability and task performance. Across all three environments, T²PO prevented training collapse, reduced token consumption on successful trajectories, and required roughly 25 % fewer interaction turns to complete tasks.
Figure 5: (a) Consistent performance gains without collapse; (b) Lower token usage for successful trajectories; (c) ~25 % fewer turns needed.
Additional analyses (Figures 6 and 7) showed that T²PO's output length and truncation rates were more favorable than those of the GiGPO baseline under varying maximum response lengths.
Limitations and Future Work
The paper does not explicitly discuss limitations, but potential concerns include sensitivity to the uncertainty‑threshold hyper‑parameter, generalization to more open‑domain or larger‑scale environments, and the computational overhead of continuous uncertainty monitoring and dynamic resampling.
Conclusion
T²PO contributes a novel, uncertainty‑driven exploration control paradigm that directly mitigates hesitation, thereby stabilizing multi‑turn RL training for LLM agents. The framework offers a reproducible baseline and opens avenues for further research into fine‑grained exploration governance.