Machine Learning Algorithms & Natural Language Processing
May 8, 2026 · Artificial Intelligence
T²PO: Uncertainty‑Guided Exploration Control for Stable Multi‑Turn Agent RL
The paper identifies inefficient exploration, which it terms "hesitation," as the root cause of instability in multi‑turn reinforcement learning for LLM agents. It introduces T²PO, an uncertainty‑driven framework that controls exploration at both the token and turn level, and reports markedly improved training stability and performance on benchmarks such as WebShop, ALFWorld, and Search QA.
LLM agents · T2PO · exploration control
16 min read
