How 8 Agents Can Converge Stably: Trust‑Region Constraints Reshape Multi‑Agent LLM Workflows
The paper introduces TeamTR, a trust‑region fine‑tuning framework that mitigates compounding occupancy shift in multi‑agent LLM workflows by sampling fresh rollouts and applying token‑level KL constraints. It reports a stable average gain of 7.1% and larger improvements on tasks such as AIME24.
Problem
In multi‑agent LLM workflows, where agents such as a planner, solver, critic, and executor share a common context, updating one agent changes the distribution of inputs seen by downstream agents. Reusing a rollout generated before the update creates a “compounding occupancy shift”: the bias accumulates along the update order.
Theoretical Insight
The paper proves that the penalty from stale‑occupancy evaluation grows quadratically with the number of agents, O(n²). If a fresh rollout is sampled after each component update, so that each agent is evaluated under the intermediate occupancy of the partially updated team, the dominant penalty becomes linear, O(n).
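One way to build intuition for the difference in scaling is a toy accounting, sketched here in Python. This is not the paper's proof: it assumes that every agent update shifts the team's occupancy by at most `eps` and that these shifts simply add up. The function names `stale_penalty` and `fresh_penalty` are hypothetical.

```python
def stale_penalty(n_agents: int, eps: float) -> float:
    """Total penalty when all n updates are scored on one pre-update rollout.

    Toy assumption: the k-th update is scored on data that is k shifts
    out of date (its own shift plus the k-1 before it), so it pays k * eps.
    """
    return sum(k * eps for k in range(1, n_agents + 1))


def fresh_penalty(n_agents: int, eps: float) -> float:
    """Total penalty when a fresh rollout is sampled before every update.

    Toy assumption: each update only pays for its own shift, eps.
    """
    return n_agents * eps


# Compare the two totals as the team grows.
for n in (2, 4, 8):
    print(n, stale_penalty(n, 0.01), fresh_penalty(n, 0.01))
```

Under these assumptions the stale total is eps · n(n+1)/2, which grows quadratically in n, while the fresh total is n · eps, which grows linearly.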
TeamTR Method
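TeamTR performs stage‑wise fine‑tuning built on two mechanisms, fresh rollout sampling and a token‑level trust region. For each agent in turn, it runs four steps: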
Sample a fresh rollout using the partially updated team.
Construct a surrogate objective for the target agent.
Apply a token‑level KL penalty or early‑stop to bound the update (trust‑region).
Refresh the trajectory before proceeding to the next agent.
This keeps each agent’s local improvement while preventing uncontrolled drift of the overall team distribution.
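The stage‑wise loop described above can be sketched with toy agents. This is an illustration under strong assumptions, not the TeamTR implementation: each agent is reduced to a single categorical distribution, the surrogate optimization is replaced by a hypothetical `propose_target` callback, and the trust region is enforced by halving the step until the KL bound `delta` holds. All names here are made up for the example.

```python
import math
from typing import Callable, List

Policy = List[float]  # toy agent: one categorical distribution over a tiny vocabulary


def kl(p: Policy, q: Policy) -> float:
    """KL(p || q) for two categorical distributions."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi <= 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def bounded_update(old: Policy, target: Policy, delta: float,
                   max_halvings: int = 20) -> Policy:
    """Move `old` toward `target`, halving the step until KL(old || new) <= delta."""
    step = 1.0
    for _ in range(max_halvings):
        new = [(1 - step) * o + step * t for o, t in zip(old, target)]
        if kl(old, new) <= delta:
            return new
        step /= 2
    return old  # no acceptable step found: skip this update


def train_team(team: List[Policy],
               sample_rollout: Callable[[List[Policy]], object],
               propose_target: Callable[[int, object], Policy],
               delta: float) -> List[Policy]:
    """Update agents one at a time, resampling data before each update."""
    team = list(team)
    for i in range(len(team)):
        rollout = sample_rollout(team)       # fresh data from the partially updated team
        target = propose_target(i, rollout)  # stand-in for optimizing the surrogate objective
        team[i] = bounded_update(team[i], target, delta)
    return team


# Toy usage: three identical agents, each pulled toward the same target.
rollouts_seen = []


def toy_rollout(team):
    snapshot = [list(p) for p in team]  # a real rollout would be sampled trajectories
    rollouts_seen.append(snapshot)
    return snapshot


def toy_target(i, rollout):
    return [0.9, 0.1]


trained = train_team([[0.5, 0.5]] * 3, toy_rollout, toy_target, delta=0.01)
print(trained)
```

The point of the sketch is the ordering: `sample_rollout` is called inside the loop, so the second agent's data already reflects the first agent's update, and every accepted update stays within `delta` of the policy it replaces.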
Experimental Results
Benchmarks include mathematical reasoning (AIME24, AIME25), logical reasoning, active reasoning and planning. TeamTR yields an average gain of 7.1 % over single‑agent fine‑tuning and multi‑agent baselines.
On AIME24, a team of three Qwen3‑8B agents improves from 71.1 % to 88.1 %, while its stale gap falls from 0.31 to 0.08. A heterogeneous 8B + 14B + 32B team improves from 77.8 % to 92.5 %.
When scaling to eight agents, TeamTR reaches 87.9 % versus 58.7 % for naive sequential training, demonstrating that more agents do not guarantee better collaboration without distribution control.
Ablation studies: removing the KL penalty or the fresh‑rollout step degrades stability; removing both yields the worst performance.
Token‑level KL monitoring shows out‑of‑region updates of 2 % for TeamTR on AIME25, compared with 21 % (DAPO), 44 % (GRPO) and 60 % (PPO).
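A monitoring statistic of this kind could be computed roughly as follows. The summary does not give the paper's exact definition of an out‑of‑region update, so this sketch assumes one: an update counts as out of region when its mean token‑level KL exceeds a threshold `delta`. The function names and the threshold value are hypothetical.

```python
import math
from typing import List, Sequence


def mean_token_kl(old_dists: Sequence[Sequence[float]],
                  new_dists: Sequence[Sequence[float]]) -> float:
    """Average KL(old || new) over token positions.

    Each entry is a distribution over the vocabulary at one token position.
    Assumes the new distribution is nonzero wherever the old one is.
    """
    per_token = [
        sum(p * math.log(p / q) for p, q in zip(old, new) if p > 0)
        for old, new in zip(old_dists, new_dists)
    ]
    return sum(per_token) / len(per_token)


def out_of_region_rate(update_kls: List[float], delta: float) -> float:
    """Fraction of updates whose measured KL exceeds the trust-region bound."""
    if not update_kls:
        return 0.0
    return sum(k > delta for k in update_kls) / len(update_kls)


# Toy usage: five logged updates, two of which exceed delta = 0.01.
logged_kls = [0.001, 0.02, 0.005, 0.03, 0.002]
print(out_of_region_rate(logged_kls, delta=0.01))
```

Logging the per‑update KL rather than only the final rate also makes it possible to see at which agent or training stage the drift occurs.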
Component Replacement
Replacing a 1.5 B agent in a Qwen2.5‑Instruct team with a Qwen3‑8B model causes a performance shock. TeamTR’s Stage‑0 alignment mitigates the shock, delivering +27 % (AIME24) and +24 % (ARBench‑DC) improvements.
Conclusion
Training multi‑agent LLM workflows requires controlling distribution drift rather than only strengthening individual agents. TeamTR integrates fresh rollout sampling and token‑level trust‑region constraints, providing a monitorable, scalable framework that supports component swapping.
Paper: https://arxiv.org/abs/2605.15207
Code: https://github.com/Yydc/TeamTR
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
