How 8 Agents Can Converge Stably: Trust‑Region Constraints Reshape Multi‑Agent LLM Workflows

The paper introduces TeamTR, a trust‑region fine‑tuning framework that mitigates compounding occupancy shift in multi‑agent LLM workflows through fresh rollout sampling and token‑level KL constraints. It reports stable gains, 7.1% on average across benchmarks, with markedly larger improvements on AIME24.

Machine Learning Algorithms & Natural Language Processing

Problem

In a multi‑agent LLM workflow, where a planner, solver, critic and executor share a common context, updating one agent changes the distribution of inputs seen by the agents downstream of it. Reusing a rollout generated before that update to train the next agent therefore evaluates that agent on stale inputs. The paper calls the resulting error “compounding occupancy shift”: the bias accumulates along the update order.
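A toy illustration of the shift, under made-up numbers: the planner's output distribution and the solver's per-input accuracy below are hypothetical and not taken from the paper. The sketch only shows that a rollout collected before the planner update describes inputs the solver no longer sees.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def expected_accuracy(input_dist, accuracy_per_input):
    """Downstream accuracy averaged over the inputs the agent actually sees."""
    return sum(w * a for w, a in zip(input_dist, accuracy_per_input))

# Hypothetical planner output distribution over three message types.
planner_before = [0.6, 0.3, 0.1]   # when the stored rollout was collected
planner_after = [0.3, 0.3, 0.4]    # after the planner has been updated

# Hypothetical solver accuracy on each message type.
solver_accuracy = [0.9, 0.7, 0.4]

shift = total_variation(planner_before, planner_after)
stale_estimate = expected_accuracy(planner_before, solver_accuracy)
fresh_estimate = expected_accuracy(planner_after, solver_accuracy)

print(f"input shift (TV): {shift:.2f}")
print(f"solver accuracy on stale rollout: {stale_estimate:.2f}")
print(f"solver accuracy on fresh rollout: {fresh_estimate:.2f}")
```

With these numbers the stale rollout suggests 0.79 while the inputs the solver now receives give 0.64, so any update computed from the stale data is aimed at the wrong distribution.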

Theoretical Insight

The paper proves that the penalty from stale‑occupancy evaluation grows quadratically with the number of agents, O(n²). If a fresh rollout is sampled after each component update, so that each stage is evaluated under the intermediate occupancy, the dominant penalty falls to linear, O(n).
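The difference in growth rates can be seen with a simple counting model. This is an illustrative assumption, not the paper's bound: suppose each agent update shifts the team's occupancy by at most a hypothetical amount `eps`, and that a stage evaluated on a stale rollout pays for its own shift plus the shifts of every earlier update.

```python
def stale_penalty(n_agents, eps=1.0):
    """Stage i is evaluated on a rollout that predates all updates, so it pays
    for its own shift plus the i - 1 earlier ones: i * eps in total."""
    return sum(i * eps for i in range(1, n_agents + 1))

def fresh_penalty(n_agents, eps=1.0):
    """With a fresh rollout before each stage, every stage pays only for
    its own shift: eps per stage."""
    return n_agents * eps

for n in (3, 8):
    print(n, stale_penalty(n), fresh_penalty(n))
```

Under this model the stale total is n(n + 1)/2 · eps, which is quadratic in n, while the fresh total is n · eps. For eight agents that is 36 versus 8 units of eps.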

TeamTR Method

TeamTR performs stage‑wise fine‑tuning that combines two mechanisms, fresh rollout sampling and a token‑level trust region. For each agent in turn, it proceeds in four steps:

1. Sample a fresh rollout using the partially updated team.

2. Construct a surrogate objective for the target agent.

3. Apply a token‑level KL penalty or early stopping to bound the update (the trust region).

4. Refresh the trajectory before proceeding to the next agent.

This keeps each agent’s local improvement while preventing uncontrolled drift of the overall team distribution.
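The stage-wise procedure can be sketched on a toy problem. Everything below is an assumption made for illustration and is not the TeamTR code: each "agent" is a categorical policy over three tokens, the team reward is 1 only when every agent emits token 0, the surrogate is an importance-weighted policy-gradient objective, and the names `rollout`, `update_agent`, `train_team`, `beta` and `delta` are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def rollout(team, rng, n_samples):
    """Sample from the current team; each agent emits one token per episode."""
    dists = [softmax(logits) for logits in team]
    samples = []
    for _ in range(n_samples):
        tokens = [rng.choices(range(len(d)), weights=d)[0] for d in dists]
        reward = 1.0 if all(t == 0 for t in tokens) else 0.0  # toy team reward
        samples.append((tokens, reward))
    return samples

def update_agent(team, idx, samples, lr=2.0, beta=0.1, delta=0.05, max_steps=50):
    """Ascend an importance-weighted surrogate with a KL penalty for agent idx,
    stopping before KL(old || new) exceeds delta. Returns the final KL."""
    old = softmax(team[idx])
    baseline = sum(r for _, r in samples) / len(samples)
    for _ in range(max_steps):
        new = softmax(team[idx])
        # Gradient of -beta * KL(old || new) with respect to the logits.
        grad = [-beta * (new[a] - old[a]) for a in range(len(new))]
        for tokens, reward in samples:
            tok = tokens[idx]
            weight = (new[tok] / old[tok]) * (reward - baseline) / len(samples)
            for a in range(len(new)):
                grad[a] += weight * ((1.0 if a == tok else 0.0) - new[a])
        candidate = [l + lr * g for l, g in zip(team[idx], grad)]
        if kl(old, softmax(candidate)) > delta:
            break  # early stop: this step would leave the trust region
        team[idx] = candidate
    return kl(old, softmax(team[idx]))

def train_team(team, n_samples=2000, seed=0):
    rng = random.Random(seed)
    kls = []
    for idx in range(len(team)):
        # Fresh rollout from the partially updated team, never a stored one.
        samples = rollout(team, rng, n_samples)
        kls.append(update_agent(team, idx, samples))
    return kls

team = [[0.0, 0.0, 0.0] for _ in range(3)]
kls = train_team(team)
print([round(k, 4) for k in kls])
```

Because `rollout` is called inside the loop, each agent is trained on data produced by the teammates as they currently are, and the early-stop check keeps every per-agent update within `delta` of its starting policy.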

Experimental Results

Benchmarks include mathematical reasoning (AIME24, AIME25), logical reasoning, active reasoning and planning. TeamTR yields an average gain of 7.1 % over single‑agent fine‑tuning and multi‑agent baselines.

On AIME24, a team of three Qwen3‑8B agents improves from 71.1 % to 88.1 %, while the stale gap falls from 0.31 to 0.08. A heterogeneous 8B + 14B + 32B team improves from 77.8 % to 92.5 %.

When scaling to eight agents, TeamTR reaches 87.9 % versus 58.7 % for naive sequential training, demonstrating that more agents do not guarantee better collaboration without distribution control.

Ablation studies show that removing either the KL penalty or the fresh‑rollout step degrades stability, and removing both yields the worst performance.

Token‑level KL monitoring shows out‑of‑region updates of 2 % for TeamTR on AIME25, compared with 21 % (DAPO), 44 % (GRPO) and 60 % (PPO).
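The summary does not say how an "out-of-region" update is counted. One plausible reading, assumed here, is the share of monitored token positions whose KL between the old and new policy exceeds a trust-region radius; the function names and the radius `delta` below are hypothetical.

```python
import math

def token_kl(old_logprobs, new_logprobs):
    """KL(old || new) at one token position, given log-probabilities
    over the same vocabulary."""
    return sum(math.exp(lo) * (lo - ln) for lo, ln in zip(old_logprobs, new_logprobs))

def out_of_region_fraction(per_token_kls, delta):
    """Share of token positions whose KL exceeds the radius delta."""
    if not per_token_kls:
        return 0.0
    return sum(1 for k in per_token_kls if k > delta) / len(per_token_kls)

# Made-up per-token KL values for one batch of updates.
kls = [0.01, 0.20, 0.03, 0.50]
print(out_of_region_fraction(kls, delta=0.1))
```

Tracking this fraction during training gives a single number that can be compared across methods, which is how the figures above for TeamTR, DAPO, GRPO and PPO would be read under this assumption.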

Component Replacement

Replacing a 1.5 B agent in a Qwen2.5‑Instruct team with a Qwen3‑8B model causes a performance shock. TeamTR’s Stage‑0 alignment mitigates the shock, delivering +27 % (AIME24) and +24 % (ARBench‑DC) improvements.

Conclusion

Training multi‑agent LLM workflows requires controlling distribution drift rather than only strengthening individual agents. TeamTR integrates fresh rollout sampling and token‑level trust‑region constraints, providing a monitorable, scalable framework that supports component swapping.

Paper: https://arxiv.org/abs/2605.15207

Code: https://github.com/Yydc/TeamTR
