How Multi‑Agent MCTS and Information‑Gain Rewards Are Transforming Mobile GUI and Search Agents

This article reviews two recent ICLR 2026 papers—M²‑Miner, a multi‑agent Monte‑Carlo Tree Search framework for low‑cost mobile GUI data mining, and IGPO, an information‑gain‑based reinforcement‑learning method that provides dense rewards for multi‑turn search agents—detailing their designs, experiments, and open‑source releases.

M²‑Miner: Multi‑Agent Enhanced Monte‑Carlo Tree Search for Mobile GUI Data Mining

Training high‑performing mobile GUI agents requires large collections of intent‑trajectory pairs, i.e., annotated user‑interaction sequences. Manual labeling and prior data‑mining pipelines suffer from three intertwined problems:

High construction cost – each trajectory must be collected and verified by humans.

Poor data quality – noisy or incomplete trajectories reduce the fidelity of the learned policy.

Insufficient diversity – limited intent coverage hampers generalisation across apps.

To overcome these issues, the authors design M²‑Miner, a low‑cost, fully automated framework that leverages Monte‑Carlo Tree Search (MCTS) together with three cooperating agents (a minimal sketch of their interplay follows the list):

InferAgent explores the GUI state space by performing page‑navigation actions guided by MCTS rollouts.

OrchestraAgent accelerates the search by pruning low‑value branches and re‑using successful sub‑paths.

JudgeAgent evaluates completed trajectories, assigning a quality score based on task success and interaction smoothness.
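
As a rough, self-contained sketch of how the three agents could cooperate in one mining episode, consider the toy loop below. The Node layout, UCT selection, random judging score, and top-k pruning are all illustrative assumptions, not the paper's implementation:

```python
# Toy three-agent MCTS mining loop. All internals here (Node layout, UCT
# selection, random judging, top-k pruning) are illustrative assumptions.
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(node, c=1.4):
    # InferAgent's selection rule: unvisited nodes first, then UCT.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def infer_agent_rollout(root, actions, depth=4):
    # InferAgent: descend the tree, expanding pages with candidate actions.
    node = root
    for _ in range(depth):
        if not node.children:
            node.children = [Node(node.state + a, parent=node) for a in actions]
        node = max(node.children, key=uct)
    return node

def judge_agent_score(leaf):
    # JudgeAgent stand-in: random here; the paper scores task success
    # and interaction smoothness.
    return random.random()

def orchestra_agent_prune(root, keep=2):
    # OrchestraAgent stand-in: discard low-value branches at the root.
    root.children.sort(key=lambda n: n.value, reverse=True)
    del root.children[keep:]

root = Node("home/")
best_state, best_score = None, -1.0
for _ in range(32):                       # MCTS rollouts
    leaf = infer_agent_rollout(root, ["tap>", "swipe>", "back>"])
    score = judge_agent_score(leaf)
    if score > best_score:                # remember the best judged trajectory
        best_state, best_score = leaf.state, score
    node = leaf
    while node is not None:               # backpropagate the judged score
        node.visits += 1
        node.value += score
        node = node.parent
    orchestra_agent_prune(root)

print(f"mined trajectory: {best_state} (score {best_score:.2f})")
```

The structural point is the division of labor: InferAgent explores, OrchestraAgent keeps the search tree small and reuses what works, and JudgeAgent filters for quality before anything becomes training data.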

The framework introduces an intent‑recovery strategy: after a successful rollout, the system backtracks to identify intermediate states that correspond to distinct user intents, thereby extracting additional intent‑trajectory pairs without extra environment interaction. This step directly addresses the diversity problem (a toy sketch of the backtracking step follows at the end of this section).

A progressive model‑in‑the‑loop training scheme is then applied: the current GUI policy is fine‑tuned on the newly mined data, which in turn improves InferAgent's navigation efficiency, creating a positive feedback cycle.

Extensive experiments on several public mobile GUI benchmarks (e.g., RICO, MobileApp‑Gym) show that agents fine‑tuned with M²‑Miner data achieve state‑of‑the‑art success rates, surpassing prior data‑mining baselines by a noticeable margin. The authors release the full codebase and dataset to enable reproducibility.

Website: https://larry225.github.io/M2-Miner/
GitHub: https://github.com/ant-research/M2-Miner
arXiv: https://arxiv.org/abs/2602.05429
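
Before moving on, here is a minimal sketch of that intent‑recovery backtracking step. The page names and the `describe_intent` heuristic are invented for illustration; the framework identifies intent‑bearing intermediate states in mined GUI trajectories:

```python
# Sketch of intent recovery by backtracking. The page names and the
# `describe_intent` heuristic are invented for illustration.

def recover_intents(trajectory, describe_intent):
    """Walk backward; each prefix ending in a state that maps to a new,
    distinct intent yields an extra intent-trajectory pair for free."""
    pairs, seen = [], set()
    for end in range(len(trajectory), 0, -1):
        intent = describe_intent(trajectory[end - 1])
        if intent and intent not in seen:
            seen.add(intent)
            pairs.append((intent, trajectory[:end]))
    return pairs

steps = ["home", "search:shoes", "results", "item:42", "cart", "checkout"]
intent_of = {"checkout": "buy running shoes", "search:shoes": "find shoes"}.get
for intent, sub in recover_intents(steps, intent_of):
    print(intent, "->", sub)
# buy running shoes -> ['home', ..., 'checkout']   (the full trajectory)
# find shoes -> ['home', 'search:shoes']           (a second pair, no new rollout)
```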

Information‑Gain‑Based Policy Optimization (IGPO) for Multi‑Turn Search Agents

Large‑language‑model (LLM) agents that perform multi‑turn search must reason iteratively and acquire external knowledge. Existing reinforcement‑learning (RL) approaches typically assign a single, result‑level reward after the final answer is produced. This sparse reward regime creates three critical failure modes in long‑horizon tasks:

Advantage collapse: identical rewards for all rollouts eliminate useful gradient signals (a toy illustration follows this list).

Credit‑assignment deficiency: intermediate decisions that contribute to the final answer receive no feedback.

Low sample efficiency: each rollout yields only one scalar reward, wasting the information contained in intermediate steps.
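
The first failure mode is easy to see in a toy, GRPO‑style advantage computation (our own illustration, not code from the paper): when every rollout in a group receives the same sparse outcome reward, all advantages are zero and the policy gradient vanishes.

```python
# Toy GRPO-style advantage computation (our illustration, not the paper's
# code) showing advantage collapse under a sparse, result-level reward.
import statistics

def group_advantages(rewards):
    # Standardize each rollout's reward within its sampling group.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard the zero-variance case
    return [(r - mean) / std for r in rewards]

print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # all correct -> [0, 0, 0, 0]: no gradient
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # all wrong   -> [0, 0, 0, 0]: no gradient
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # mixed group -> [1, -1, 1, -1]: signal
```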

IGPO addresses these issues by modelling each interaction turn as an incremental information‑gain process toward the ground‑truth answer. The turn‑level intrinsic reward is defined as the marginal increase in the probability that the model generates the correct answer after incorporating the turn's new observation:

r_t = P(answer | history_{t}) - P(answer | history_{t-1})
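
A minimal sketch of this reward, assuming the answer probability is scored under the policy's own distribution (e.g., summed token log‑probabilities). The toy `answer_logprob` below is a stand‑in invented for illustration, not IGPO's implementation:

```python
# Sketch of the turn-level information-gain reward. `answer_logprob` is a
# toy stand-in for scoring the ground-truth answer under the policy's own
# token distribution; its body is an assumption made for illustration.
import math

def answer_logprob(history, answer):
    # Fake belief state: log-probability rises as the history accumulates
    # words of the ground-truth answer.
    overlap = sum(word in history for word in answer.split())
    return -5.0 + 2.0 * overlap

def info_gain_reward(prev_history, new_history, answer):
    """r_t = P(answer | history_t) - P(answer | history_{t-1})"""
    return (math.exp(answer_logprob(new_history, answer))
            - math.exp(answer_logprob(prev_history, answer)))

h0 = "Q: Who wrote Hamlet?"
h1 = h0 + " [search result] Hamlet is a tragedy by William Shakespeare."
print(info_gain_reward(h0, h1, "William Shakespeare"))  # positive: the turn helped
```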

Because the probability is obtained directly from the model's own belief state, IGPO requires no external reward model and no costly Monte Carlo estimation. The intrinsic turn‑level rewards are summed with the conventional result‑level reward, yielding a dense supervision signal that guides both short‑term decision making and long‑term planning.

Empirical evaluation on in‑domain benchmarks (e.g., Web‑Search, Fact‑Checking) and out‑of‑domain tasks demonstrates that IGPO consistently outperforms strong baselines such as REINFORCE‑based search agents and PPO with sparse rewards. Improvements are reported in both final‑answer accuracy and the number of rollouts needed to reach a target performance level, indicating higher data efficiency.

GitHub: https://github.com/GuoqingWang1/IGPO
arXiv: https://arxiv.org/abs/2510.14967
Hugging Face: https://huggingface.co/papers/2510.14967

open-source · reinforcement learning · Multi-agent · LLM agents · Monte Carlo Tree Search · Information Gain · GUI Data Mining