Agent-Dice: Geometric Consensus Filtering Beats Catastrophic Forgetting in LLM Agents

Agent-Dice introduces a framework that combines geometric consensus filtering with curvature-based importance weighting to disentangle knowledge updates, preventing catastrophic forgetting in large-language-model agents while preserving plasticity. It demonstrates a superior stability-plasticity trade-off on GUI and tool-use benchmarks across multiple base models.


1. Problem: Knowledge Update Conflict

Large language model (LLM) agents are moving from single‑task to multi‑task, cross‑environment scenarios, but they suffer from the stability‑plasticity dilemma: learning new skills often overwrites previously acquired abilities, a phenomenon known as catastrophic forgetting. The core issue is the lack of distinction between "generic knowledge" (e.g., clicking an icon) and "conflicting knowledge" (e.g., platform‑specific settings), which leads to interference when parameters are updated jointly.

2. Core Solution: Agent-Dice’s Geometric Game

Step 1 – Geometric Consensus Filtering: The algorithm first records the sign (positive or negative) of each task’s gradient on every parameter. A direction is considered a "geometric consensus" only if the overwhelming majority of tasks agree on that sign. Gradients that oppose this consensus are treated as "interfering updates" and are zeroed out, acting as a filter that removes the gradient noise responsible for forgetting.
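A minimal sketch of this step in NumPy. The summary does not specify the exact consensus rule, so the `threshold` hyperparameter and the function name are illustrative, not the paper's implementation:

```python
import numpy as np

def consensus_filter(task_grads, threshold=2/3):
    """Keep only gradient entries whose sign matches the majority
    ("geometric consensus") across tasks; zero out the rest.

    task_grads: (num_tasks, num_params) per-task gradients
    threshold:  fraction of tasks that must agree on a sign
                (hypothetical hyperparameter).
    """
    signs = np.sign(task_grads)
    pos_frac = (signs > 0).mean(axis=0)   # per-parameter vote shares
    neg_frac = (signs < 0).mean(axis=0)
    consensus = np.where(pos_frac >= threshold, 1.0,
                np.where(neg_frac >= threshold, -1.0, 0.0))
    # Entries that oppose the consensus direction are the
    # "interfering updates" and are filtered to zero.
    mask = (signs == consensus) & (consensus != 0)
    return task_grads * mask

# Toy example: 3 tasks, 4 parameters
g = np.array([[0.5, -0.2, 0.1, -0.3],
              [0.4,  0.3, 0.2, -0.1],
              [0.6, -0.1, 0.3, -0.2]])
filtered = consensus_filter(g)
# Task 2's +0.3 on parameter 1 opposes the negative consensus
# and is zeroed; sign-consistent entries pass through unchanged.
```

Note that a parameter with no clear majority (neither sign reaches the threshold) is frozen entirely, which is one simple way to realize the stability side of the trade-off.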

Step 2 – Curvature-Based Importance Weighting: After filtering, the remaining updates are weighted according to their magnitude, which serves as a proxy for curvature, i.e., importance in the loss landscape. Larger updates indicate parameters that are more critical for the current task. A softmax-style normalization assigns higher weights to representative tasks within the consensus set, ensuring that the fused parameters preserve high-confidence features while still following the majority direction.
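A corresponding sketch of the weighting step, assuming a per-parameter softmax over task-wise update magnitudes; the temperature knob and function name are assumptions for illustration:

```python
import numpy as np

def curvature_weighted_merge(filtered_grads, temperature=1.0):
    """Fuse consensus-filtered task gradients (zeros mark entries
    removed by the filter). Each surviving entry is weighted by a
    softmax over its magnitude, the proxy for curvature/importance.

    filtered_grads: (num_tasks, num_params) array
    temperature:    softmax temperature (hypothetical knob)
    """
    mags = np.abs(filtered_grads)
    # Exclude filtered-out (zero) entries from the softmax.
    logits = np.where(mags > 0, mags / temperature, -np.inf)
    col_max = logits.max(axis=0, keepdims=True)
    safe_max = np.where(np.isfinite(col_max), col_max, 0.0)
    e = np.exp(logits - safe_max)          # exp(-inf) -> 0
    denom = e.sum(axis=0, keepdims=True)
    weights = e / np.where(denom > 0, denom, 1.0)
    # Weighted fusion: large-magnitude (high-importance) updates
    # dominate, but all survivors share the consensus direction.
    return (weights * filtered_grads).sum(axis=0)

# Two parameters: the second survives in only one task.
filtered = np.array([[0.5, 0.0],
                     [0.4, 0.2],
                     [0.6, 0.0]])
fused = curvature_weighted_merge(filtered)
```

Because zeroed entries get a weight of zero, a parameter kept by only one task is driven entirely by that task's update, while contested parameters become a magnitude-weighted blend along the agreed direction.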

These two stages together balance stability (by eliminating conflicting gradients) and plasticity (by emphasizing important new knowledge).

3. Experimental Performance: GUI and Tool‑use “All‑rounder”

Agent-Dice was evaluated on two domains: GUI agents and tool‑use agents. For GUI agents, benchmarks AITZ, AndroidControl, and GUI‑Odyssey were used, reporting Type accuracy, single‑step success rate (SR), trajectory success rate (TSR), and average Z‑score (AvgZ). For tool‑use, a subset of the ToolACE dataset measured function name prediction (Func), full function‑parameter prediction (Full), and AvgZ.
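AvgZ serves to aggregate heterogeneous metrics onto a common scale. One plausible reading (the normalization population is not specified in this summary, so treat the sketch as illustrative, not the paper's definition) is to z-normalize each benchmark's scores across the compared methods, then average per method:

```python
import numpy as np

def avg_z(scores):
    """scores: (num_methods, num_benchmarks) raw metric matrix.
    Z-normalize each benchmark column across methods, then average
    the z-scores per method (hypothetical reading of AvgZ)."""
    mu = scores.mean(axis=0, keepdims=True)
    sigma = scores.std(axis=0, keepdims=True)
    z = (scores - mu) / np.where(sigma > 0, sigma, 1.0)
    return z.mean(axis=1)
```

Under this reading, a method that is consistently above average across benchmarks gets a positive AvgZ even if its raw metrics (accuracy vs. success rate) live on different scales.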

Across multiple base models such as OS‑Atlas‑Pro‑7B and Qwen3‑VL‑8B, Agent‑Dice achieved the highest AvgZ scores, significantly outperforming traditional incremental learning methods. It mitigated forgetting of earlier tasks (e.g., AITZ, AndroidControl) while improving learning of new tasks (e.g., GUI‑Odyssey), demonstrating both higher stability and enhanced plasticity.

In tool‑use experiments, models equipped with varying native tool‑calling abilities all showed robust performance, with Agent‑Dice leading baselines in function‑name and parameter prediction accuracy, confirming its generality and resistance to interference in complex multi‑stage tasks.

4. Conclusion

Agent‑Dice provides a rigorous theoretical and practical framework for building "digital twins" that can continuously evolve without forgetting. By disentangling knowledge updates through geometric consensus filtering and importance weighting, the method enables LLM agents to learn smarter and grow more robustly.

Paper: https://arxiv.org/abs/2601.03641

Open‑source code: https://github.com/Wuzheng02/Agent-Dice

GUI · LLM Agent · Continual Learning · Catastrophic Forgetting · Geometric Consensus Filtering · Tool-use
Written by

Machine Learning Algorithms & Natural Language Processing

Focused on frontier AI technologies, empowering AI researchers' progress.
