How DeepAgent Redefines AI Agents with Memory Folding and ToolPO

This article breaks down the DeepAgent paper, explaining its novel "main model + auxiliary model" architecture, the memory‑folding mechanism that compresses long‑context reasoning, and the ToolPO reinforcement strategy that enables efficient tool discovery and usage.

360 Zhihui Cloud Developer

Problem Statement

Traditional AI agents follow fixed pipelines (plan → search → execute), which leads to three major limitations:

Lack of autonomy: agents cannot discover or incorporate new tools during a task.

Context explosion: as interaction steps accumulate, the dialogue history grows beyond the model's context window and important information is forgotten.

High tool‑learning cost: training agents to use thousands of tools is unstable and expensive.

DeepAgent Architecture

DeepAgent solves these issues with a collaborative "main model + auxiliary model" design.

Main reasoning model (a Large Reasoning Model, LRM): acts as the commander. It continuously analyses the task, decides when to invoke tools, determines when to trigger memory folding, and maintains a global view of the entire task.

Auxiliary Large Model: supports the commander by (1) summarising lengthy tool documentation, (2) denoising tool outputs, and (3) performing the memory‑folding operation.
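The division of labour can be sketched as two cooperating components. This is a minimal, hypothetical illustration, not the paper's actual API: the class names and methods are assumptions, and both LLM calls are replaced by trivial stand-ins.

```python
# Hypothetical sketch of the "main model + auxiliary model" split.

class AuxiliaryModel:
    """Offloads heavy text processing from the commander."""

    def summarise(self, tool_doc: str, max_chars: int = 80) -> str:
        # Stand-in for LLM summarisation: enforce a character budget.
        return tool_doc if len(tool_doc) <= max_chars else tool_doc[:max_chars] + "..."

    def denoise(self, tool_output: str) -> str:
        # Stand-in for LLM denoising: drop blank lines and stray whitespace.
        return "\n".join(l.strip() for l in tool_output.splitlines() if l.strip())


class MainReasoningModel:
    """The commander: keeps the global view and decides what happens next."""

    def __init__(self, aux: AuxiliaryModel, context_budget: int = 500):
        self.aux = aux
        self.context_budget = context_budget
        self.history: list[str] = []

    def observe(self, tool_output: str) -> None:
        # Tool results enter the context only after the auxiliary model cleans them.
        self.history.append(self.aux.denoise(tool_output))

    def should_fold(self) -> bool:
        # Emit the fold trigger once the accumulated context exceeds the budget.
        return sum(len(h) for h in self.history) > self.context_budget
```

The point of the split is that the commander's context only ever receives compact, pre-processed text, while the bulky summarisation and denoising work happens outside it.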

Memory Folding Mechanism

When the LRM emits the special token <fold_thought>, the auxiliary model compresses the accumulated dialogue into a structured "memory card" consisting of three fields:

Scenario Memory: a high‑level overview of past decisions, milestones and overall progress.

Working Memory: current goals, encountered challenges and next steps.

Tool Memory: records of the tools used, their outcomes and lessons learned.

The raw context is discarded, leaving only these concise records, which prevents context overflow and enables the agent to continue long‑horizon tasks.
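One fold can be sketched as follows. The three field names follow the article; the summaries themselves are placeholders where a real system would prompt the auxiliary LLM, and the `TOOL:` prefix is an illustrative convention, not the paper's format.

```python
# Sketch of a single memory fold: raw history in, structured card out.
from dataclasses import dataclass


@dataclass
class MemoryCard:
    scenario_memory: str  # past decisions, milestones, overall progress
    working_memory: str   # current goals, challenges, next steps
    tool_memory: str      # tools used, their outcomes, lessons learned


def fold(history: list[str]) -> MemoryCard:
    # Placeholder summarisation: keep a step count, the latest step,
    # and the tool records; a real system would call the auxiliary LLM.
    return MemoryCard(
        scenario_memory=f"{len(history)} steps completed so far",
        working_memory=history[-1] if history else "",
        tool_memory="; ".join(h for h in history if h.startswith("TOOL:")),
    )


history = ["plan the search", "TOOL: web_search -> 3 results", "draft the answer"]
card = fold(history)
history = [repr(card)]  # raw context discarded; only the card survives
```

After the fold, the context cost of the episode so far is the size of the card rather than the size of the full transcript, which is what keeps long-horizon tasks within the window.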

ToolPO: Reinforcement Learning for Tool Use

ToolPO trains the agent with a dual‑reward signal:

Global task reward: evaluates the final success of the whole task.

Local behavior reward: evaluates the correctness of each tool call and the timing of memory folding.

Rewards are assigned at the token level, separating global advantage (overall direction) from local advantage (specific tool‑call actions). Training is low‑cost because tool APIs are simulated by another LLM instead of real services, reducing latency and instability.
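The separation of global and local advantages at the token level can be sketched like this. The additive combination and the mixing weight `w_local` are illustrative assumptions, not the paper's exact loss; the sketch only shows how a span of tool-call tokens can receive credit distinct from the rest of the trajectory.

```python
# Hedged sketch of ToolPO-style token-level credit assignment: every token
# receives the global task advantage, and tokens inside a tool-call span
# additionally receive that call's local advantage.

def token_advantages(num_tokens, global_adv, tool_spans, local_advs, w_local=0.5):
    """tool_spans: (start, end) token ranges of tool calls (end exclusive);
    local_advs: one local advantage per span."""
    adv = [global_adv] * num_tokens
    for (start, end), a_local in zip(tool_spans, local_advs):
        for t in range(start, end):
            adv[t] += w_local * a_local  # extra credit or blame on call tokens
    return adv


adv = token_advantages(
    num_tokens=10,
    global_adv=1.0,       # the overall task succeeded
    tool_spans=[(3, 6)],  # tokens 3..5 emitted a tool call
    local_advs=[-1.0],    # but that particular call was malformed
)
# tokens outside the span keep the global signal; the call itself is dampened
```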

Experimental Results

DeepAgent was evaluated on eight benchmarks covering tool usage, web navigation and virtual‑environment tasks. It consistently outperformed baseline agents, demonstrating:

Dynamic tool discovery without pre‑binding, increasing autonomy.

Effective mitigation of long‑context issues via memory folding.

Efficient, low‑cost training through simulated tool environments.

Key Takeaways

The "main + auxiliary" split lets the reasoning model focus on high‑level planning while the auxiliary model handles heavyweight text processing.

Memory folding provides a practical solution to the context‑window limitation of large language models.

ToolPO’s fine‑grained reward attribution accelerates stable learning of complex tool‑use behaviours.

References

Paper: http://arxiv.org/abs/2510.21618

Code repository: https://github.com/RUC-NLPIR/DeepAgent

Written by

360 Zhihui Cloud Developer

360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
