How DeepAgent Achieves End‑to‑End Reasoning with 16,000+ Scalable Tools
DeepAgent is a new end‑to‑end reasoning agent that unifies autonomous thinking, dynamic tool search, and tool execution in a single workflow. It handles over 16,000 real APIs, supports embodied environments and research assistance, and achieves state‑of‑the‑art results across multiple benchmarks through four ingredients: a unified reasoning core, a memory‑folding mechanism, a structured memory system, and the ToolPO training framework.
Paper Overview
DeepAgent: A General Reasoning Agent with Scalable Toolsets (arXiv: https://arxiv.org/abs/2510.21618). Source code is available at https://github.com/Rednote-DeepExperience/DeepAgent. The model integrates autonomous reasoning, dynamic tool discovery, and execution in a single coherent workflow.
Motivation
Traditional workflow constraints: Existing frameworks such as ReAct or Plan‑and‑Solve follow a fixed “think‑act‑observe” loop, lack global task awareness, and require pre‑specified tools.
Limited toolsets in current agents: Recent agents (e.g., Search‑o1, WebThinker) support only a few tools such as search or browsing, which is insufficient for real‑world diversity.
Key Capabilities
Massive tool handling: DeepAgent can autonomously search, filter, and invoke the most suitable tool from a library of over 16,000 RapidAPI services, even when some APIs are unavailable (simulated by an LLM).
Embodied intelligence: In environments like ALFWorld, the agent uses a plug‑in action set (move, observe, pick) to complete goal‑directed tasks.
Research assistant: Equipped with web search, content extraction, code execution, visual QA, and file processing tools to provide comprehensive research support.
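To make the "search, filter, and invoke" capability concrete, here is a minimal sketch of tool retrieval over a large API library. The bag‑of‑words cosine scoring is a stand‑in for the dense retriever a system like DeepAgent would use, and the tool names and descriptions are invented for illustration.

```python
from collections import Counter
import math

# Toy tool library; DeepAgent's real library holds 16,000+ RapidAPI services.
TOOLS = {
    "get_weather": "Return current weather conditions for a given city",
    "search_flights": "Search airline flights between two airports by date",
    "movie_lookup": "Look up movie metadata such as cast rating release year",
}

def _vec(text: str) -> Counter:
    """Crude bag-of-words vector; a real agent would use dense embeddings."""
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def tool_search(query: str, k: int = 2) -> list[str]:
    """Rank tools by similarity between the query and each description."""
    q = _vec(query)
    ranked = sorted(TOOLS, key=lambda name: _cosine(q, _vec(TOOLS[name])),
                    reverse=True)
    return ranked[:k]

print(tool_search("what is the weather in a city"))  # get_weather ranks first
```

The key design point is that retrieval happens at inference time, inside the reasoning loop, rather than requiring tools to be pre‑specified in the prompt.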
Core Design
Unified Autonomous Reasoning Core
The central LLM drives a continuous “thought chain”. When external interaction is required, it emits special commands such as <tool_search>…</tool_search> to locate tools and <tool_call>…</tool_call> to invoke them. An auxiliary LLM handles lengthy tool documentation and filters noisy tool outputs, keeping the main reasoning focused.
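The command protocol above can be sketched as a simple parser that scans the model's generated text for these special tags. The tag names come from the article; the payload format and example text are assumptions for illustration.

```python
import re

# Matches <tool_search>...</tool_search>, <tool_call>...</tool_call>,
# and <fold_thought>...</fold_thought> blocks embedded in the thought chain.
COMMAND_RE = re.compile(
    r"<(tool_search|tool_call|fold_thought)>(.*?)</\1>", re.DOTALL
)

def dispatch(model_output: str) -> list[tuple[str, str]]:
    """Extract (command, payload) pairs emitted inside the thought chain."""
    return [(m.group(1), m.group(2).strip())
            for m in COMMAND_RE.finditer(model_output)]

text = (
    "I need weather data. <tool_search>current weather API</tool_search> "
    "Found one. <tool_call>get_weather(city=Paris)</tool_call>"
)
print(dispatch(text))
# -> [('tool_search', 'current weather API'), ('tool_call', 'get_weather(city=Paris)')]
```

In the real system, each extracted command would pause generation, trigger the corresponding external action, and feed the (filtered) result back into the context before reasoning resumes.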
Memory Folding Mechanism
At any critical point the agent can issue a <fold_thought> command. An auxiliary model compresses the interaction history into a structured summary, reducing computation and allowing the agent to reconsider strategies for long‑horizon tasks.
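A minimal sketch of the folding step, assuming the auxiliary summarizer is abstracted as a callable (in DeepAgent it is an auxiliary LLM producing a structured summary):

```python
def fold_memory(history: list[str], summarize, keep_recent: int = 2) -> list[str]:
    """Compress older turns into one summary entry; keep recent turns verbatim."""
    if len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [f"[folded] {summarize(older)}"] + recent

# Toy summarizer standing in for the auxiliary LLM.
def naive_summarize(turns: list[str]) -> str:
    return f"{len(turns)} earlier steps: " + "; ".join(turns)

history = [
    "searched for weather APIs",
    "called get_weather for Paris",
    "observed 18C, cloudy",
    "planning next step",
]
folded = fold_memory(history, naive_summarize)
print(len(folded))  # 3: one summary entry plus the two most recent turns
```

The benefit is twofold: the context passed to the main model shrinks (reducing computation), and the summary gives the agent a compact vantage point from which to reconsider its strategy.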
Structured Memory System
Episodic Memory: Records high‑level milestones, key decisions, and major events for long‑term reflection.
Working Memory: Short‑term cache for current sub‑goals, immediate challenges, and next actions, ensuring continuity across folds.
Tool Memory: A self‑updating JSON‑based handbook that logs usage patterns, successes, and failures, enabling continual optimization of tool‑selection strategies.
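The three memory types above might be laid out as follows. The field names are assumptions for illustration, not the paper's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class EpisodicMemory:
    """High-level milestones and decisions for long-term reflection."""
    milestones: list[str] = field(default_factory=list)
    key_decisions: list[str] = field(default_factory=list)

@dataclass
class WorkingMemory:
    """Short-term state that must survive a memory fold."""
    current_subgoal: str = ""
    challenges: list[str] = field(default_factory=list)
    next_actions: list[str] = field(default_factory=list)

@dataclass
class ToolMemory:
    """Self-updating handbook of per-tool usage statistics and notes."""
    usage: dict[str, dict] = field(default_factory=dict)

    def record(self, tool: str, success: bool, note: str = "") -> None:
        entry = self.usage.setdefault(
            tool, {"calls": 0, "successes": 0, "notes": []}
        )
        entry["calls"] += 1
        entry["successes"] += int(success)
        if note:
            entry["notes"].append(note)

tm = ToolMemory()
tm.record("get_weather", success=True, note="fast response")
tm.record("get_weather", success=False, note="rate limited")
print(tm.usage["get_weather"]["calls"])  # 2
```

Because ToolMemory accumulates success/failure statistics per tool, later tool‑selection steps can prefer APIs that have actually worked, which is the "continual optimization" the article describes.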
ToolPO Training Framework
DeepAgent is trained with Tool Policy Optimization (ToolPO), an end‑to‑end reinforcement learning method for universal tool use. Two innovations improve training stability:
LLM tool simulator: Replaces costly real‑API calls with a large model that mimics API responses, creating a stable, low‑cost training environment.
Dual‑advantage attribution: Combines a global reward based on final task success with a tool‑specific reward for correct API calls. Advantage‑based credit assignment ensures only the tokens that triggered a tool call receive the tool reward.
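The dual‑advantage idea can be sketched at the token level: every token in a rollout shares the global task advantage, while only the tokens inside a tool‑call span also receive the tool‑specific advantage. The numbers, span format, and function signature below are illustrative assumptions, not the paper's implementation.

```python
def token_advantages(
    n_tokens: int,
    call_spans: list[tuple[int, int]],  # [start, end) token ranges of tool-call emissions
    global_adv: float,                  # advantage from final task success
    tool_adv: float,                    # advantage from correct API calls
) -> list[float]:
    """Assign per-token advantages: global everywhere, tool credit only on call spans."""
    adv = [global_adv] * n_tokens
    for start, end in call_spans:
        for i in range(start, end):
            adv[i] += tool_adv  # credit only the tokens that triggered the call
    return adv

# 10-token rollout where tokens 3..5 emitted a successful tool call.
print(token_advantages(10, [(3, 6)], global_adv=0.5, tool_adv=1.0))
# -> [0.5, 0.5, 0.5, 1.5, 1.5, 1.5, 0.5, 0.5, 0.5, 0.5]
```

This fine‑grained attribution is what lets the policy learn precisely which emissions earned the tool reward, instead of diluting it across the whole trajectory.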
Experimental Evaluation
DeepAgent was evaluated on eight benchmarks, spanning general tool-use tasks (e.g., ToolBench, TMDB) and downstream applications (ALFWorld, WebShop, GAIA).
TMDB: Success rate 89.0% vs. 55.0% for prior models.
ToolBench (≈16k tools): Success rate 64.0%, far surpassing traditional agents.
Downstream tasks: State‑of‑the‑art scores on ALFWorld, WebShop, and GAIA (e.g., GAIA score 53.3).
Ablation studies: Removing ToolPO training caused the largest performance drop; disabling memory folding hurt long‑term GAIA tasks; omitting the tool simulator or advantage attribution also degraded results.
Dynamic tool discovery: The “think‑while‑search” strategy consistently outperformed one‑shot retrieval, especially with very large tool libraries.
Scalability: Performance improved with larger reasoning backbones (30B and 235B parameters), indicating the framework's benefits hold across model scales.
Conclusion
DeepAgent’s unified reasoning flow, autonomous memory management, and efficient ToolPO training set a new benchmark for building general, powerful AI agents capable of coherent long‑range planning, dynamic tool use, and adaptation to complex real‑world environments.
Xiaohongshu Tech REDtech
The official account of the Xiaohongshu tech team, sharing technical innovations and engineering insights.