How DeepAgent Achieves End‑to‑End Reasoning with 16,000+ Scalable Tools

DeepAgent is an end‑to‑end reasoning agent that unifies autonomous thinking, dynamic tool search, and tool execution in a single workflow. It handles over 16,000 real APIs, supports embodied environments and research assistance, and achieves state‑of‑the‑art results across multiple benchmarks through a unified reasoning core, a memory‑folding mechanism, structured memory, and the ToolPO training framework.

Xiaohongshu Tech REDtech

Paper Overview

DeepAgent: A General Reasoning Agent with Scalable Toolsets (arXiv: https://arxiv.org/abs/2510.21618). Source code is available at https://github.com/Rednote-DeepExperience/DeepAgent. The agent integrates autonomous reasoning, dynamic tool discovery, and tool execution in a single coherent workflow.

Motivation

Traditional workflow constraints: Existing frameworks such as ReAct or Plan‑and‑Solve follow a fixed “think‑act‑observe” loop, lack global task awareness, and require pre‑specified tools.

Limited toolsets in current agents: Recent agents (e.g., Search‑o1, WebThinker) support only a few tools such as search or browsing, which is insufficient for real‑world diversity.

Key Capabilities

Massive tool handling: DeepAgent can autonomously search, filter, and invoke the most suitable tool from a library of over 16,000 RapidAPI services, even when some APIs are unavailable (simulated by an LLM).
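The tool-search step can be illustrated with a minimal sketch. Here a bag-of-words cosine score stands in for whatever retrieval model the system actually uses, and the tool names and descriptions are invented for illustration:

```python
from collections import Counter
import math

def _bow(text):
    # Bag-of-words vector for a short text description.
    return Counter(text.lower().split())

def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search_tools(query, tool_library, top_k=3):
    """Rank tool descriptions against the query; return the best-matching names."""
    q = _bow(query)
    scored = [(_cosine(q, _bow(desc)), name) for name, desc in tool_library.items()]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]

# Illustrative mini-library; the real system searches 16,000+ RapidAPI services.
tools = {
    "movie_db.search": "search movies by title year and genre",
    "weather.current": "get current weather for a city",
    "flight.book": "book a flight between two airports",
}
print(search_tools("find a movie by its title", tools))
```

In the real system the query itself is emitted mid-reasoning, so retrieval can be repeated with refined queries as the agent's understanding of the task improves.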

Embodied intelligence: In environments like ALFWorld, the agent uses a plug‑in action set (move, observe, pick) to complete goal‑directed tasks.

Research assistant: Equipped with web search, content extraction, code execution, visual QA, and file processing tools to provide comprehensive research support.

Core Design

Unified Autonomous Reasoning Core

The central LLM drives a continuous “thought chain”. When external interaction is required, it emits special commands such as <tool_search>…</tool_search> to locate tools and <tool_call>…</tool_call> to invoke them. An auxiliary LLM handles lengthy tool documentation and filters noisy tool outputs, keeping the main reasoning focused.
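A minimal sketch of how such command tokens might be parsed out of the reasoning stream. The tag names follow the article; the payload format and helper names are assumptions:

```python
import re

# Hypothetical command grammar: the reasoning stream embeds actions as
# <tool_search>query</tool_search>, <tool_call>payload</tool_call>, etc.
COMMAND_RE = re.compile(r"<(tool_search|tool_call|fold_thought)>(.*?)</\1>", re.DOTALL)

def extract_commands(stream: str):
    """Return (command, payload) pairs in the order the model emitted them."""
    return [(m.group(1), m.group(2).strip()) for m in COMMAND_RE.finditer(stream)]

output = (
    "I need a movie API. <tool_search>movie database search</tool_search> "
    'Found one. <tool_call>{"name": "movie_db.search", "args": {"title": "Dune"}}</tool_call>'
)
print(extract_commands(output))
```

An agent runtime would dispatch each extracted command (run the search, invoke the API) and splice the result back into the context before resuming generation.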

Memory Folding Mechanism

At any critical point the agent can issue a <fold_thought> command. An auxiliary model compresses the interaction history into a structured summary, reducing computation and allowing the agent to reconsider strategies for long‑horizon tasks.
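Mechanically, folding amounts to replacing older turns with a single summary entry while keeping recent context verbatim. In this sketch the `summarize` callable stands in for the auxiliary model, and the turn format is an assumption:

```python
def fold_history(history, summarize, keep_last=2):
    """Compress all but the most recent turns into one summary entry."""
    if len(history) <= keep_last:
        return history  # nothing worth folding yet
    older, recent = history[:-keep_last], history[-keep_last:]
    # In DeepAgent an auxiliary LLM produces this structured summary.
    return [{"role": "summary", "content": summarize(older)}] + recent

# Stub summarizer for demonstration: just count the folded turns.
history = [{"role": "agent", "content": f"step {i}"} for i in range(5)]
folded = fold_history(history, lambda turns: f"{len(turns)} earlier steps folded")
print(folded[0]["content"])
```

Because the fold is triggered by the agent itself rather than by a fixed context limit, it doubles as a deliberate "step back and reconsider" point for long-horizon tasks.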

Structured Memory System

Episodic Memory: Records high‑level milestones, key decisions, and major events for long‑term reflection.

Working Memory: Short‑term cache for current sub‑goals, immediate challenges, and next actions, ensuring continuity across folds.

Tool Memory: A self‑updating JSON‑based handbook that logs usage patterns, successes, and failures, enabling continual optimization of tool‑selection strategies.
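The three memory stores can be sketched as one small data structure; the field names and JSON shape below are illustrative, not the paper's schema:

```python
import json
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    episodic: list = field(default_factory=list)    # milestones, key decisions
    working: dict = field(default_factory=dict)     # current sub-goal, next action
    tool_stats: dict = field(default_factory=dict)  # per-tool success/failure log

    def record_tool_call(self, tool: str, ok: bool):
        stats = self.tool_stats.setdefault(tool, {"success": 0, "failure": 0})
        stats["success" if ok else "failure"] += 1

    def to_handbook(self) -> str:
        # Serialize tool memory as a JSON "handbook" the agent can re-read
        # when choosing tools on later tasks.
        return json.dumps(self.tool_stats, indent=2)

mem = AgentMemory()
mem.record_tool_call("movie_db.search", ok=True)
mem.record_tool_call("movie_db.search", ok=False)
print(mem.to_handbook())
```

Keeping tool memory as plain JSON makes it easy to inject back into the prompt after a memory fold, so lessons about flaky or misnamed APIs survive context compression.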

ToolPO Training Framework

DeepAgent is trained with Tool Policy Optimization (ToolPO), an end‑to‑end reinforcement learning method for universal tool use. Two innovations improve training stability:

LLM tool simulator: Replaces costly real‑API calls with a large model that mimics API responses, creating a stable, low‑cost training environment.

Dual‑advantage attribution: Combines a global reward based on final task success with a tool‑specific reward for correct API calls. Advantage‑based credit assignment ensures only the tokens that triggered a tool call receive the tool reward.
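The attribution rule reduces to per-token advantages: every token carries the global task advantage, and only the span that emitted the tool call additionally receives the tool-call advantage. A sketch with assumed advantage values:

```python
def token_advantages(tokens, tool_spans, task_adv, tool_adv):
    """Global advantage on every token; tool advantage only on tool-call tokens.

    tool_spans: list of (start, end) index ranges that emitted a tool call.
    """
    adv = [task_adv] * len(tokens)
    for start, end in tool_spans:
        for i in range(start, end):
            adv[i] += tool_adv
    return adv

tokens = ["I", "call", "<tool_call>", "api", "</tool_call>", "done"]
print(token_advantages(tokens, tool_spans=[(2, 5)], task_adv=0.5, tool_adv=1.0))
# tokens 2-4 receive 0.5 + 1.0 = 1.5; the rest keep the global 0.5
```

Concentrating the tool reward on the tokens that actually produced the call keeps credit assignment sharp instead of diluting it across the whole trajectory.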

Experimental Evaluation

DeepAgent was evaluated on eight benchmarks spanning general tool use (e.g., TMDB, ToolBench) and downstream applications (ALFWorld, WebShop, GAIA).

TMDB: Success rate 89.0% vs. 55.0% for prior models.

ToolBench (≈16k tools): Success rate 64.0%, far surpassing traditional agents.

Downstream tasks: State‑of‑the‑art scores on ALFWorld, WebShop, and GAIA (e.g., GAIA score 53.3).

Ablation studies: Removing ToolPO training caused the largest performance drop; disabling memory folding hurt long‑term GAIA tasks; omitting the tool simulator or advantage attribution also degraded results.

Dynamic tool discovery: The “think‑while‑search” strategy consistently outperformed one‑shot retrieval, especially with very large tool libraries.

Scalability: Performance improved with larger reasoning backbones (30B and 235B parameters), indicating that the approach generalizes across model scales.

Conclusion

DeepAgent’s unified reasoning flow, autonomous memory management, and efficient ToolPO training set a new benchmark for building general, powerful AI agents capable of coherent long‑range planning, dynamic tool use, and adaptation to complex real‑world environments.

Tags: AI agents · Tool Integration · reinforcement learning · general AI · deep reasoning · memory folding
Written by

Xiaohongshu Tech REDtech

Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.
