Agent-World: Scaling Real-World Environments for Co‑Evolving Agents and Their Worlds
Agent-World introduces a universal training arena that automatically mines real-world data from the internet to build over 1,900 diverse environments and 19,800 tools. It then generates long-horizon tasks through graph-based and programmatic synthesis, and runs a self-evolving loop in which agents are evaluated, their failures are diagnosed, and the environments are refined. The result is state-of-the-art performance on 23 benchmarks.
Motivation. Large language models can call hundreds of external tools, yet they still struggle with multi‑tool, long‑horizon tasks that require complex state management. Existing environment‑extension methods either rely on handcrafted tool databases or generate limited synthetic scenarios, leading to a gap between training and real‑world interaction.
Automatic environment mining. Agent-World deploys a Deep Research Agent that selects real-world themes (e.g., MCP servers, open-source tool documentation) and crawls the web to collect raw environment data. The agent iteratively refines the data through a search → browse → compile → validate cycle, expanding both the scale (1,978 final environments) and the structural realism of the environment database.
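The mining cycle above can be sketched as a small loop. This is a minimal illustration, not the paper's implementation: `EnvDraft`, `mine_environment`, and the four stage callables are hypothetical names, and the stub stages stand in for real web search, browsing, and validation.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EnvDraft:
    """Hypothetical container for one environment under construction."""
    theme: str
    sources: List[str] = field(default_factory=list)
    tools: List[str] = field(default_factory=list)
    valid: bool = False


def mine_environment(
    theme: str,
    search: Callable[[str], List[str]],
    browse: Callable[[str], str],
    compile_env: Callable[[List[str]], List[str]],
    validate: Callable[[EnvDraft], bool],
    max_rounds: int = 3,
) -> EnvDraft:
    """Repeat search -> browse -> compile -> validate until the draft passes."""
    draft = EnvDraft(theme=theme)
    for _ in range(max_rounds):
        # Search: collect new source URLs for the theme.
        for url in search(theme):
            if url not in draft.sources:
                draft.sources.append(url)
        # Browse: fetch the content behind every known source.
        pages = [browse(url) for url in draft.sources]
        # Compile: turn raw pages into a list of tool names.
        draft.tools = compile_env(pages)
        # Validate: stop as soon as the draft is usable.
        draft.valid = validate(draft)
        if draft.valid:
            break
    return draft


# Stub stages (assumptions for the demo only; no network access).
def fake_search(theme: str) -> List[str]:
    return [f"https://example.invalid/{theme}/docs"]


def fake_browse(url: str) -> str:
    return "tool: get_weather\ntool: get_forecast"


def fake_compile(pages: List[str]) -> List[str]:
    tools: List[str] = []
    for page in pages:
        for line in page.splitlines():
            if line.startswith("tool: "):
                name = line[len("tool: "):].strip()
                if name not in tools:
                    tools.append(name)
    return tools


def fake_validate(draft: EnvDraft) -> bool:
    return len(draft.tools) > 0


draft = mine_environment("weather", fake_search, fake_browse, fake_compile, fake_validate)
```

Passing the stages in as callables keeps the loop independent of any particular crawler or validator, which is a design choice of this sketch rather than something the source describes.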
Task synthesis. Two complementary pipelines generate evaluation tasks:
Graph‑based synthesis: a fully connected dependency graph of tools is built; random walks produce plausible tool‑call sequences, which are then turned into natural‑language questions and scored by a dedicated rubric.
Programmatic synthesis: an LLM writes Python scripts that solve a target problem; the script is executed to verify correctness, ensuring the task captures non‑linear reasoning.
Both pipelines produce long‑horizon tasks with an average of 15 interaction rounds, exposing planning, memory, and error‑recovery challenges.
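The graph-based pipeline can be illustrated with a random walk over a tool dependency graph. The sketch below makes several assumptions: the graph is a sparse directed adjacency list in which an edge a → b means that b can consume the output of a, the tool names are invented, and a tool is never revisited within one walk. Turning the walk into a natural-language question and scoring it with a rubric is not shown.

```python
import random
from typing import Dict, List


def random_walk(
    graph: Dict[str, List[str]],
    start: str,
    max_steps: int,
    rng: random.Random,
) -> List[str]:
    """Sample a tool-call sequence by following dependency edges."""
    path = [start]
    current = start
    for _ in range(max_steps - 1):
        # Only move to tools that depend on the current one and are unvisited.
        successors = [t for t in graph.get(current, []) if t not in path]
        if not successors:
            break
        current = rng.choice(successors)
        path.append(current)
    return path


# Hypothetical travel-booking tools; edges point from producer to consumer.
graph = {
    "search_flights": ["get_fare", "check_seats"],
    "get_fare": ["book_flight"],
    "check_seats": ["book_flight"],
    "book_flight": ["send_receipt"],
    "send_receipt": [],
}

walk = random_walk(graph, "search_flights", max_steps=5, rng=random.Random(0))
print(" -> ".join(walk))
```

Every walk from `search_flights` in this toy graph passes through `book_flight` and ends at `send_receipt`, so each sampled sequence respects the dependencies between tools.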
Hierarchical environment taxonomy. The mined environments are clustered into a three‑level taxonomy (20 top‑level, 50 mid‑level, 1,978 leaf environments). This taxonomy enables balanced sampling across diverse domains and supports systematic difficulty scaling.
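One way to picture balanced sampling over such a taxonomy is to draw the same number of leaf environments from every top-level category. This is a sketch under assumptions: the source does not say at which level balancing is applied, and the category and environment names below are invented.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

# Each leaf is (top-level category, mid-level category, environment name).
Leaf = Tuple[str, str, str]


def balanced_sample(leaves: List[Leaf], per_top: int, rng: random.Random) -> List[str]:
    """Draw up to `per_top` environments from each top-level category."""
    by_top: Dict[str, List[str]] = defaultdict(list)
    for top, _mid, name in leaves:
        by_top[top].append(name)
    batch: List[str] = []
    for top in sorted(by_top):  # sorted for a reproducible category order
        pool = by_top[top]
        k = min(per_top, len(pool))
        batch.extend(rng.sample(pool, k))
    return batch


# Hypothetical taxonomy: "finance" has many more leaves than "travel".
leaves = [
    ("finance", "banking", "bank_api"),
    ("finance", "banking", "ledger"),
    ("finance", "trading", "broker"),
    ("finance", "trading", "quotes"),
    ("finance", "tax", "tax_calc"),
    ("travel", "flights", "flight_search"),
    ("travel", "hotels", "hotel_booking"),
]

batch = balanced_sample(leaves, per_top=2, rng=random.Random(0))
```

Sampling uniformly over all leaves would favor `finance` five to two here. Sampling per category gives each domain equal weight in the batch.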
Self-evolution training loop. After each training epoch, a balanced batch of new environments is sampled and new tasks are synthesized. Agents are evaluated on these tasks, and a diagnostic module analyses failure trajectories to rank weak environments. Targeted tasks are then generated for the identified weaknesses, and the environment database is enriched accordingly. This loop (train → evaluate → diagnose → generate → retrain) creates a closed-loop co-evolution of agents and environments.
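The diagnose step can be approximated by ranking environments by failure rate. This is a deliberately simple stand-in: the source describes a diagnostic module that analyses failure trajectories, whereas this sketch only counts pass/fail outcomes. The function name and the environment names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_weak_environments(
    results: List[Tuple[str, bool]],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Return the `top_k` environments with the highest failure rate."""
    attempts: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for env, success in results:
        attempts[env] += 1
        if not success:
            failures[env] += 1
    rates = {env: failures[env] / attempts[env] for env in attempts}
    # Highest failure rate first; ties are broken alphabetically.
    ranked = sorted(rates.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Hypothetical evaluation log: (environment, task solved?).
results = [
    ("calendar", True),
    ("calendar", False),
    ("git", False),
    ("git", False),
    ("maps", True),
    ("maps", True),
]

weak = rank_weak_environments(results, top_k=2)
print(weak)  # [('git', 1.0), ('calendar', 0.5)]
```

The ranked list would then drive the generate step, with new targeted tasks synthesized for `git` and `calendar` before the next training round.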
Experimental setup and benchmarks. Agent-World was evaluated on 23 benchmark suites, including tool use (MCP-Mark, MCP-Universe, BFCL V4, τ²-Bench), general reasoning (MATH500, GSM8K, AIME, OlympiadBench), agentic search and software engineering (WebWalkerQA, SWE-Bench, Terminal-Bench, GAIA, HLE), and knowledge (MMLU, SuperGPQA). Baselines included state-of-the-art closed-source models (GPT-5.2 High, Claude Sonnet-4.5, Seed-2.0), leading open-source models (DeepSeek-V3.2-685B, Qwen3-235B-A22B), and existing environment-extension methods (EnvScaler, AWM, ScaleEnv). Agent-World-8B/14B consistently outperformed all baselines, e.g., achieving 55.8% on MCP-Mark versus 50% for the strongest open-source baseline, and surpassing the 685B DeepSeek model with at most about 2% of its parameters (14B versus 685B).
Scaling analysis. Performance improves monotonically with the number of training environments: from 0 to ~2,000 environments, reward curves rise steadily while policy entropy remains stable, indicating sustained exploration. Similarly, two rounds of the self‑evolution loop yield consistent gains across all benchmark groups, confirming that targeted data generation driven by diagnostic feedback is an effective mechanism for continual improvement.
Conclusions and outlook. Agent‑World demonstrates that (1) high‑fidelity, automatically mined environments are essential for training general agents; (2) a self‑evolving training arena that couples evaluation, diagnosis, and environment augmentation drives scalable performance gains; and (3) scaling environment diversity, task difficulty, and evolution rounds together forms a promising pathway toward truly general interactive AI. Future work will explore richer multimodal environments and tighter integration of training algorithms with the evolving ecosystem.