Why LLM Tool‑Calling Benchmarks Miss Real Users: Introducing WildToolBench
WildToolBench reveals that existing LLM tool‑calling benchmarks overlook real‑world user behavior. A comprehensive evaluation of 58 models shows that even the strongest agents achieve less than 15% session‑level accuracy, exposing a wide gap between reported performance and practical usability.
Current large‑language‑model (LLM) tool‑calling benchmarks often assume idealized user queries and perfect experimental conditions, creating a large gap between reported scores and real user experience. To address this, the authors introduce WildToolBench, a benchmark built on authentic “wild” user interaction patterns.
Wild behavior patterns
Three core patterns are identified:
Composite tasks that require coordinated orchestration of multiple tools across several steps.
Hidden intents dispersed over multiple dialogue turns, demanding contextual inference.
Frequent switching between task queries, clarification, and casual conversation, forcing the LLM to adapt its strategy in real time.
Using these patterns, a dataset of 256 scenarios covering 1,024 individual tasks was constructed.
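To make the three patterns concrete, here is a small hypothetical scenario in their spirit; the field names, tools, and dialogue below are illustrative assumptions, not WildToolBench's actual data schema.

```python
# A hypothetical "wild" multi-turn scenario illustrating the three patterns.
# Field names and tools are assumptions for illustration only.
scenario = {
    "tools": ["search_flights", "book_hotel", "get_weather"],  # tools exposed to the agent
    "turns": [
        # Turn 1: casual chit-chat, no tool call expected (instruction switching)
        {"user": "Hey, I finally got my vacation approved!", "expected": "chat"},
        # Turn 2: partial intent, key constraints still missing (hidden intent)
        {"user": "I'm thinking Tokyo, sometime next month.", "expected": "clarify"},
        # Turn 3: composite task that requires coordinating several tools
        {"user": "Make it the 12th to the 19th. Find flights, a hotel near Shinjuku, "
                 "and tell me if I should pack an umbrella.",
         "expected": ["search_flights", "book_hotel", "get_weather"]},
    ],
}
```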
Resources
The benchmark, evaluation scripts, and data synthesis framework are fully open‑source:
Repository: https://github.com/yupeijei1997/WildToolBench
Paper: https://openreview.net/forum?id=yz7fL5vfpn
Evaluation on 58 LLMs
An extensive study evaluated 58 models, including closed‑source systems (Gemini, Claude, GPT series) and open‑source models (GLM‑4.5, Kimi‑K2). The highest session‑level accuracy observed was under 15%, and even the best models achieved below 60% task accuracy, overturning the optimistic view of current tool‑calling capabilities.
Key findings
Closed‑source models generally outperform open‑source ones, though top open‑source models are narrowing the gap.
Specialized tool‑calling models often lag behind generalist models because of limited generalization.
Models with stronger reasoning abilities excel in coordinated tool use and intent inference, indicating reasoning as a critical lever for improvement.
Challenge One: Planning tool‑calling topologies for composite tasks
Real‑world requests frequently involve multiple tools and steps. LLMs must not only select the appropriate tools but also schedule execution order, timing, and priority, adapting dynamically based on intermediate results. In the most complex mixed‑tool scenarios, the best observed task accuracy is only 25% and the optimal path selection rate stays below 43%.
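One way to picture the topology problem is as a dependency graph over tool calls: independent calls can be issued immediately, while dependent calls must wait for upstream results. The sketch below is a minimal illustration of that scheduling idea, not the benchmark's scoring logic; the tool names and the `call_tool` callable are assumptions.

```python
from graphlib import TopologicalSorter

# Minimal sketch: execute a composite task as a tool-call dependency graph.
# `call_tool(name, inputs)` is a hypothetical executor for a single tool.
def run_plan(dependencies: dict[str, set[str]], call_tool) -> dict[str, object]:
    """Run tool calls in dependency order; calls in the same ready batch are independent."""
    results: dict[str, object] = {}
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    while sorter.is_active():
        for tool in sorter.get_ready():          # tools whose inputs are all available
            upstream = {dep: results[dep] for dep in dependencies.get(tool, set())}
            results[tool] = call_tool(tool, upstream)
            sorter.done(tool)
    return results

# Example topology: weather and flight search are independent; booking needs the flight result.
plan = {
    "get_weather": set(),
    "search_flights": set(),
    "book_hotel": {"search_flights"},
}
```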
Challenge Two: Inferring hidden intentions across multi‑turn dialogues
Users often reveal requirements gradually, embedding core intents within the conversation context. LLMs must continuously capture salient information, integrate it across turns, and infer unstated goals—a capability where current models frequently fall short.
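A minimal sketch of what "integrating information across turns" amounts to, assuming a hypothetical `extract_slots()` helper that pulls structured fields from each user utterance (this is not the paper's method):

```python
# Accumulate hidden intent across turns instead of treating each utterance in isolation.
def accumulate_intent(turns: list[str], extract_slots) -> dict:
    """Merge constraints revealed turn by turn; later turns refine or complete earlier ones."""
    intent: dict = {}
    for turn in turns:
        intent.update(extract_slots(turn))  # e.g. turn 1 -> {"destination": "Tokyo"},
                                            #      turn 3 -> {"dates": ("06-12", "06-19")}
    return intent
# Only the merged intent, not any single utterance, is complete enough to drive tool calls.
```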
Challenge Three: Real‑time strategy adjustment amid instruction switching
Users switch between task queries, clarification questions, and casual chit‑chat, demanding that LLMs seamlessly toggle between tool‑driven responses and pure knowledge answers while preserving the conversation thread. This flexibility is essential, yet task accuracy can drop by up to 30% when instruction types change abruptly.
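In practice this toggling is a per‑turn dispatch decision. The sketch below uses the OpenAI‑style chat API with `tool_choice="auto"` purely as an illustration; the model name, `tool_schemas`, `history`, and the `dispatch()` helper are placeholders, and WildToolBench itself is model‑agnostic.

```python
from openai import OpenAI

client = OpenAI()

# Per-turn strategy switching: let the model choose between calling a tool
# and answering directly, while the shared `history` preserves the thread.
response = client.chat.completions.create(
    model="gpt-4o",                      # placeholder model name
    messages=history + [{"role": "user", "content": user_turn}],
    tools=tool_schemas,                  # hypothetical tool schema list
    tool_choice="auto",
)
msg = response.choices[0].message
if msg.tool_calls:                       # task turn: execute the requested tools
    results = [dispatch(call) for call in msg.tool_calls]   # dispatch() is hypothetical
else:                                    # clarification or chit-chat: plain reply
    reply = msg.content
```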
Benchmark contributions
Beyond providing a tougher evaluation, WildToolBench offers structured dimensions and detailed error analyses (e.g., wrong tool selection, redundant calls). These insights give developers a concrete roadmap for improving LLM agents and help enterprises assess the true readiness of AI assistants for real‑world deployment.
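The gap between the headline numbers is easier to see with the two metrics side by side. The sketch below assumes session accuracy means every task in a session is handled correctly; the data structures are illustrative, not the benchmark's actual evaluator.

```python
# Task-level vs. session-level accuracy under the assumption that a session
# counts only if all of its tasks are correct (hypothetical scoring sketch).
def score(sessions: list[list[bool]]) -> tuple[float, float]:
    tasks = [ok for session in sessions for ok in session]
    task_acc = sum(tasks) / len(tasks)
    session_acc = sum(all(session) for session in sessions) / len(sessions)
    return task_acc, session_acc

# With 1,024 tasks over 256 scenarios (about 4 tasks per session), a model can be
# right on ~60% of tasks yet complete far fewer than 15% of sessions end to end.
```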
Authors: Peijie Yu, Wei Liu, Yifan Yang, Jinjian Li, Zelong Zhang, Xiao Feng, Feng Zhang
Affiliations: Tencent HY, King’s College London
Paper: https://openreview.net/forum?id=yz7fL5vfpn
GitHub: https://github.com/yupeijei1997/WildToolBench