Why LLM Tool‑Calling Benchmarks Miss Real Users: Introducing WildToolBench

WildToolBench reveals that existing LLM tool‑calling benchmarks overlook real‑world user behavior: in a comprehensive evaluation of 58 models, even the strongest agents achieve less than 15% session‑level accuracy, exposing a substantial gap between reported scores and practical usability.

PaperAgent

Current large‑language‑model (LLM) tool‑calling benchmarks often assume idealized user queries and perfect experimental conditions, creating a large gap between reported scores and real user experience. To address this, the authors introduce WildToolBench, a benchmark built on authentic “wild” user interaction patterns.

Wild behavior patterns

Three core patterns are identified:

Composite tasks that require coordinated orchestration of multiple tools across several steps.

Hidden intents dispersed over multiple dialogue turns, demanding contextual inference.

Frequent switching between task queries, clarification, and casual conversation, forcing the LLM to adapt its strategy in real time.

Using these patterns, a dataset of 256 scenarios covering 1,024 individual tasks was constructed.

Resources

The benchmark, evaluation scripts, and data synthesis framework are fully open‑source:

Repository: https://github.com/yupeijei1997/WildToolBench

Paper: https://openreview.net/forum?id=yz7fL5vfpn

Evaluation on 58 LLMs

An extensive study evaluated 58 models, including closed‑source systems (Gemini, Claude, GPT series) and open‑source models (GLM‑4.5, Kimi‑K2). The highest session‑level accuracy observed was under 15%, and even the strongest models stayed below 60% task‑level accuracy, challenging the optimistic view of current tool‑calling capabilities.
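The difference between the two metrics explains why session accuracy is so much lower: a session counts as correct only if every task inside it succeeds. A minimal sketch (the data layout and function names are illustrative, not the benchmark's actual scoring code):

```python
# Illustrative sketch of task- vs session-level accuracy (hypothetical
# data layout; WildToolBench's real evaluation scripts may differ).

def task_accuracy(sessions):
    """Fraction of individual tasks solved, pooled across all sessions."""
    results = [ok for session in sessions for ok in session]
    return sum(results) / len(results)

def session_accuracy(sessions):
    """Fraction of sessions in which *every* task was solved."""
    return sum(all(session) for session in sessions) / len(sessions)

# Each session is a list of per-task pass/fail booleans.
sessions = [
    [True, True, True, True],   # fully correct session
    [True, True, False, True],  # one failed task sinks the whole session
    [True, False, False, True],
]

print(task_accuracy(sessions))     # 9/12 = 0.75
print(session_accuracy(sessions))  # 1/3
```

Because failures compound multiplicatively across a session's tasks, respectable per‑task numbers can still yield very low session‑level scores.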

Key findings

Closed‑source models generally outperform open‑source ones, though top open‑source models are narrowing the gap.

Specialized tool‑calling models often lag behind generalist models because of limited generalization.

Models with stronger reasoning abilities excel in coordinated tool use and intent inference, indicating reasoning as a critical lever for improvement.

Challenge One: Planning tool‑calling topologies for composite tasks

Real‑world requests frequently involve multiple tools and steps. LLMs must not only select the appropriate tools but also schedule execution order, timing, and priority, adapting dynamically based on intermediate results. In the most complex mixed‑tool scenarios, the best observed task accuracy is only 25% and the optimal path selection rate stays below 43%.
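One way to picture the scheduling half of this challenge is as topological ordering over a dependency graph of tool calls. The sketch below uses Python's standard `graphlib`; the tool names and dependencies are invented for illustration, not taken from the benchmark:

```python
# Sketch: scheduling a composite task as a tool-call DAG. Tools whose
# inputs are ready can run in parallel; dependent tools must wait.
# (Tool names and edges are hypothetical.)
from graphlib import TopologicalSorter

# Each tool maps to the set of tools whose outputs it depends on.
dependencies = {
    "search_flights": set(),
    "search_hotels": set(),
    "compare_prices": {"search_flights", "search_hotels"},
    "book_trip": {"compare_prices"},
}

ts = TopologicalSorter(dependencies)
ts.prepare()
plan = []
while ts.is_active():
    # All tools whose dependencies are satisfied form one parallel batch.
    batch = sorted(ts.get_ready())
    plan.append(batch)
    ts.done(*batch)

print(plan)
# [['search_flights', 'search_hotels'], ['compare_prices'], ['book_trip']]
```

In the benchmark's hard cases the model must infer such a graph from the user's request alone, and revise it mid‑execution when an intermediate result changes what is needed next.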

Challenge Two: Inferring hidden intentions across multi‑turn dialogues

Users often reveal requirements gradually, embedding core intents within the conversation context. LLMs must continuously capture salient information, integrate it across turns, and infer unstated goals—a capability where current models frequently fall short.
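The mechanics can be pictured as slot filling across turns: each utterance contributes only a fragment, and the intent becomes actionable only once all required pieces are present. In the toy sketch below, naive regexes stand in for the LLM's inference, and the slot names are invented:

```python
# Toy sketch of accumulating a hidden intent across dialogue turns.
# (Slot names and extraction rules are hypothetical; a real agent would
# rely on the LLM itself, not keyword patterns.)
import re

REQUIRED = {"destination", "date", "budget"}

def extract_slots(utterance):
    """Very naive pattern matching standing in for contextual inference."""
    slots = {}
    if m := re.search(r"to (\w+)", utterance):
        slots["destination"] = m.group(1)
    if m := re.search(r"on (\w+ \d+)", utterance):
        slots["date"] = m.group(1)
    if m := re.search(r"under \$(\d+)", utterance):
        slots["budget"] = int(m.group(1))
    return slots

state = {}
turns = [
    "I'm thinking about a trip to Tokyo",
    "probably on June 12",
    "oh, and keep it under $2000",
]
for turn in turns:
    state.update(extract_slots(turn))

print(REQUIRED <= state.keys())  # True: intent is now fully specified
print(state)
```

The hard part the benchmark probes is exactly what this toy version omits: deciding which fragments matter, resolving contradictions between turns, and knowing when the goal is complete enough to act on.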

Challenge Three: Real‑time strategy adjustment amid instruction switching

Users switch between task queries, clarification questions, and casual chit‑chat, demanding that LLMs seamlessly toggle between tool‑driven responses and pure knowledge answers while preserving the conversation thread. This flexibility is essential, yet task accuracy can drop by up to 30% when instruction types change abruptly.
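The routing decision itself can be sketched as a per‑turn dispatcher; here a keyword heuristic stands in for the LLM's own classification, and all labels and example turns are invented:

```python
# Minimal dispatcher sketch: route each turn to tool calling,
# context-based clarification, or plain conversation.
# (The keyword classifier is a hypothetical stand-in for the LLM.)

def classify_turn(utterance):
    text = utterance.lower()
    if text.endswith("?") and any(w in text for w in ("what do you mean", "which one")):
        return "clarification"
    if any(w in text for w in ("book", "search", "schedule", "convert")):
        return "task"
    return "chitchat"

def handle_turn(utterance):
    kind = classify_turn(utterance)
    if kind == "task":
        return f"[tool call issued for: {utterance!r}]"
    if kind == "clarification":
        return "[answered from conversation context, no tool call]"
    return "[conversational reply, no tool call]"

for turn in ["search flights to Osaka", "which one is cheaper?", "nice weather today"]:
    print(classify_turn(turn), "->", handle_turn(turn))
```

The benchmark's finding suggests that models often mis-route exactly this decision: invoking tools for chit‑chat, or answering a task query from parametric knowledge when a tool call was required.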

Benchmark contributions

Beyond providing a tougher evaluation, WildToolBench offers structured dimensions and detailed error analyses (e.g., wrong tool selection, redundant calls). These insights give developers a concrete roadmap for improving LLM agents and help enterprises assess the true readiness of AI assistants for real‑world deployment.

Authors: Peijie Yu, Wei Liu, Yifan Yang, Jinjian Li, Zelong Zhang, Xiao Feng, Feng Zhang
Affiliations: Tencent HY, King’s College London
Paper: https://openreview.net/forum?id=yz7fL5vfpn
GitHub: https://github.com/yupeijei1997/WildToolBench
Tags: LLM, benchmark, evaluation, Agentic AI
Written by PaperAgent
Daily updates, analyzing cutting-edge AI research papers