Introducing VitaBench: A Real-World Benchmark for Complex LLM Agents

VitaBench is a newly released, highly realistic benchmark that evaluates large‑language‑model agents across three everyday scenarios: food delivery, restaurant dining, and travel planning. It quantifies task difficulty along three dimensions (reasoning, tool use, and interaction) and reveals a significant performance gap in current models.

DataFunTalk

VitaBench Overview

Meituan’s LongCat team announced VitaBench (Versatile Interactive Tasks Benchmark), a large‑scale, life‑service‑oriented benchmark that tests LLM‑based agents on three high‑frequency real‑world scenarios: food delivery, restaurant dining, and travel planning.

Key Challenges in Existing Benchmarks

Tool ecosystem simplification: Prior benchmarks focus on single API calls, ignoring complex tool dependencies.

Insufficient information density: They lack multi‑source data such as temporal, commonsense, and user profile information.

Limited model exploration: Long policy documents constrain the agent's autonomous reasoning, so these benchmarks mostly measure long‑instruction compliance.

Missing interaction dynamics: Realistic user behavior, intent shifts, and multi‑turn dialogue are rarely modeled.

Three Dimensions of Task Complexity

Reasoning complexity: Requires integrating multi‑source information and planning task execution paths.

Tool complexity: Involves navigating a densely connected tool graph where nodes are APIs and edges represent dependencies.

Interaction complexity: Demands multi‑turn dialogue, intent clarification, and adaptive feedback.

Benchmark Construction

VitaBench builds a partially observable Markov decision process (POMDP) environment with 66 tools across the three scenarios, encoded as a directed graph. Tools are implemented as Python functions. Each task includes a unique user persona simulated by a GPT‑4.1‑based user simulator.
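To make the idea of a tool graph concrete, here is a minimal sketch of tools written as Python functions with their dependencies encoded as a directed graph. The four tool names, their signatures, and the dependency edges are hypothetical and chosen only for illustration; they are not taken from VitaBench's actual 66 tools.

```python
from graphlib import TopologicalSorter

# Hypothetical tools, written as plain Python functions.
def search_restaurants(city: str) -> list[str]:
    return [f"{city}-bistro", f"{city}-noodle-house"]

def get_menu(restaurant_id: str) -> list[str]:
    return ["noodles", "dumplings"]

def create_order(restaurant_id: str, items: list[str]) -> dict:
    return {"order_id": "demo-1", "restaurant": restaurant_id, "items": items}

def pay_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "paid"}

# Nodes of the graph: tool name -> callable.
TOOLS = {f.__name__: f for f in (search_restaurants, get_menu, create_order, pay_order)}

# Directed edges (assumed): each tool maps to the tools that must run before it.
DEPENDS_ON = {
    "search_restaurants": set(),
    "get_menu": {"search_restaurants"},
    "create_order": {"get_menu"},
    "pay_order": {"create_order"},
}

def prerequisites(tool: str) -> set[str]:
    """Return every tool that must run, directly or transitively, before `tool`."""
    seen: set[str] = set()
    stack = list(DEPENDS_ON[tool])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(DEPENDS_ON[current])
    return seen

def valid_call_order() -> list[str]:
    """Return one call order that respects every dependency edge."""
    return list(TopologicalSorter(DEPENDS_ON).static_order())

if __name__ == "__main__":
    print(valid_call_order())
    print(sorted(prerequisites("pay_order")))
```

In this toy graph an agent cannot pay for an order before it has searched, read a menu, and created the order. A larger and denser graph multiplies the number of such constraints the agent has to discover and respect.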

Task Design Process

Define 66 core APIs and their dependencies.

Construct the tool graph and encode domain rules.

Develop a user simulator that generates fuzzy, personalized requests.

Two‑stage task creation then adds user profiles, composite instructions, and verified environment data, resulting in 400 evaluation tasks (300 single‑scene, 100 cross‑scene).

Evaluation Methodology

All models use function‑call‑based agent architectures with official tool‑call formats.

User simulator runs on GPT‑4.1; evaluator runs on Claude‑3.7‑Sonnet.

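Each task is executed four times at temperature 0.0, and three metrics are computed: Avg@4 (mean success rate over the four runs), Pass@4 (at least one of the four runs succeeds), and Pass^4 (all four runs succeed).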
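The three metrics can be sketched in a few lines. The version below assumes the common reading when k equals the number of runs: Avg@k is the mean per-run success rate, Pass@k counts a task as solved if any run succeeds, and Pass^k counts it only if every run succeeds. The function names and the toy results are illustrative, and the benchmark's own implementation may differ in detail.

```python
def avg_at_k(runs: list[list[bool]]) -> float:
    """Mean success rate over all runs, averaged across tasks."""
    per_task = [sum(task) / len(task) for task in runs]
    return sum(per_task) / len(per_task)

def pass_at_k(runs: list[list[bool]]) -> float:
    """Fraction of tasks where at least one run succeeded."""
    return sum(any(task) for task in runs) / len(runs)

def pass_hat_k(runs: list[list[bool]]) -> float:
    """Fraction of tasks where every run succeeded (Pass^k)."""
    return sum(all(task) for task in runs) / len(runs)

# Toy data: four tasks, four runs each (True = task solved in that run).
results = [
    [True, True, True, True],
    [True, False, False, True],
    [False, False, False, False],
    [False, True, False, False],
]

if __name__ == "__main__":
    print(avg_at_k(results))    # 0.4375
    print(pass_at_k(results))   # 0.75
    print(pass_hat_k(results))  # 0.25
```

A wide gap between Pass@4 and Pass^4, as in this toy data, means a model can solve a task on some runs but not reliably.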

Results

Even top‑performing models achieve only ~30% success on cross‑scene tasks (vs 48.3% on single‑scene). Allowing multiple attempts raises Pass@4 to about 60%, but Pass^4 remains near zero, which indicates that models are unstable from run to run. “Thinking” models with chain‑of‑thought reasoning outperform non‑thinking models by 5–8 percentage points and require fewer interaction turns.

Ablation Studies

Reasoning complexity negatively correlates with success rate; cross‑scene tasks have ~10 reasoning points.

Tool graph size and density increase difficulty; cross‑scene tasks involve 66 tools and 512 edges.

Interaction complexity: removing user simulation boosts performance; realistic user behavior drops success by 15–25%.
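As a rough worked example of what "size and density" mean for the cross‑scene tool graph, the sketch below applies the standard density definition for a directed graph without self‑loops, edges / (nodes × (nodes − 1)), to the 66 tools and 512 edges quoted above. Using this particular definition is an assumption; the benchmark may compute density differently.

```python
def directed_density(num_nodes: int, num_edges: int) -> float:
    """Share of all possible directed edges (no self-loops) that are present."""
    if num_nodes < 2:
        raise ValueError("density needs at least two nodes")
    return num_edges / (num_nodes * (num_nodes - 1))

def mean_out_degree(num_nodes: int, num_edges: int) -> float:
    """Average number of outgoing dependency edges per tool."""
    return num_edges / num_nodes

if __name__ == "__main__":
    print(round(directed_density(66, 512), 4))  # 0.1193
    print(round(mean_out_degree(66, 512), 2))   # 7.76
```

Under this definition roughly 12% of all possible tool‑to‑tool dependencies are present, or close to eight dependency edges per tool on average.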

Component Validation

User simulator scores 9.48/10 on information fidelity and 9.34/10 on persona consistency. The sliding‑window Rubric evaluator achieves Cohen’s κ = 0.828 versus human annotations, outperforming baselines without Rubric or sliding windows.
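For readers unfamiliar with Cohen's κ, the sketch below computes it for two sets of labels using the standard formula κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The label lists are made‑up toy data, not VitaBench annotations, and the function name is illustrative.

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both label lists must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Toy data: 1 = task judged successful, 0 = judged failed.
human = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
judge = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]

if __name__ == "__main__":
    print(round(cohens_kappa(human, judge), 3))  # 0.6
```

Here the two annotators agree on 8 of 10 items (p_o = 0.8) while chance agreement is 0.5, giving κ = 0.6. The reported 0.828 therefore reflects agreement well beyond what chance would produce.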

Failure Analysis

Typical errors are: reasoning errors (61.8%), tool errors (21.1%), and interaction errors (7.9%). Models often miss temporal/commonsense details, abandon tasks due to self‑doubt, or repeat ineffective actions when tool calls fail or user intent is ambiguous.

Conclusion

VitaBench provides a comprehensive framework for measuring agentic task complexity, quantifying the impact of reasoning, tool use, and interaction on performance, and highlighting the gap between current LLM agents and real‑world application needs. The benchmark, dataset, code, and leaderboard are fully open‑source.

