Introducing VitaBench: A Real-World Agent Benchmark That Reveals a 30% Success Gap

VitaBench, a new open‑source benchmark from Meituan's LongCat team, evaluates LLM‑driven agents in three realistic life‑service scenarios: food delivery ordering, restaurant dining, and travel planning. It provides an interactive environment with 66 tools, quantifies reasoning, tool, and interaction complexity, and shows that even the strongest models reach only about a 30% success rate on complex cross‑scene tasks.

Meituan Technology Team

VitaBench Overview

Meituan’s LongCat team has released VitaBench (Versatile Interactive Tasks Benchmark), an open‑source evaluation suite that closely mirrors real‑world life‑service scenarios. It focuses on three high‑frequency domains—food delivery ordering, restaurant dining, and travel planning—by providing an interactive environment with 66 tools.
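
For readers unfamiliar with function‑call tooling, the sketch below shows what one such tool definition could look like in the common JSON‑schema function‑call style. The tool name and parameters are illustrative assumptions, not the actual VitaBench definitions.

```python
# Illustrative sketch only: the tool name, fields, and parameters below are
# assumptions for exposition, not VitaBench's actual tool definitions.
search_restaurants_tool = {
    "type": "function",
    "function": {
        "name": "search_restaurants",  # hypothetical tool name
        "description": "Search restaurants near a location, filtered by cuisine and budget.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City or neighborhood"},
                "cuisine": {"type": "string", "description": "e.g. hotpot, Sichuan"},
                "max_price_per_person": {"type": "number", "description": "Budget per person"},
            },
            "required": ["location"],
        },
    },
}
```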

Three Core Complexity Dimensions

The benchmark quantifies agent performance along three dimensions: deep reasoning, tool usage, and user interaction. Even state‑of‑the‑art models achieve only about 30% success on the main (complex cross‑scene) leaderboard, highlighting a large gap between current agents and real‑world needs.

Challenges in Existing Benchmarks

Simplified tool ecosystems: Prior benchmarks evaluate single API calls without modeling complex tool dependencies.

Insufficient information density: They ignore multi‑source data such as temporal, commonsense, and user‑profile information.

Limited model exploration: Long policy documents constrain model autonomy and reduce tasks to long‑text instruction compliance.

Lack of interaction dynamics: User behavior diversity, intent shifts, and multi‑turn dialogue are rarely considered.

Benchmark Construction

VitaBench treats agent‑environment‑user interaction as a Partially Observable Markov Decision Process (POMDP) and decomposes task difficulty into:

Reasoning complexity (𝒞_reason): observation space size, partial observability ratio, and number of reasoning points.

Tool complexity (𝒞_tool): tool graph size/density and tool‑call chain length.

Interaction complexity (𝒞_interact): user persona, behavior attributes, and dynamic intent evolution.

Each task involves 5‑20 service providers and up to 100 candidate products, aggregating multiple user demands into a rich search and reasoning space.
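
One informal way to picture this decomposition (not the paper's exact formulation) is to attach a bundle of measurable factors to each task for every dimension. The field names below are assumptions chosen to mirror the factors listed above.

```python
from dataclasses import dataclass

# Illustrative sketch of the three complexity dimensions; field names are
# assumptions for exposition, not the paper's exact formulas.

@dataclass
class ReasoningComplexity:              # C_reason
    observation_space_size: int         # how many items/facts the agent can observe
    partial_observability_ratio: float  # fraction of task-relevant state hidden from the agent
    num_reasoning_points: int           # temporal/commonsense/profile constraints to combine

@dataclass
class ToolComplexity:                   # C_tool
    num_tools: int                      # nodes in the tool dependency graph
    num_dependencies: int               # edges in the graph
    max_call_chain_length: int          # longest required tool-call chain

@dataclass
class InteractionComplexity:            # C_interact
    persona_attributes: int             # richness of the simulated user's persona
    behavior_attributes: int            # e.g. vagueness, proactivity
    intent_shifts: int                  # dynamic intent changes over the dialogue

@dataclass
class TaskComplexity:
    reason: ReasoningComplexity
    tool: ToolComplexity
    interact: InteractionComplexity
```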

Two‑Stage Development Process

Stage 1 – Framework Design: Define 66 simplified yet functional API tools, construct a directed dependency graph over them (a sketch follows below), and implement a GPT‑4.1‑based user simulator that generates diverse, fuzzy user requests.

Stage 2 – Task Creation: Synthesize user personas, craft composite task instructions, augment environment data, and establish fine‑grained evaluation rubrics. The benchmark includes 300 single‑scene and 100 cross‑scene tasks.
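
The directed dependency graph from Stage 1 encodes which tool outputs feed later calls. The sketch below, using made‑up tool names rather than VitaBench's actual 66 tools, shows one simple way such a graph and a call‑chain validity check could be represented.

```python
# Minimal sketch of a tool dependency graph with hypothetical tool names.
# An edge A -> B means tool B consumes an identifier or result produced by tool A.
TOOL_DEPENDENCIES: dict[str, list[str]] = {
    "search_restaurants": ["get_restaurant_menu", "book_table"],
    "get_restaurant_menu": ["place_food_order"],
    "book_table": [],
    "place_food_order": ["track_order"],
    "track_order": [],
}

def is_valid_call_chain(chain: list[str]) -> bool:
    """Check that each call in the chain depends on the output of the previous one."""
    for prev, nxt in zip(chain, chain[1:]):
        if nxt not in TOOL_DEPENDENCIES.get(prev, []):
            return False
    return True

# e.g. is_valid_call_chain(["search_restaurants", "get_restaurant_menu", "place_food_order"]) -> True
```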

Evaluation Methodology

All models use function‑call based agent architectures with official tool‑call formats.

User simulator powered by GPT‑4.1; evaluator built on Claude‑3.7‑Sonnet.

Each task is run four times (temperature 0.0) to compute Avg@4, Pass@4, and Pass^4 metrics.

Models are split into reasoning and non‑reasoning groups; hybrid models are evaluated in both modes.
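
Given the four attempts per task described above, the three metrics can be computed from per‑run pass/fail outcomes. The helper below follows the usual reading of these metrics (mean success, at‑least‑one success, all‑attempts success); it is a plausible sketch, not the benchmark's official scoring code.

```python
from statistics import mean

def vitabench_metrics(task_runs: list[list[bool]]) -> dict[str, float]:
    """Compute Avg@k, Pass@k, and Pass^k from per-task attempt outcomes.

    task_runs[i] holds the pass/fail results of the k (here 4) independent
    attempts on task i:
      Avg@k  = mean success rate over all attempts,
      Pass@k = fraction of tasks with at least one successful attempt,
      Pass^k = fraction of tasks where every attempt succeeds (stability).
    """
    avg_at_k = mean(mean(runs) for runs in task_runs)
    pass_at_k = mean(any(runs) for runs in task_runs)
    pass_pow_k = mean(all(runs) for runs in task_runs)
    return {"Avg@k": avg_at_k, "Pass@k": pass_at_k, "Pass^k": pass_pow_k}

# Example: 3 tasks, 4 attempts each
print(vitabench_metrics([[True, False, True, False],
                         [False, False, False, False],
                         [True, True, True, True]]))
```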

Experimental Results

Cross‑scene tasks remain extremely challenging: the best model attains only 30.0% Avg@4, far below the 48.3% on single‑scene tasks. Multiple attempts improve Pass@4 (up to 60%) but Pass^4 stays near zero, indicating instability. “Thinking” models with chain‑of‑thought reasoning outperform non‑thinking counterparts by 5–8 points and require fewer interaction turns.

Ablation Studies

Removing each complexity dimension degrades performance, confirming their impact:

Higher reasoning points correlate with lower success rates.

Larger tool graphs (more nodes/edges) increase difficulty.

Directly providing full instructions (no user interaction) boosts success, while realistic user simulation drops performance by 15–25%.

Failure Analysis

Three dominant error categories were identified: reasoning errors (≈62%), tool errors (≈21%), and interaction errors (≈8%). Models often miss temporal or commonsense details, prematurely abandon tasks due to uncertainty about tool capabilities, and repeat ineffective actions when faced with ambiguous user requests.

Conclusion

VitaBench not only offers a comprehensive benchmark but also introduces the theoretical framework of “Agentic Task Complexity,” systematically quantifying the influence of reasoning, tool, and interaction dimensions on agent performance. It aims to drive the next stage of AI research toward truly usable agents in everyday life.

VitaBench is fully open‑source; see the project homepage, paper, code repository, dataset, and leaderboard for more details.

Tags: AI, LLM, Agent, benchmark, Reasoning, tool use, interaction
Written by Meituan Technology Team

Over 10,000 engineers powering China's leading lifestyle services e‑commerce platform, supporting hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.