How MobilityBench Measures the Real Power of AI Route‑Planning Agents

MobilityBench is an open‑source benchmark built from over 100 000 real user queries. It evaluates AI route‑planning agents in a deterministic sandbox with multi‑dimensional metrics, supports both the ReAct and Plan‑and‑Execute frameworks, and reveals performance gaps between open‑source and closed‑source models.

Amap Tech

Overview

MobilityBench is an open‑source benchmark for evaluating map‑based AI agents. It contains 100 000 anonymized real user queries collected from 22 countries and more than 350 cities. The data cover 11 distinct scenarios grouped into four task families: basic information query (36.6 %), route information query (9.6 %), basic route planning (42.5 %), and preference‑constrained planning (11.3 %).

Data overview

Task Taxonomy

The 11 scenarios were derived through open‑set labeling of long‑tail intents, model‑generated candidate intents, and multiple rounds of expert review, ensuring mutually exclusive and comprehensive coverage.

Ground‑Truth Construction

For each query, a minimal tool‑call sequence (the “minimal tool invocation”) is generated according to a standard operating procedure defined by domain experts. The full execution trace and its intermediate results are stored as a truth file; the format is illustrated in the figure below.
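The truth‑file schema is shown in the figure rather than spelled out in the text; as a rough sketch of what one record might contain (all field names below are illustrative assumptions, not the benchmark's actual schema):

```python
import json

# Hypothetical truth-file record for a single query. Field names are
# illustrative assumptions; the real schema is defined by the benchmark.
truth_record = {
    "query": "Drive from the airport to the city center, avoiding tolls",
    "minimal_tool_calls": [                     # the minimal tool invocation
        {"tool": "geocode", "args": {"address": "airport"}},
        {"tool": "geocode", "args": {"address": "city center"}},
        {"tool": "route",   "args": {"strategy": "avoid_tolls"}},
    ],
    "intermediate_results": [],                 # cached API response per call
    "final_answer": "Take the S-road bypass; about 25 minutes without tolls.",
}

# Serialize the expected call sequence for comparison against an agent's trace
expected_trace = json.dumps(truth_record["minimal_tool_calls"], sort_keys=True)
```

Storing the expected calls in a canonical serialized form makes it straightforward to diff an agent's actual trace against the minimal one.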

Ground‑truth format

Deterministic API Sandbox

Amap’s live map APIs vary with traffic and service availability, which hampers reproducible evaluation. MobilityBench builds a deterministic sandbox that captures and caches route and POI API responses during a build phase and replays them unchanged during evaluation. The sandbox also provides fuzzy and nearest‑neighbor matching to tolerate minor variations, ensuring that identical inputs always produce identical outputs.
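The record‑and‑replay idea can be sketched as a small cache keyed on canonicalized requests. The class and method names below are hypothetical, and the fuzzy/nearest‑neighbor matching the sandbox provides is only noted in a comment:

```python
import hashlib
import json

class ReplaySandbox:
    """Sketch of a deterministic API sandbox: record live responses during a
    build phase, replay them unchanged during evaluation."""

    def __init__(self):
        self._cache = {}          # request key -> cached API response
        self.record_mode = True   # build phase records; eval phase replays

    def _key(self, endpoint, params):
        # Canonical JSON (sorted keys) so identical requests hash identically
        blob = json.dumps({"ep": endpoint, "p": params}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def call(self, endpoint, params, live_fn=None):
        key = self._key(endpoint, params)
        if key in self._cache:
            return self._cache[key]               # deterministic replay
        if self.record_mode and live_fn is not None:
            self._cache[key] = live_fn(endpoint, params)  # build phase
            return self._cache[key]
        # The real sandbox would fall back to fuzzy / nearest-neighbor
        # matching here to tolerate minor request variations; omitted.
        raise KeyError("no cached response for this request")

# Build phase: record one live response, then switch to replay-only mode.
sandbox = ReplaySandbox()
recorded = sandbox.call("route", {"from": "A", "to": "B"},
                        live_fn=lambda ep, p: {"eta_min": 17})
sandbox.record_mode = False
replayed = sandbox.call("route", {"from": "A", "to": "B"})
```

Because the cache key is a hash of canonically serialized inputs, identical inputs always hit the same cached entry, which is what makes repeated evaluation runs reproducible.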

Multi‑Dimensional Evaluation Protocol

Rather than a single success rate, the benchmark measures agents on five core dimensions: correctness, tool‑use efficiency, reasoning depth, token consumption, and robustness. Detailed metrics are visualized in the benchmark figures.
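A toy aggregation over the five dimensions might look like the following; the inputs, names, and formulas are illustrative assumptions, not the benchmark's actual scoring code:

```python
# Hypothetical per-run report card across the five evaluation dimensions.
def score_run(correct, n_tool_calls, min_tool_calls, reasoning_steps,
              tokens_used, survived_perturbations, n_perturbations):
    return {
        "correctness": 1.0 if correct else 0.0,
        # Ratio of the minimal tool-call count to what the agent actually used
        "tool_efficiency": min_tool_calls / max(n_tool_calls, 1),
        "reasoning_depth": reasoning_steps,
        "token_consumption": tokens_used,
        # Fraction of perturbed reruns the agent still passed
        "robustness": survived_perturbations / max(n_perturbations, 1),
    }

report = score_run(correct=True, n_tool_calls=4, min_tool_calls=3,
                   reasoning_steps=5, tokens_used=2100,
                   survived_perturbations=8, n_perturbations=10)
```

Keeping the dimensions separate, rather than collapsing them into one score, is what lets the benchmark expose trade‑offs such as higher pass rates bought with more tokens.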

Evaluation dimensions

Supported Agent Frameworks

MobilityBench can evaluate agents built with the ReAct paradigm and the Plan‑and‑Execute paradigm, the two dominant architectures for tool‑using large language model agents.
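The ReAct control flow can be sketched as a simple loop that alternates model steps and tool calls; the model and tool interfaces below are placeholder assumptions, not the benchmark's actual harness:

```python
# Minimal ReAct-style loop: the model proposes an action each step,
# tool observations are fed back, and the loop ends on a final answer.
def react_loop(model, tools, query, max_steps=10):
    history = [{"role": "user", "content": query}]
    for _ in range(max_steps):
        step = model(history)                 # a thought plus an action
        history.append(step)
        if step["action"] == "final_answer":
            return step["content"]
        observation = tools[step["action"]](**step["args"])
        history.append({"role": "tool", "content": observation})
    return None                               # step budget exhausted

def fake_model(history):
    # Scripted stand-in for an LLM: one tool call, then a final answer.
    if len(history) == 1:
        return {"role": "assistant", "action": "geocode",
                "args": {"address": "park"}, "content": ""}
    return {"role": "assistant", "action": "final_answer",
            "args": {}, "content": "Head north on Main St."}

tools = {"geocode": lambda address: f"lat/lng for {address}"}
answer = react_loop(fake_model, tools, "Where is the park?")
```

A Plan‑and‑Execute agent differs mainly in that the model first emits a full plan and a separate executor works through it, which is why the two paradigms trade off token usage against adaptability.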

Agent frameworks

Evaluation Results

Open‑source and closed‑source LLMs—including the Qwen series, DeepSeek, GPT, Claude, and Gemini—were tested under both frameworks. Key findings:

Closed‑source models still lead, but the gap narrows. Under Plan‑and‑Execute, Claude‑Opus‑4.5 achieves a final‑pass‑rate (FPR) of 65.77 %, while the open‑source Qwen‑3‑235B‑A22B reaches 64.16 % with lower inference cost.

Scaling laws hold. Larger Qwen models consistently improve performance.

ReAct vs. Plan‑and‑Execute. ReAct yields higher final pass rates but consumes roughly 35 % more tokens, increasing inference cost.

Thinking mode adds gain at high cost. Enabling the “Thinking” step raises Qwen‑3‑30B‑A3B’s FPR by 5.98 % but dramatically inflates token output, limiting real‑time deployment.

Result chart

Resources

Paper: https://arxiv.org/abs/2602.22638
Code repository: https://github.com/AMAP-ML/MobilityBench
Dataset: https://huggingface.co/datasets/GD-ML/MobilityBench
Written by Amap Tech

Official Amap technology account showcasing all of Amap's technical innovations.