How MobilityBench Measures the Real Power of AI Route‑Planning Agents
MobilityBench is an open-source benchmark built from over 100 000 real user queries. It evaluates AI route-planning agents in a deterministic sandbox with multi-dimensional metrics, supports both the ReAct and Plan-and-Execute frameworks, and reveals a narrowing performance gap between open-source and closed-source models.
Overview
MobilityBench is an open‑source benchmark for evaluating map‑based AI agents. It contains 100 000 anonymized real user queries collected from 22 countries and more than 350 cities. The data cover 11 distinct scenarios grouped into four task families: basic information query (36.6 %), route information query (9.6 %), basic route planning (42.5 %), and preference‑constrained planning (11.3 %).
Task Taxonomy
The 11 scenarios were derived through open‑set labeling of long‑tail intents, model‑generated candidate intents, and multiple rounds of expert review, ensuring mutually exclusive and comprehensive coverage.
Ground‑Truth Construction
For each query, a minimal tool-call sequence (the "minimal tool invocation") is generated according to a domain-expert standard operating procedure, and the full execution trace with intermediate results is stored as a truth file.
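To make this concrete, here is a minimal sketch of how such a truth file might be modeled in Python. The field names (`minimal_calls`, `final_answer`, and so on) are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical sketch of a truth-file record; field names are illustrative
# assumptions, not MobilityBench's actual schema.
@dataclass
class ToolCall:
    tool: str                      # e.g. "poi_search" or "route_planning"
    arguments: dict[str, Any]      # arguments prescribed by the expert SOP
    result: dict[str, Any]         # cached intermediate result of the call

@dataclass
class TruthFile:
    query: str                     # the anonymized user query
    scenario: str                  # one of the 11 scenarios
    minimal_calls: list[ToolCall]  # the minimal tool-invocation sequence
    final_answer: dict[str, Any]   # the expected final result of the trace
```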
Deterministic API Sandbox
Amap’s live map APIs vary with traffic and service availability, which hampers reproducible evaluation. MobilityBench builds a deterministic sandbox that captures and caches route and POI API responses during a build phase and replays them unchanged during evaluation. The sandbox also provides fuzzy and nearest‑neighbor matching to tolerate minor variations, ensuring that identical inputs always produce identical outputs.
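The capture-and-replay idea can be sketched in a few lines. The class below is an illustration under stated assumptions, not MobilityBench's implementation: it replays exact matches from a cache and, as one plausible form of fuzzy matching, snaps coordinate-bearing requests to the nearest recorded point.

```python
import hashlib
import json

# Sketch of a record/replay sandbox. The cache key canonicalizes the request
# so identical inputs always replay the identical cached response; the
# nearest-neighbor fallback is an assumed form of the fuzzy matching the
# benchmark describes.
class ReplaySandbox:
    def __init__(self):
        self.cache = {}           # key -> recorded API response
        self.coord_index = []     # (lat, lon, key) for nearest-neighbor lookup

    def _key(self, endpoint, params):
        canonical = json.dumps({"ep": endpoint, "p": params}, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def record(self, endpoint, params, response):
        # Build phase: capture a live API response for later replay.
        key = self._key(endpoint, params)
        self.cache[key] = response
        if "lat" in params and "lon" in params:
            self.coord_index.append((params["lat"], params["lon"], key))

    def replay(self, endpoint, params):
        # Evaluation phase: deterministic replay, never a live call.
        key = self._key(endpoint, params)
        if key in self.cache:
            return self.cache[key]
        if "lat" in params and "lon" in params and self.coord_index:
            # Fuzzy fallback: snap to the nearest recorded coordinate.
            lat, lon = params["lat"], params["lon"]
            _, nearest_key = min(
                ((lat - a) ** 2 + (lon - b) ** 2, k)
                for a, b, k in self.coord_index
            )
            return self.cache[nearest_key]
        raise KeyError("no recorded response for this request")
```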
Multi‑Dimensional Evaluation Protocol
Rather than a single success rate, the benchmark measures agents on five core dimensions: correctness, tool‑use efficiency, reasoning depth, token consumption, and robustness. Detailed metrics are visualized in the benchmark figures.
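As a rough illustration of how one episode might be scored along these dimensions, here is a hypothetical scoring function. The formulas are assumptions for exposition, not the benchmark's actual protocol.

```python
# Illustrative sketch of scoring one episode along the five dimensions;
# every formula here is an assumption, not MobilityBench's protocol.
def score_episode(agent_trace: dict, truth: dict) -> dict:
    # Correctness: did the agent reach the expected final result?
    correctness = float(agent_trace["final_answer"] == truth["final_answer"])

    # Tool-use efficiency: minimal required calls vs. calls actually made.
    minimal = len(truth["minimal_calls"])
    made = max(len(agent_trace["tool_calls"]), 1)
    efficiency = min(minimal / made, 1.0)

    # Reasoning depth: coverage of the tools the minimal trace requires.
    required = {c["tool"] for c in truth["minimal_calls"]}
    observed = {c["tool"] for c in agent_trace["tool_calls"]}
    depth = len(required & observed) / max(len(required), 1)

    # Token consumption: raw budget used by the episode.
    tokens = agent_trace["prompt_tokens"] + agent_trace["completion_tokens"]

    # Robustness: no malformed tool calls or runtime errors recorded.
    robustness = float(not agent_trace.get("errors"))

    return {"correctness": correctness, "efficiency": efficiency,
            "depth": depth, "tokens": tokens, "robustness": robustness}
```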
Supported Agent Frameworks
MobilityBench can evaluate agents built with the ReAct paradigm and the Plan‑and‑Execute paradigm, the two dominant architectures for tool‑using large language model agents.
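For readers unfamiliar with the two paradigms, the sketch below contrasts them in simplified form. The `llm` and `call_tool` callables and the `parse_action` helper are hypothetical placeholders; real frameworks handle prompting and parsing far more carefully.

```python
import json

def parse_action(step: str) -> tuple[str, dict]:
    # Hypothetical format: 'Action: tool_name {"json": "args"}'.
    line = step.split("Action:", 1)[1].strip().splitlines()[0]
    tool, _, args = line.partition(" ")
    return tool, json.loads(args or "{}")

def react_agent(query, llm, call_tool, max_steps=8):
    # ReAct: interleave a thought, an action, and an observation each step.
    history = f"Question: {query}\n"
    for _ in range(max_steps):
        step = llm(history + "Thought:")
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        tool, args = parse_action(step)
        observation = call_tool(tool, args)   # e.g. a sandboxed map API
        history += f"{step}\nObservation: {observation}\n"
    return None

def plan_and_execute_agent(query, llm, call_tool):
    # Plan-and-Execute: emit the whole tool plan up front, run it, then answer.
    plan = llm(f"List tool calls, one per line, to answer: {query}")
    results = [call_tool(*parse_action("Action: " + line))
               for line in plan.splitlines() if line.strip()]
    return llm(f"Question: {query}\nTool results: {results}\nAnswer:")
```

Note how ReAct resubmits the growing history on every step while Plan-and-Execute invokes the model only at planning and answering time, which is consistent with ReAct's higher token consumption reported below.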
Evaluation Results
Open‑source and closed‑source LLMs—including the Qwen series, DeepSeek, GPT, Claude, and Gemini—were tested under both frameworks. Key findings:
Closed-source models still lead, but the gap narrows. Under Plan-and-Execute, Claude-Opus-4.5 achieves a final pass rate (FPR) of 65.77 %, while the open-source Qwen-3-235B-A22B reaches 64.16 % at lower inference cost.
Scaling laws hold. Larger Qwen models consistently improve performance.
ReAct vs. Plan‑and‑Execute. ReAct yields higher final pass rates but consumes roughly 35 % more tokens, increasing inference cost.
Thinking mode adds gain at high cost. Enabling the “Thinking” step raises Qwen‑3‑30B‑A3B’s FPR by 5.98 % but dramatically inflates token output, limiting real‑time deployment.
Resources
Paper: https://arxiv.org/abs/2602.22638
Code repository: https://github.com/AMAP-ML/MobilityBench
Dataset: https://huggingface.co/datasets/GD-ML/MobilityBench