How ArenaRL Enables Open‑World Travel Agents to Learn via Comparative Reinforcement Learning
Gaode Maps and Tongyi DeepResearch unveil ArenaRL, an open‑domain reinforcement‑learning framework that replaces absolute scoring with relative ranking, uses self‑play and a linear‑complexity tournament, and demonstrates measurable gains on POI ranking and complex travel‑planning tasks.
Problem Context
Open‑ended travel queries often contain vague intents, multi‑dimensional dynamic constraints (time windows, budget, real‑time traffic, weather, companion preferences) and a huge solution space with no single correct answer. Traditional reinforcement learning that relies on absolute scalar rewards becomes noisy as models improve, leading to “discriminator collapse” where the reward signal can no longer distinguish between many acceptable candidates.
ArenaRL: Comparative Reinforcement Learning for Open‑Domain Agents
ArenaRL converts hard‑to‑score problems into stable relative‑comparison signals. Instead of assigning an absolute score to each generated plan, the method ranks multiple candidates and derives advantage signals from pairwise or groupwise comparisons. This enables continuous self‑evolution of agents without external labels.
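The comparison-to-advantage idea can be sketched in a few lines. This is a minimal illustration, not ArenaRL's actual implementation: `judge_prefers` is a hypothetical pairwise judge, and the centered win rate stands in for the group-relative advantage signal described above.

```python
def pairwise_advantages(candidates, judge_prefers):
    """Derive relative advantages from pairwise comparisons.

    `judge_prefers(a, b)` returns True if candidate `a` beats `b`.
    Each candidate's win rate against the rest of the group, centered
    so the group mean is zero, replaces a noisy absolute scalar reward.
    """
    n = len(candidates)
    wins = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and judge_prefers(candidates[i], candidates[j]):
                wins[i] += 1
    # Win rate over the other n-1 opponents.
    rates = [w / (n - 1) for w in wins]
    mean = sum(rates) / n
    # Centering yields positive advantages for above-average candidates
    # and negative for below-average ones.
    return [r - mean for r in rates]
```

Because the signal is relative, it stays informative even when every candidate is individually "acceptable" — exactly the regime where absolute scoring collapses.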
Key Techniques
Ranking instead of scoring – For a given user instruction the agent produces several candidate solutions. Pairwise or groupwise comparisons yield a binary “better / worse” signal, which is more robust than noisy scalar rewards.
Self‑play driven evolution – Candidates compete, the weaker ones are eliminated, and the survivors are iteratively refined. This self‑play loop pushes policies toward higher performance without a fixed reward model.
Linear‑complexity elimination tournament – A seeded single‑elimination knockout tournament approximates the signal of a full round‑robin comparison while keeping computational cost at O(N). This structure makes online training feasible at scale.
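To see why the tournament is linear in cost: a single-elimination bracket over N candidates plays exactly N−1 matches (each match eliminates one candidate), versus the N(N−1)/2 comparisons of a full round-robin. A minimal sketch, assuming a hypothetical pairwise judge `judge_prefers`:

```python
import random

def knockout_winner(candidates, judge_prefers, seed=0):
    """Seeded single-elimination tournament over candidate solutions.

    Plays exactly N-1 matches total (O(N)), since every match
    eliminates one candidate, versus O(N^2) for a full round-robin.
    `judge_prefers(a, b)` returns True if `a` beats `b`.
    """
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)  # random seeding of the bracket
    while len(pool) > 1:
        survivors = []
        # Pair adjacent entrants; the loser of each match is eliminated.
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            survivors.append(a if judge_prefers(a, b) else b)
        if len(pool) % 2 == 1:
            survivors.append(pool[-1])  # odd entrant gets a bye
        pool = survivors
    return pool[0]
```

With a consistent (transitive) judge the bracket always surfaces the best candidate; with a noisy judge, seeding and repeated self-play rounds smooth out occasional upsets.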
Empirical Validation
Real‑world A/B tests on Gaode Maps show substantial gains:
POI search ranking metric improved from 75 to 83.
Open‑ended travel‑planning metric rose from 69 to 80.
Typical query categories that benefit from ArenaRL include:
Multi‑constraint city‑tour planning (e.g., low‑crowd, stroller‑friendly greenways, budget‑friendly food stops, no stairs).
Dual‑origin meeting routes with cultural stops and low‑traffic paths.
Weather‑driven itineraries that adapt to rain and limit travel time between points.
Task‑list style closed‑loop routes with diverse point‑of‑interest types and fallback options.
Open‑Source Release
The ArenaRL algorithm and its training framework, qqr (built on the slime architecture), are publicly released. The repository provides adapters for the MCP environment, so developers can plug the framework into local or remote pipelines and reproduce the self‑evolution workflow.
Repository and paper links:
Paper / HuggingFace: https://huggingface.co/papers/2601.06487
GitHub: https://github.com/Alibaba-NLP/qqr
Outlook
ArenaRL aims to replace fragile absolute scoring with robust comparative signals, enabling scalable self‑evolution for open‑domain agents across diverse real‑world tasks.
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.