How ArenaRL Enables Open‑World Travel Agents to Learn via Comparative Reinforcement Learning

Gaode Maps and Tongyi DeepResearch unveil ArenaRL, an open‑domain reinforcement‑learning framework that replaces absolute scoring with relative ranking, uses self‑play and a linear‑complexity tournament, and demonstrates measurable gains on POI ranking and complex travel‑planning tasks.

Amap Tech

Problem Context

Open‑ended travel queries often contain vague intents, multi‑dimensional dynamic constraints (time windows, budget, real‑time traffic, weather, companion preferences) and a huge solution space with no single correct answer. Traditional reinforcement learning that relies on absolute scalar rewards becomes noisy as models improve, leading to “discriminator collapse” where the reward signal can no longer distinguish between many acceptable candidates.

ArenaRL: Comparative Reinforcement Learning for Open‑Domain Agents

ArenaRL converts hard‑to‑score problems into stable relative‑comparison signals. Instead of assigning an absolute score to each generated plan, the method ranks multiple candidates and derives advantage signals from pairwise or groupwise comparisons. This enables continuous self‑evolution of agents without external labels.
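The mechanics can be illustrated with a minimal sketch (hypothetical names, not the released qqr implementation): given a comparative judge that only answers "is A better than B?", each candidate's advantage is its centered win rate across all pairwise matchups.

```python
from itertools import combinations

def pairwise_advantages(candidates, prefer):
    """Derive advantage signals from pairwise comparisons.

    `prefer(a, b)` returns True if candidate a beats candidate b.
    Each candidate's advantage is its centered, normalized win count,
    so above-average candidates get positive advantage and the signals
    sum to zero across the group.
    """
    n = len(candidates)
    wins = [0] * n
    for i, j in combinations(range(n), 2):
        if prefer(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    mean = sum(wins) / n
    return [(w - mean) / max(n - 1, 1) for w in wins]

# Toy preference: longer plan "wins" (a stand-in for a learned judge).
plans = ["A", "BB", "CCC"]
adv = pairwise_advantages(plans, lambda a, b: len(a) > len(b))
```

Because only relative order matters, the signal stays informative even when all candidates are individually acceptable, which is exactly the regime where absolute scalar rewards collapse.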

Key Techniques

Ranking instead of scoring – For a given user instruction the agent produces several candidate solutions. Pairwise or groupwise comparisons yield a binary “better / worse” signal, which is more robust than noisy scalar rewards.

Self‑play driven evolution – Candidates compete, the weaker ones are eliminated, and the survivors are iteratively refined. This self‑play loop pushes policies toward higher performance without a fixed reward model.

Linear‑complexity elimination tournament – A seeded single‑elimination knockout tournament approximates a full round‑robin comparison while keeping the cost at O(N) comparisons rather than O(N²). This structure makes online training tractable at scale.
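The tournament idea can be sketched as follows (a simplified illustration under assumed details, not the paper's exact bracket): N candidates need only N−1 comparisons, and losers are ranked by the round in which they were eliminated.

```python
import random

def knockout_ranking(candidates, prefer, seed=0):
    """Approximate a full ranking with a single-elimination bracket.

    Uses N-1 comparisons (O(N)) instead of the O(N^2) a round-robin
    would need. `prefer(a, b)` returns True if a beats b. Candidates
    eliminated in earlier rounds rank lower; ties within a round are
    broken arbitrarily, which is the approximation being made.
    """
    rng = random.Random(seed)
    field = list(candidates)
    rng.shuffle(field)        # seeding: random here; could use a prior
    comparisons = 0
    eliminated = []           # earlier-eliminated = worse rank
    while len(field) > 1:
        next_round = []
        if len(field) % 2 == 1:
            next_round.append(field.pop())   # bye for an odd participant
        for a, b in zip(field[::2], field[1::2]):
            comparisons += 1
            winner, loser = (a, b) if prefer(a, b) else (b, a)
            next_round.append(winner)
            eliminated.append(loser)
        field = next_round
    return [field[0]] + eliminated[::-1], comparisons  # champion first
```

For 8 candidates this runs 4 + 2 + 1 = 7 comparisons, versus 28 for a round-robin, while still surfacing the strongest candidate at the top of the ranking.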

Empirical Validation

Real‑world A/B tests on Gaode Maps show substantial gains:

POI search ranking metric improved from 75 to 83.

Open‑ended travel‑planning metric rose from 69 to 80.

Typical query categories that benefit from ArenaRL include:

Multi‑constraint city‑tour planning (e.g., low‑crowd, stroller‑friendly greenways, budget‑friendly food stops, no stairs).

Dual‑origin meeting routes with cultural stops and low‑traffic paths.

Weather‑driven itineraries that adapt to rain and limit travel time between points.

Task‑list style closed‑loop routes with diverse point‑of‑interest types and fallback options.

Open‑Source Release

The ArenaRL algorithm and its training framework, qqr (built on the slime architecture), are released publicly. The repository provides adapters for MCP environments, allowing developers to plug the framework into local or remote tool pipelines and reproduce the self‑evolution workflow.

Repository and paper links:

Paper / HuggingFace: https://huggingface.co/papers/2601.06487

GitHub: https://github.com/Alibaba-NLP/qqr

Outlook

ArenaRL aims to replace fragile absolute scoring with robust comparative signals, enabling scalable self‑evolution for open‑domain agents across diverse real‑world tasks.

Tags: ranking, reinforcement learning, open-domain, self-play, travel AI, ArenaRL
Written by Amap Tech

Official Amap technology account showcasing all of Amap's technical innovations.