Applying Reinforcement Learning to Hotel Ranking at Ctrip: Challenges, Solutions, and Preliminary Results
This article examines the limitations of traditional learning‑to‑rank for Ctrip hotel sorting, introduces reinforcement learning as a remedy, outlines three progressive implementation plans (A, B, C) with algorithm choices and engineering trade‑offs, and presents early experimental findings that demonstrate RL's potential to improve conversion rates.
In Ctrip's hotel ranking system, most problems have traditionally been tackled with learning‑to‑rank (L2R). L2R relies on offline‑collected data, which often fails to match the distribution of new business scenarios, so performance degrades whenever business rules change.
The authors identify two concrete issues: (1) adjusting rankings for hotels that appear advantaged or disadvantaged in internal vs. external price comparisons, and (2) predicting user behavior when historically low‑ranked hotels are promoted in personalized or advertising feeds. Both issues stem from the inability to collect data that reflects the target distribution.
To address these gaps, the team proposes incorporating reinforcement learning (RL) to enable exploration and long‑term reward maximization. RL can generate actions (e.g., weight adjustments) and receive feedback (rewards) from real user interactions, thereby learning to handle unseen scenarios.
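The action‑reward loop described above can be sketched as a minimal epsilon‑greedy agent. This is an illustrative toy, not Ctrip's system: the action set, the `simulated_reward` function, and all names are assumptions, and the reward here stands in for conversion feedback that would in practice arrive from logged user interactions.

```python
import random

ACTIONS = [-0.1, 0.0, 0.1]  # candidate weight adjustments (illustrative)

def simulated_reward(action):
    # Toy stand-in for the observed conversion-rate change; in production
    # this feedback would come from real user behavior.
    return 1.0 if action > 0 else 0.0

def run_episode(steps=100, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    value = {a: 0.0 for a in ACTIONS}   # running value estimate per action
    count = {a: 0 for a in ACTIONS}
    for _ in range(steps):
        if rng.random() < epsilon:       # explore: try a random adjustment
            a = rng.choice(ACTIONS)
        else:                            # exploit: best estimate so far
            a = max(ACTIONS, key=lambda x: value[x])
        r = simulated_reward(a)
        count[a] += 1
        value[a] += (r - value[a]) / count[a]  # incremental mean update
    return max(ACTIONS, key=lambda x: value[x])
```

The exploration step is exactly what L2R lacks: the agent occasionally takes actions with no historical data behind them and learns from the resulting feedback.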
The paper outlines three incremental plans:
Plan A: a small‑scale pilot using DQN to adjust a few weight dimensions of the existing linear ranking model at the city level, with the action‑reward exchange handled via Kafka.
Plan B: replace Kafka with Storm for real‑time streaming, allowing finer granularity (hotel‑level or user‑level) and richer feature embeddings for hotels and users.
Plan C: focus on algorithmic refinement, exploring continuous‑action methods such as DDPG and TD3, policy‑gradient approaches (A2C, TRPO), and Gaussian‑process‑based Thompson sampling.
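To make the Thompson‑sampling idea in Plan C concrete, here is a simplified Beta‑Bernoulli variant over a discretized set of adjustment factors. The article proposes a Gaussian‑process version for continuous actions; this discrete sketch is an assumption‑laden simplification, and the candidate values and class names are invented for illustration.

```python
import random

CANDIDATE_ALPHAS = [0.8, 0.9, 1.0, 1.1, 1.2]  # illustrative adjustment factors

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over discrete adjustment factors."""

    def __init__(self, arms):
        self.arms = arms
        self.successes = {a: 1 for a in arms}  # Beta(1, 1) uniform prior
        self.failures = {a: 1 for a in arms}

    def choose(self, rng):
        # Sample a plausible conversion rate per arm, pick the best sample.
        samples = {a: rng.betavariate(self.successes[a], self.failures[a])
                   for a in self.arms}
        return max(samples, key=samples.get)

    def update(self, arm, converted):
        # Bayesian update from one observed conversion outcome.
        if converted:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1
```

The posterior sampling naturally balances exploration and exploitation: uncertain arms occasionally draw high samples and get tried, while clearly inferior arms are sampled less and less often.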
Implementation details include defining the agent's action as an adjustment factor alpha applied to the base linear model, constructing a reward function that measures conversion‑rate (CR) improvement, and setting the DQN discount factor γ to 0 to prioritize immediate CR gains.
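These implementation details can be sketched in a few lines. `adjusted_score`, `td_target`, and the chosen dimension set are hypothetical names, not the production model; the point is that applying α to selected weights leaves the rest of the linear model untouched, and that setting γ = 0 collapses the DQN target to the immediate reward.

```python
def adjusted_score(weights, features, alpha, adjusted_dims):
    # Base linear ranking score, with alpha scaling only the selected
    # weight dimensions (the agent's action in Plan A).
    return sum((w * alpha if i in adjusted_dims else w) * x
               for i, (w, x) in enumerate(zip(weights, features)))

def td_target(reward, gamma, max_next_q):
    # Standard DQN target: r + gamma * max_a' Q(s', a').
    # With gamma = 0 this is just r, so the agent optimizes
    # the immediate conversion-rate gain only.
    return reward + gamma * max_next_q
```

Setting γ = 0 is a deliberate trade‑off: it discards long‑term credit assignment but makes learning from sparse, delayed conversion feedback much more tractable.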
Practical challenges encountered include latency in the Kafka‑based data pipeline, mismatched timing between actions and the corresponding weight updates, and a limited update frequency (hourly to every three hours), all of which slow learning.
Preliminary online experiments in four major cities (Beijing, Shanghai, Guangzhou, Shenzhen) show a positive trend in cumulative reward, though the absolute values remain modest, indicating the need for further testing.
In conclusion, the authors argue that RL can complement L2R by providing exploration capability and long‑term optimization, and they plan to enhance the data processing architecture with streaming solutions to unlock greater RL effectiveness.
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.