Applying Reinforcement Learning to Hotel Ranking at Ctrip: Challenges, Solutions, and Preliminary Results
This article examines the limitations of traditional learning‑to‑rank for Ctrip hotel sorting, introduces reinforcement learning as a remedy, outlines three progressive implementation plans (A, B, C) with algorithm choices and engineering trade‑offs, and presents early experimental findings that demonstrate RL's potential to improve conversion rates.
In Ctrip's hotel ranking system, most problems have traditionally been tackled with learning‑to‑rank (L2R). L2R relies on offline‑collected data, which often fails to match the distribution of new business scenarios, so performance degrades whenever business rules change.
The authors identify two concrete issues: (1) adjusting rankings for hotels that appear advantaged or disadvantaged in internal vs. external price comparisons, and (2) predicting user behavior when historically low‑ranked hotels are promoted in personalized or advertising feeds. Both issues stem from the inability to collect data that reflects the target distribution.
To address these gaps, the team proposes incorporating reinforcement learning (RL) to enable exploration and long‑term reward maximization. RL can generate actions (e.g., weight adjustments) and receive feedback (rewards) from real user interactions, thereby learning to handle unseen scenarios.
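The action‑reward loop described above can be sketched as a minimal epsilon‑greedy agent. This is an illustrative toy, not Ctrip's system: the action set, the `simulated_reward` function, and all names are assumptions, and the reward here stands in for conversion feedback that would in practice arrive from logged user interactions.

```python
import random

ACTIONS = [-0.1, 0.0, 0.1]  # candidate weight adjustments (illustrative)

def simulated_reward(action):
    # Toy stand-in for the observed conversion-rate change; in production
    # this feedback would come from real user behavior.
    return 1.0 if action > 0 else 0.0

def run_episode(steps=100, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    value = {a: 0.0 for a in ACTIONS}   # running value estimate per action
    count = {a: 0 for a in ACTIONS}
    for _ in range(steps):
        if rng.random() < epsilon:       # explore: try a random adjustment
            a = rng.choice(ACTIONS)
        else:                            # exploit: best estimate so far
            a = max(ACTIONS, key=lambda x: value[x])
        r = simulated_reward(a)
        count[a] += 1
        value[a] += (r - value[a]) / count[a]  # incremental mean update
    return max(ACTIONS, key=lambda x: value[x])
```

The exploration step is exactly what L2R lacks: the agent occasionally takes actions with no historical data behind them and learns from the resulting feedback.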
The paper outlines three incremental plans:
Plan A: a small‑scale pilot using DQN to adjust a few weight dimensions of the existing linear ranking model at the city level, with the action‑reward exchange handled via Kafka.
Plan B: replace Kafka with Storm for real‑time streaming, allowing finer granularity (hotel‑level or user‑level) and richer feature embeddings for hotels and users.
Plan C: focus on algorithmic refinement, exploring continuous‑action methods such as DDPG and TD3, policy‑gradient approaches (A2C, TRPO), and Gaussian‑process‑based Thompson sampling.
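To make the Thompson‑sampling idea in Plan C concrete, here is a simplified Beta‑Bernoulli variant over a discretized set of adjustment factors. The article proposes a Gaussian‑process version for continuous actions; this discrete sketch is an assumption‑laden simplification, and the candidate values and class names are invented for illustration.

```python
import random

CANDIDATE_ALPHAS = [0.8, 0.9, 1.0, 1.1, 1.2]  # illustrative adjustment factors

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over discrete adjustment factors."""

    def __init__(self, arms):
        self.arms = arms
        self.successes = {a: 1 for a in arms}  # Beta(1, 1) uniform prior
        self.failures = {a: 1 for a in arms}

    def choose(self, rng):
        # Sample a plausible conversion rate per arm, pick the best sample.
        samples = {a: rng.betavariate(self.successes[a], self.failures[a])
                   for a in self.arms}
        return max(samples, key=samples.get)

    def update(self, arm, converted):
        # Bayesian update from one observed conversion outcome.
        if converted:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1
```

The posterior sampling naturally balances exploration and exploitation: uncertain arms occasionally draw high samples and get tried, while clearly inferior arms are sampled less and less often.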
Implementation details include defining the agent's action as an adjustment factor alpha applied to the base linear model, constructing a reward function that measures conversion‑rate (CR) improvement, and setting the DQN discount factor γ to 0 to prioritize immediate CR gains.
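These implementation details can be sketched in a few lines. `adjusted_score`, `td_target`, and the chosen dimension set are hypothetical names, not the production model; the point is that applying α to selected weights leaves the rest of the linear model untouched, and that setting γ = 0 collapses the DQN target to the immediate reward.

```python
def adjusted_score(weights, features, alpha, adjusted_dims):
    # Base linear ranking score, with alpha scaling only the selected
    # weight dimensions (the agent's action in Plan A).
    return sum((w * alpha if i in adjusted_dims else w) * x
               for i, (w, x) in enumerate(zip(weights, features)))

def td_target(reward, gamma, max_next_q):
    # Standard DQN target: r + gamma * max_a' Q(s', a').
    # With gamma = 0 this is just r, so the agent optimizes
    # the immediate conversion-rate gain only.
    return reward + gamma * max_next_q
```

Setting γ = 0 is a deliberate trade‑off: it discards long‑term credit assignment but makes learning from sparse, delayed conversion feedback much more tractable.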
Practical challenges encountered include latency in the Kafka‑based data pipeline, mismatched timing between actions and the corresponding weight updates, and a limited update frequency (hourly to every three hours), all of which slow learning.
Preliminary online experiments in four major cities (Beijing, Shanghai, Guangzhou, Shenzhen) show a positive trend in cumulative reward, though the absolute values remain modest, indicating the need for further testing.
In conclusion, the authors argue that RL can complement L2R by providing exploration capability and long‑term optimization, and they plan to enhance the data processing architecture with streaming solutions to unlock greater RL effectiveness.
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.