Reinforcement Learning for Meituan's "Guess You Like" Recommendation Ranking

Meituan enhanced its homepage "Guess You Like" recommendation slot by modeling user-item interactions as a Markov Decision Process and applying an improved DDPG reinforcement-learning agent. The agent adjusts the ranking trade-off parameter, uses advantage-based Q decomposition, shares weights between actor and critic, and runs in a real-time TensorFlow pipeline, delivering consistent lifts in click-through, dwell time, and depth.

Meituan Technology Team

Reinforcement learning (RL) is one of the fastest‑growing fields in machine learning, and its combination with recommendation and ranking models offers significant untapped value. This article describes how Meituan applied RL to the "Guess You Like" slot, the largest traffic‑driven recommendation placement on its homepage.

1. Overview

The baseline ranking model for "Guess You Like" is a streaming Wide&Deep model. Because it scores each candidate independently (point-wise), it does not model correlations among candidate items, and the system struggles to fully capture user intent. RL is introduced to optimize long-term reward over multi-round interactions between the agent (the recommender) and the environment (the user).

2. MDP Modeling

The interaction is modeled as a Markov Decision Process (MDP): the state represents the agent's observation of user intent and context, the action adjusts the recommended list as a whole (list-wise), and the reward reflects user feedback (clicks, orders). The transition probability P(s' | s, a) captures how an action taken in state s affects the next state s'. Reward shaping adds penalties for pages that produce no conversion and for sessions in which the user leaves without converting.
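The long-term objective in this MDP can be illustrated with the discounted return of one interaction episode. This is a minimal sketch under assumptions: the `Transition` fields, the scalar action, and the discount factor `gamma` are hypothetical and not taken from the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transition:
    """One agent-user interaction step (field names are illustrative)."""
    state: List[float]  # hypothetical feature vector for user intent and context
    action: float       # hypothetical scalar action chosen by the agent
    reward: float       # shaped feedback observed for this page


def discounted_returns(episode: List[Transition], gamma: float = 0.9) -> List[float]:
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(episode)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in range(len(episode) - 1, -1, -1):
        running = episode[t].reward + gamma * running
        returns[t] = running
    return returns
```

An agent that maximizes G_t rather than the immediate reward r_t is credited for a page that earns nothing itself but leads to clicks or orders on later pages.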

2.1 State Modeling

State features are extracted via a network that combines a 1-D CNN over item embeddings of the user's real-time behavior sequence with dense and embedding features for time, location, scene, and long-term habits. Binary sequence encoding is used to represent discretized user behavior across multiple time windows.
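The source does not spell out the binary encoding. One plausible reading, sketched here purely as an assumption, is one bit per time window that records whether the user showed a given behavior in that window. The function name and the window boundaries are hypothetical.

```python
from typing import List, Sequence


def encode_windows(event_ages_sec: Sequence[float],
                   window_edges_sec: Sequence[float]) -> List[int]:
    """Return one bit per window: 1 if any event age falls in [lo, hi).

    event_ages_sec: how long ago each behavior happened, in seconds.
    window_edges_sec: ascending upper bounds of consecutive windows.
    """
    bits = []
    lo = 0.0
    for hi in window_edges_sec:
        bits.append(int(any(lo <= age < hi for age in event_ages_sec)))
        lo = hi
    return bits


# Assumed windows: last minute, last hour, last day.
WINDOWS = [60.0, 3600.0, 86400.0]
recent_clicks = encode_windows([30.0, 4000.0], WINDOWS)  # [1, 0, 1]
```

A fixed-length bit vector like this can be fed to the network as dense input alongside the CNN output, regardless of how many raw events the user produced.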

2.2 Action Design

The existing ranking system consists of two structurally identical Wide&Deep models, one trained on the click target and one on the purchase target, whose outputs are fused with a trade-off parameter φ. The RL agent's action a adjusts φ. Because a = 1 reproduces the baseline ranking, the policy can depart from the baseline smoothly, and the action can be clipped to mitigate instability.
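The role of the action can be sketched as follows. The fusion formula below is an assumption chosen only so that a = 1 leaves the baseline unchanged; the source does not give the actual formula, and the clipping bounds and function names are hypothetical.

```python
from typing import List, Tuple


def clip_action(a: float, low: float = 0.5, high: float = 1.5) -> float:
    """Keep the action inside an assumed safe range around the baseline a = 1."""
    return max(low, min(high, a))


def fused_score(p_click: float, p_order: float, phi: float, a: float = 1.0) -> float:
    """Assumed fusion: the action rescales phi, so a = 1 is the baseline score."""
    return p_click * (1.0 + clip_action(a) * phi * p_order)


def rank_items(items: List[Tuple[str, float, float]], phi: float, a: float = 1.0) -> List[str]:
    """Sort (item_id, p_click, p_order) tuples by fused score, best first."""
    ranked = sorted(items, key=lambda it: fused_score(it[1], it[2], phi, a), reverse=True)
    return [item_id for item_id, _, _ in ranked]


candidates = [("x", 0.20, 0.10), ("y", 0.25, 0.00)]
baseline_order = rank_items(candidates, phi=2.0, a=1.0)  # ["y", "x"]
shifted_order = rank_items(candidates, phi=2.0, a=1.5)   # ["x", "y"]
```

In this toy example, raising the action from 1.0 to 1.5 gives more weight to the purchase model and moves item "x" ahead of item "y", while clipping bounds how far a single decision can move the list away from the baseline.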

2.3 Reward Shaping

The reward is defined to directly optimize click-through and order-conversion rates, with additional penalty terms to discourage non-converting intermediate pages and user abandonment.
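A shaped reward of this kind might look like the sketch below. All weights and penalty values are hypothetical placeholders, and treating "conversion" as any click or order on the page is an assumption rather than the source's definition.

```python
def shaped_reward(clicks: int,
                  orders: int,
                  is_last_page: bool,
                  click_weight: float = 1.0,
                  order_weight: float = 5.0,
                  no_conversion_penalty: float = 0.1,
                  abandon_penalty: float = 0.5) -> float:
    """Reward for one page: positive feedback minus shaping penalties.

    All numeric defaults are illustrative, not values from the source.
    """
    reward = click_weight * clicks + order_weight * orders
    converted = clicks > 0 or orders > 0  # assumed definition of conversion
    if not converted:
        # Penalize a page that produced no feedback at all.
        reward -= no_conversion_penalty
        if is_last_page:
            # Penalize more when the user also left on this page.
            reward -= abandon_penalty
    return reward
```

The penalties give the agent a learning signal on the many pages with no positive feedback, which would otherwise all receive a reward of zero.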

3. Improved DDPG Model

Initial attempts with Q-Learning, DQN, and vanilla DDPG faced instability, low sample efficiency, and slow convergence due to limited online traffic. Improvements include:

Introducing an Advantage function to decompose Q(s,a) into V(s) + A(s,a), reducing overestimation.

Sharing state‑related weights between Actor and Critic to halve parameter count.

Adopting an on‑policy update (A2C style) for more stable gradients.

Extending the architecture to support multiple parallel agents for concurrent experiments.

These changes yielded a stable positive impact: +0.5% click‑through, +0.3% dwell time, and +0.3% depth while maintaining order rate.
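The first two improvements can be sketched together: a state encoder shared by the actor and the critic, with the critic computing Q(s,a) as V(s) + A(s,a). This is a toy illustration in plain Python with randomly initialized linear layers and no training step; the class name, layer sizes, and initialization are assumptions, not the production TensorFlow model.

```python
import math
import random
from typing import List


def linear(weights: List[List[float]], bias: List[float], x: List[float]) -> List[float]:
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


class SharedActorCritic:
    """Actor and critic heads on top of one shared state encoder."""

    def __init__(self, state_dim: int, hidden_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)

        def init(rows: int, cols: int) -> List[List[float]]:
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.w_shared, self.b_shared = init(hidden_dim, state_dim), [0.0] * hidden_dim
        self.w_actor, self.b_actor = init(1, hidden_dim), [0.0]
        self.w_value, self.b_value = init(1, hidden_dim), [0.0]
        # The advantage head also sees the action, hence hidden_dim + 1 inputs.
        self.w_adv, self.b_adv = init(1, hidden_dim + 1), [0.0]

    def encode(self, state: List[float]) -> List[float]:
        """Shared state representation used by both actor and critic."""
        return [math.tanh(z) for z in linear(self.w_shared, self.b_shared, state)]

    def act(self, state: List[float]) -> float:
        """Deterministic actor output for this state."""
        return linear(self.w_actor, self.b_actor, self.encode(state))[0]

    def value(self, state: List[float]) -> float:
        """V(s): depends on the state only."""
        return linear(self.w_value, self.b_value, self.encode(state))[0]

    def advantage(self, state: List[float], action: float) -> float:
        """A(s, a): depends on the state and the action."""
        return linear(self.w_adv, self.b_adv, self.encode(state) + [action])[0]

    def q_value(self, state: List[float], action: float) -> float:
        """Q(s, a) = V(s) + A(s, a)."""
        return self.value(state) + self.advantage(state, action)
```

Because `encode` is the only component that reads the raw state, its weights are stored once and reused by all three heads, which is where the parameter saving from weight sharing comes from.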

4. Lightweight Real-Time DRL System Based on TensorFlow

To support online learning, a pipeline was built that collects features and feedback from Kafka, assembles them into Monte Carlo (MC) episodes, trains DRL models with TensorFlow, and serves them via TF-Serving and Tair. Optimizations include incremental Z-Score normalization for dense features, dynamic embedding lookup, pre-processing of large item embeddings, configurable feature pipelines, and warm-up model initialization to avoid latency spikes during model updates.
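Incremental Z-Score normalization can be sketched with Welford's online algorithm, which updates the mean and variance one sample at a time without storing the stream. The source does not say which update rule the team used, so the choice of Welford's method, the class name, and the `eps` guard are assumptions.

```python
import math


class RunningZScore:
    """Streaming mean and variance for one dense feature (Welford's algorithm)."""

    def __init__(self, eps: float = 1e-8) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean
        self.eps = eps  # guards against division by zero

    def update(self, x: float) -> None:
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        """Population standard deviation of the samples seen so far."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / self.count)

    def normalize(self, x: float) -> float:
        """Z-score of x under the current running statistics."""
        return (x - self.mean) / (self.std + self.eps)


stats = RunningZScore()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)
# The running mean is now 5.0 and the running standard deviation is 2.0.
```

Keeping one such object per dense feature lets a streaming trainer normalize inputs as the feature distribution drifts, without a separate batch pass over historical data.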

5. Summary and Outlook

RL provides flexible reward shaping, expressive action design, and the ability to optimize long-term user value. Future work includes richer action spaces (e.g., recall counts, hidden-layer adjustments), priority sampling, and curiosity-driven exploration. The presented methods demonstrate that RL can deliver stable, incremental gains in large-scale industrial recommendation systems.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
