Reinforcement Learning for Product Ranking: Model Design, Experiments, and Online Deployment
This article presents a comprehensive study of reinforcement learning for e‑commerce product ranking: the limitations of traditional pointwise scoring, the design of context‑aware evaluation models, a pointer‑network‑based sequence generator, a comparison of RL training algorithms, extensive offline experiments, and online deployment results, closing with directions for future research.
The article begins by highlighting the importance of product ranking in e‑commerce, where the goal is to present high‑quality items early to match user intent. Traditional pointwise scoring models ignore contextual effects, leading to sub‑optimal rankings when the order of items influences conversion probabilities.
Three models are evaluated on 54 million daily samples: a simple DNN pointwise model, an AE‑rerank model that adds global statistical features of surrounding items, and an enhanced model that further incorporates CNN and RNN modules to capture sequential context. Experiments show that incorporating context consistently improves AUC and expected purchase metrics.
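The AE‑rerank idea can be sketched in a few lines: augment each item's own features with global statistics of the candidate list so the evaluator sees the item in its surrounding context. The feature layout below (4 raw dimensions, mean and standard deviation appended) is a hypothetical illustration, not the production feature set.

```python
from statistics import mean, pstdev

def ae_rerank_features(items):
    """AE-rerank-style context features: append the global mean and
    standard deviation of every feature over the whole candidate list
    to each item's own feature vector, so a downstream scorer can rate
    the item relative to the page it appears on."""
    cols = list(zip(*items))                              # per-feature columns
    ctx = [mean(c) for c in cols] + [pstdev(c) for c in cols]
    return [list(row) + ctx for row in items]

# Three candidates with 4 hypothetical raw features each.
items = [[1.0, 0.2, 3.0, 0.5],
         [0.5, 0.1, 2.0, 0.4],
         [2.0, 0.3, 1.0, 0.6]]
feats = ae_rerank_features(items)
print(len(feats), len(feats[0]))  # 3 12  (4 own + 4 mean + 4 std dims)
```

The CNN/RNN variant in the article goes further and replaces these hand‑built statistics with learned sequential context.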
Next, the authors discuss why directly sorting by item scores is insufficient. They propose a DCG‑like evaluation that multiplies predicted purchase probabilities by position discounts, demonstrating that this method better distinguishes good from bad sequences for both simple and context‑aware models.
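A minimal sketch of that DCG‑like sequence score, assuming the standard DCG discount of 1/log2(pos + 2) (the article does not state the exact discount form):

```python
import math

def expected_purchases(probs):
    """DCG-like sequence evaluation: each position's predicted purchase
    probability is multiplied by a log position discount, so purchases
    placed early count more than the same purchases placed late."""
    return sum(p / math.log2(i + 2) for i, p in enumerate(probs))

# Placing the high-probability item first yields a higher sequence score,
# so the metric can distinguish a good ordering from a bad one.
good = expected_purchases([0.9, 0.5, 0.1])
bad = expected_purchases([0.1, 0.5, 0.9])
print(good > bad)  # True
```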
To generate optimal sequences, a pointer‑network‑based generator is introduced. The generator takes a candidate set of M items and outputs an ordered subset of N items (N < M) using a single LSTM decoder and two‑branch DNN feature extraction. This design removes the need for fixed M and N values and reduces inference latency by half compared to a dual‑LSTM pointer network.
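The decoding loop of a pointer‑style generator can be illustrated without the neural machinery: at each step, pick the best remaining candidate, mask it out, and let the choice influence the next step's scores. In this toy sketch a simple additive `context_bonus` matrix (entirely hypothetical) stands in for the LSTM decoder state:

```python
def pointer_decode(scores, context_bonus, n):
    """Pointer-network-style greedy decoding sketch: select n of the M
    candidates one at a time, never repeating an item, and update the
    remaining scores after every pick (stand-in for the decoder state)."""
    scores = list(scores)
    picked = []
    for _ in range(n):
        i = max((j for j in range(len(scores)) if j not in picked),
                key=lambda j: scores[j])
        picked.append(i)
        # hypothetical context update: a picked item shifts the scores
        # of the remaining candidates
        scores = [s + context_bonus[i][j] for j, s in enumerate(scores)]
    return picked

base = [0.2, 0.9, 0.4, 0.7, 0.1]                # M = 5 candidate scores
bonus = [[0.0] * 5 for _ in range(5)]           # no interactions in this toy case
print(pointer_decode(base, bonus, 3))           # [1, 3, 2]
```

Because the loop runs for any M and n, nothing in this scheme fixes the candidate or output sizes, which mirrors the flexibility the article claims for the single‑decoder design.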
The generator is trained via reinforcement learning, where the evaluator model provides reward signals. Several RL algorithms are explored: the basic REINFORCE loss, a modified REINFORCE_2 that normalizes rewards, REINFORCE_3 with reward clipping, a policy‑based PPO implementation, and PPO_MC which estimates the baseline V‑value by Monte‑Carlo sampling of multiple generated sequences.
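The core of the PPO_MC baseline idea fits in a few lines: sample several sequences from the current policy, take their mean evaluator reward as the Monte‑Carlo V‑value estimate, and weight each sample by its advantage (reward minus baseline). The reward values below are made up for illustration:

```python
def mc_advantages(sample_rewards):
    """PPO_MC-style baseline sketch: the V-value is estimated by
    Monte-Carlo sampling -- the mean reward of several sequences drawn
    from the current policy. Each sample's policy-gradient term is then
    weighted by (reward - baseline): sequences better than the policy's
    own average get positive weight, worse ones negative."""
    baseline = sum(sample_rewards) / len(sample_rewards)
    return [r - baseline for r in sample_rewards]

adv = [round(a, 3) for a in mc_advantages([1.2, 0.8, 1.0, 0.6])]
print(adv)  # [0.3, -0.1, 0.1, -0.3]
```

The simpler REINFORCE variants in the article differ mainly in how this weighting term is normalized or clipped rather than in the overall structure.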
Offline experiments compare entropy convergence and the “better‑percent” (BP) metric, showing that PPO_MC converges fastest and achieves the highest BP. The authors also examine reward shaping strategies, such as subtracting the mean reward of the candidate set, to improve gradient signal quality.
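One plausible reading of the better‑percent metric (the article does not give a formal definition, so this is an assumption) is the share of requests where the generated sequence beats the production sequence under the evaluator:

```python
def better_percent(generated_scores, baseline_scores):
    """Hypothetical "better-percent" (BP) sketch: for paired requests,
    the fraction where the RL-generated sequence scores strictly higher
    under the evaluator than the production (baseline) sequence."""
    wins = sum(g > b for g, b in zip(generated_scores, baseline_scores))
    return wins / len(generated_scores)

print(better_percent([1.2, 0.9, 1.1, 0.7], [1.0, 1.0, 1.0, 1.0]))  # 0.5
```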
Different training data scopes (first‑page vs. first‑two‑pages) and candidate set sizes (17, 30, 50, 100) are evaluated, revealing that larger candidate pools generally yield better performance at the cost of higher latency.
To align optimization objectives with business goals, the reward is extended from expected purchase count (PAY) to expected gross merchandise value (GMV) and a weighted combination of the two (PAY_GMV). Offline results confirm that PAY excels at conversion, GMV at revenue, and PAY_GMV balances both.
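A minimal sketch of such a combined reward, assuming a linear trade‑off parameter `alpha` (the exact weighting scheme is not specified in the article):

```python
def pay_gmv_reward(pay_prob, price, alpha=0.5):
    """Hypothetical PAY_GMV reward: alpha trades off expected purchase
    count (PAY = pay_prob) against expected revenue (GMV = pay_prob * price).
    alpha=1 optimizes conversions only; alpha=0 optimizes GMV only."""
    return alpha * pay_prob + (1 - alpha) * pay_prob * price

print(round(pay_gmv_reward(0.1, 50.0, alpha=0.5), 2))  # 2.55
```

Because price scales can dwarf probabilities, in practice the two terms would need normalization before mixing; the sketch leaves that out for clarity.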
Online deployment involves daily retraining on the latest two weeks of data, with the evaluator trained for five hours and the generator for six hours before model rollout. Real‑world A/B tests during the National Day and Double‑11 events demonstrate significant lifts in click‑through, conversion, and GMV, validating the RL approach.
Future work includes integrating a GAIL framework that adds a discriminator to keep generated sequences close to the training distribution, addressing the limitations of AUC as an offline metric, and exploring world‑model based evaluators and real‑time RL systems to further close the offline‑online gap.
Overall, the study shows that reinforcement learning, when combined with a robust sequence evaluator and a flexible pointer‑network generator, can substantially improve product ranking performance in large‑scale e‑commerce platforms.
DataFunTalk