Applying Reinforcement Learning to E‑commerce Traffic Control: Practices and Future Directions
This talk by JD Retail's Zhao Yu explains how reinforcement learning is modeled and deployed for large‑scale traffic control during major sales events, detailing system architecture, reward design, offline simulation, model upgrades, and future research directions.
In this presentation, Zhao Yu, Ph.D., of JD Retail's Search Algorithm Department, introduces the use of reinforcement learning (RL) for traffic control during large‑scale e‑commerce promotions.
JD Traffic Control Technology – The system balances platform goals and long‑term value by adjusting product ranking positions to promote healthy merchant activity, provide precise traffic forecasts, and design incentive‑compatible strategies for events such as new product launches and major sales.
RL Problem Modeling – The environment is the user, actions are ranking score adjustments, rewards are derived from user feedback (primarily purchases), and the agent is the ranking policy. Three abstractions are defined: (1) achieving specific traffic targets by shifting product positions, (2) maintaining overall page efficiency (GMV per UV, i.e., gross merchandise value per unique visitor) while adjusting positions, and (3) treating the gap between real‑time sales and targets as a feedback‑control problem. Two solution routes are discussed: PID control and RL, with RL offering richer feature integration despite cold‑start challenges.
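The PID route treats the gap between real‑time sales and the target as the control error. A minimal sketch of that feedback loop is below; the gains `kp`, `ki`, `kd` and the interpretation of the output as a ranking boost/suppress signal are illustrative assumptions, not values from the talk.

```python
class PIDController:
    """Illustrative PID controller for the traffic-gap feedback loop.
    Gains are hypothetical; the talk does not specify them."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, target, actual, dt=1.0):
        error = target - actual                      # gap: target vs. real-time sales
        self.integral += error * dt                  # accumulated shortfall
        derivative = (error - self.prev_error) / dt  # trend of the gap
        self.prev_error = error
        # Positive output => boost the product's ranking score; negative => suppress.
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

A positive output when sales lag the target pushes the product up in ranking; the RL route replaces this fixed-gain loop with a learned policy that can also condition on richer features.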
RL Practice in JD Promotions – The "Moon Landing" project (since 2020) implements RL‑based traffic control, evolving from relative to absolute traffic adjustments across multiple major events (618, Double 11). The system comprises three parts: an offline RL training module, an online traffic allocation module, and an offline traffic replay system for cold‑start data generation.
RL Model Details – State features include time dimensions, traffic‑target metrics, and search efficiency indicators (CTR, CVR, UV value). Actions are discretized control factors in the range [‑20%, +20%] with 5% steps to ensure stable online deployment. The reward function balances target completion (C) with a penalty term when traffic exceeds the target, using parameters a, b > 1.
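The talk gives the discretized action range and the qualitative reward shape (reward completion C, penalize overshoot with a, b > 1) but not the exact functional form. The sketch below is one plausible instantiation under those constraints; the specific formula is an assumption.

```python
# Discretized control factors: [-20%, +20%] in 5% steps, as described in the talk.
ACTIONS = [i / 100 for i in range(-20, 21, 5)]   # 9 actions: -0.20, -0.15, ..., +0.20

def reward(traffic, target, a=2.0, b=1.5):
    """Hypothetical reward shape: completion ratio C is rewarded, and traffic
    exceeding the target incurs a super-linear penalty (a, b > 1 per the talk).
    The exact formula is an assumption for illustration."""
    C = min(traffic / target, 1.0)        # target-completion ratio, capped at 1
    if traffic <= target:
        return C
    overshoot = traffic / target - 1.0
    return C - b * overshoot ** a         # penalize overshoot to conserve traffic budget
```

The penalty discourages the policy from overspending traffic on products that have already met their targets, freeing exposure for those still behind.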
Online Ranking Formula – The ranking adjustment Δrk is proportional to the product of predicted GMV (pCTR × pCVR × price) and the RL control factor α, moderated by the original rank position via log(rk₀) and a stop‑loss parameter β to prevent efficiency degradation.
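In code form, the adjustment can be sketched as follows. The talk states the ingredients (predicted GMV = pCTR × pCVR × price, control factor α, a log(rk₀) position term, and a stop‑loss β) but not their exact composition, so how log(rk₀) moderates the product and how β caps the adjustment are assumptions here.

```python
import math

def rank_adjustment(p_ctr, p_cvr, price, alpha, rk0, beta=0.1):
    """Sketch of the online ranking adjustment. The exact way log(rk0) and the
    stop-loss beta enter the formula is assumed, not specified in the talk."""
    p_gmv = p_ctr * p_cvr * price                 # predicted GMV of one impression
    # Assumed form: damp adjustments via the original position (log(rk0 + 1)
    # avoids log(1) = 0 for the top slot).
    delta = alpha * p_gmv * math.log(rk0 + 1)
    # Stop-loss: cap the adjustment relative to predicted GMV so efficiency
    # cannot degrade past a fixed budget.
    cap = beta * p_gmv
    return max(-cap, min(cap, delta))
```

The cap plays the role described for β: however strongly the RL factor α pushes, the score change stays within a bounded fraction of the item's expected value.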
Offline Traffic Simulation – To address RL cold‑start, an offline simulator assumes pCTR depends only on position and pCVR only on the product, generating synthetic state‑action‑reward tuples for initial training. As real data accumulates, the training mix gradually shifts from simulated to real data.
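Under the two simplifying assumptions the talk describes (pCTR a function of position only, pCVR a function of the product only), a cold‑start simulator can be sketched as below. The specific CTR decay curve and CVR range are hypothetical.

```python
import random

def simulate_episode(n_positions=10, n_products=50, seed=0):
    """Generate synthetic (position, product, reward) tuples for cold-start
    training, under the talk's assumptions: CTR depends only on position,
    CVR only on the product. Curves and ranges below are hypothetical."""
    rng = random.Random(seed)
    pos_ctr = [0.3 / (i + 1) for i in range(n_positions)]            # position-decay CTR
    prod_cvr = {p: rng.uniform(0.01, 0.05) for p in range(n_products)}
    tuples = []
    for pos in range(n_positions):
        product = rng.randrange(n_products)       # product placed at this slot (the action)
        expected_orders = pos_ctr[pos] * prod_cvr[product]  # expected conversions = reward signal
        tuples.append((pos, product, expected_orders))
    return tuples
```

Seeding makes the replay reproducible; as the article notes, these synthetic tuples are gradually replaced by logged real data once the policy is live.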
Model Upgrades – Future improvements include hierarchical RL (high‑level hourly agents delegating to request‑level low‑level agents) and multi‑objective RL that simultaneously optimizes traffic targets, click‑through rate, conversion rate, add‑to‑cart rate, and efficiency. Weighting mechanisms (w₁, w₂) fuse multiple rewards into a single objective.
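The weighted fusion of multiple objectives (the w₁, w₂ mechanism) reduces to a scalarized reward. A minimal sketch, assuming a simple normalized weighted sum (the talk does not specify the fusion function):

```python
def fused_reward(rewards, weights):
    """Fuse per-objective rewards (traffic target, CTR, CVR, add-to-cart,
    efficiency) into one scalar via a normalized weighted sum. The exact
    fusion used in production is not specified in the talk."""
    assert len(rewards) == len(weights) and sum(weights) > 0
    return sum(w * r for w, r in zip(rewards, weights)) / sum(weights)
```

The planned "automated multi‑objective fusion" mentioned later would amount to learning or adapting these weights rather than hand‑tuning them.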
Future Directions – Planned enhancements focus on finer‑grained actions and personalized features, adaptive step‑size control for small‑target traffic, automated multi‑objective fusion, and refined ranking formulas that incorporate baseline (zero‑policy) comparisons.
Overall, the talk demonstrates how RL can continuously improve e‑commerce traffic allocation, delivering higher merchant satisfaction and platform revenue while maintaining user experience.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.