Multi‑Objective Deep Reinforcement Learning Framework for E‑commerce Traffic Allocation (MODRL‑TA)
MODRL‑TA is a multi‑objective deep reinforcement learning framework that unites independent Q‑learning agents, a cross‑entropy‑based decision‑fusion module, and progressive data augmentation to overcome cold‑start and multi‑objective trade‑offs in e‑commerce traffic allocation, delivering up to 18% more impressions, 4% higher CTR, and 5% higher CVR in live tests.
The paper, accepted at CIKM 2024, presents a multi‑objective deep reinforcement learning (DRL) framework designed to optimize traffic allocation on e‑commerce platforms. The framework, named MODRL‑TA, integrates multi‑objective Q‑learning (MOQ), a decision‑fusion algorithm based on the cross‑entropy method (DFM), and a progressive data‑augmentation system (PDA) to address the limitations of existing learning‑to‑rank and reinforcement‑learning approaches.
Background / Current Situation – Traffic control on modern e‑commerce platforms reallocates organic traffic by adjusting items' post‑ranking positions. Effective traffic control promotes merchant growth, satisfies customer demand, and maximizes platform‑wide benefits. Existing learning‑to‑rank methods ignore the long‑term value of traffic distribution, while standard RL methods struggle to balance multiple objectives and suffer from cold‑start issues on real‑world data.
Challenges – Heuristic methods focus on a single item’s gain and ignore interactions among items. Most prior work maximizes a single utility (e.g., click‑through rate) and treats multi‑objective reinforcement learning (MORL) as a static Pareto‑optimal problem, which cannot adapt to dynamic business priorities. Additionally, RL models face cold‑start problems due to sparse online data.
Solution – MODRL‑TA – The proposed framework consists of three components:
Multi‑Objective Q‑Learning (MOQ) : multiple independent RL models, each optimizing a specific objective (e.g., CTR, conversion). Each model outputs a Q‑value for the action of inserting an item at a particular position. The state includes user profile features, query attributes, historical user behavior, contextual item features, and aggregated feedback features.
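The per‑objective scoring step might look like the sketch below. The feature layout and the two linear stand‑in models are illustrative assumptions, not the paper's actual networks; the point is that each objective's model independently scores every candidate insertion position.

```python
import numpy as np

def score_positions(state, candidate_item, q_models, n_positions):
    """For each objective's Q-model, score the action of inserting
    `candidate_item` at every post-ranking position, given a state built
    from user, query, history, context, and feedback features."""
    features = np.concatenate([state, candidate_item])
    return {name: np.array([model(features, pos) for pos in range(n_positions)])
            for name, model in q_models.items()}

# Hypothetical linear Q-models standing in for the CTR and CVR networks.
rng = np.random.default_rng(0)
w_ctr, w_cvr = rng.normal(size=6), rng.normal(size=6)
q_models = {
    "ctr": lambda x, pos: float(x @ w_ctr) - 0.10 * pos,  # earlier slots score higher
    "cvr": lambda x, pos: float(x @ w_cvr) - 0.05 * pos,
}
scores = score_positions(state=np.ones(4), candidate_item=np.ones(2),
                         q_models=q_models, n_positions=5)
```

Each entry of `scores` is one objective's Q‑value vector over positions; downstream fusion decides which position actually wins.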
Decision‑Fusion Algorithm (DFM) : a cross‑entropy method (CEM)‑based module that dynamically adjusts the weights of different objectives to maximize long‑term value. The combined gain is a weighted sum of the per‑objective Q‑values, G(s, a) = Σ_i w_i · Q_i(s, a), with the weights w_i tuned by CEM.
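A generic CEM weight search can be sketched as follows. The fitness function, population size, and iteration count here are illustrative assumptions, not the paper's configuration; in practice the fitness would be an estimate of long‑term platform value.

```python
import numpy as np

def cem_fuse_weights(n_obj, fitness, n_iter=20, pop=50, elite_frac=0.2, seed=0):
    """Cross-entropy method: iteratively sample objective-weight vectors
    from a Gaussian, keep the elite fraction by fitness, and refit the
    Gaussian to the elites until it concentrates on good weights."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.ones(n_obj) / n_obj, np.ones(n_obj)  # initial distribution
    n_elite = int(pop * elite_frac)
    for _ in range(n_iter):
        w = rng.normal(mu, sigma, size=(pop, n_obj))
        w = np.abs(w) / np.abs(w).sum(axis=1, keepdims=True)  # project to simplex
        scores = np.array([fitness(wi) for wi in w])
        elites = w[np.argsort(scores)[-n_elite:]]             # top performers
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6  # refit
    return mu

# Toy fitness preferring objective 1 over objective 0 (assumption for illustration).
w_star = cem_fuse_weights(n_obj=2, fitness=lambda w: w[1] - w[0])
```

The returned `w_star` then weights the per‑objective Q‑values in the combined gain G(s, a) = Σ_i w_i · Q_i(s, a).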
Progressive Data‑Augmentation (PDA) : during the cold‑start phase, simulated offline logs are used to train MOQ. As real‑world interactions accumulate, PDA progressively replaces simulated data with authentic online data, mitigating distribution shift and cold‑start problems.
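A minimal sketch of the progressive mixing schedule; the linear decay and batch composition are assumptions for illustration, since the paper only specifies that simulated logs are progressively replaced by authentic online data.

```python
import random

def mix_batch(simulated, real, step, warmup_steps, batch_size, rng=random):
    """Sample a training batch whose simulated-data share decays linearly
    from 1.0 at step 0 to 0.0 after `warmup_steps`, so the policy shifts
    gradually from offline simulation to the online distribution."""
    sim_frac = max(0.0, 1.0 - step / warmup_steps)
    n_sim = round(batch_size * sim_frac)
    batch = rng.sample(simulated, n_sim) + rng.sample(real, batch_size - n_sim)
    rng.shuffle(batch)
    return batch, sim_frac

sim_logs = [("sim", i) for i in range(1000)]
real_logs = [("real", i) for i in range(1000)]
batch, frac = mix_batch(sim_logs, real_logs, step=750, warmup_steps=1000, batch_size=32)
```

At three quarters of the warm‑up, a quarter of each batch is still simulated; after warm‑up, training runs purely on real logs.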
Training – Each MOQ model uses a Deep Q‑Network (DQN) as the base learner, with independent evaluation and target networks (θ_i and θ_T_i) to stabilize learning. The loss for each model is the squared temporal‑difference error L_i(θ_i) = E[(r_i + γ · max_{a′} Q_i(s′, a′; θ_T_i) − Q_i(s, a; θ_i))²], and the total loss sums the losses of all objectives.
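The per‑objective TD loss can be illustrated with a small numpy sketch; the toy batch values are invented, and the frozen target network's outputs are passed in precomputed.

```python
import numpy as np

def td_loss(q_eval, q_target_next, actions, rewards, gamma=0.99):
    """Per-objective DQN loss: squared TD error between the evaluation
    network's Q(s, a; theta_i) for the taken actions and the bootstrapped
    target r_i + gamma * max_a' Q(s', a'; theta_T_i) from the target network."""
    q_sa = q_eval[np.arange(len(actions)), actions]       # Q(s, a) for taken actions
    target = rewards + gamma * q_target_next.max(axis=1)  # Bellman target
    return np.mean((target - q_sa) ** 2)

# Toy batch: 2 transitions, 3 candidate insertion positions (the action space).
q_eval = np.array([[1.0, 2.0, 0.5], [0.2, 0.1, 0.9]])
q_next = np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 0.0]])
loss = td_loss(q_eval, q_next, actions=np.array([1, 2]), rewards=np.array([1.0, 0.5]))
```

The total training loss would sum `td_loss` over all objectives, one term per independent MOQ model.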
Offline Evaluation – Against the MORL‑FR baseline (CTR Reward 5.88, CVR Reward 0.63), MODRL‑TA trained on 100% simulated data already outperforms the baseline. With 100% real data, CTR Reward remains stable while CVR Reward rises to 0.97. The fully optimized MODRL‑TA achieves the best results (CTR Reward 12.20, CVR Reward 2.25), demonstrating the framework's effectiveness on e‑commerce metrics.
Online A/B Test – A two‑week online experiment compared MODRL‑TA against the PID algorithm. MODRL‑TA increased impressions by up to 18.0%, CTR by up to 4.2%, and CVR by up to 5.1%. The model now serves roughly 600 million daily active users.
Future Outlook – Further research will focus on more refined algorithmic designs, stronger computational resources, and stable integration of multi‑objective learning in dynamic environments.
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.