Design and Implementation of a Home‑Page Recommendation System Using Reinforcement Learning and DPP
This article presents the design of Zhuanzhuan's home‑page recommendation pipeline. It covers the system architecture, the challenges of traffic efficiency and diversity, and a two‑stage solution: Proximal Policy Optimization (PPO) reinforcement learning in the re‑ranking module, and Determinantal Point Process (DPP) optimization in the coarse‑ranking and traffic‑pool stages. Offline simulation, online deployment, and evaluation metrics round out the discussion.
1. Business Introduction
The e‑commerce platform Zhuanzhuan follows the "more, faster, better, cheaper" principle, emphasizing cost‑effectiveness and official verification to improve reputation and sales. The home‑page recommendation is a key scenario for delivering diverse, high‑quality items to users.
1.1 Business Background
Zhuanzhuan aims to provide abundant inventory for a green circular economy, requiring a recommendation system that balances quantity, quality, and cost.
1.2 System Architecture
The recommendation pipeline consists of a trigger module, recall module, coarse‑ranking module, fine‑ranking module, and re‑ranking module. The ZZ‑Rerank (re‑ranking) and ZZ‑Rank (coarse‑ranking and traffic‑pool) components are highlighted.
2. Challenges in the Home‑Page Scenario
The home page receives the highest traffic, demanding efficient capture and fair distribution of impressions across heterogeneous categories. Challenges include:
Different residual values and material counts across categories (e.g., second‑hand phones vs. other items).
Official verification constraints limit material growth for some categories.
Large variance in user actions (exposure, click, order) across categories, affecting downstream metrics.
Maintaining diversity under strict official‑verification image standards.
3. Solution Overview
3.1 Overall Design
A two‑stage approach is proposed: Stage 1 applies reinforcement learning in the re‑ranking module; Stage 2 adopts a DPP‑based strategy in the coarse‑ranking/traffic‑pool module. Both stages aim to improve comparability of traffic efficiency while preserving relevance and diversity.
3.2 Choice of Reinforcement Learning Method
Proximal Policy Optimization (PPO) is selected because it adapts well to dynamic environments and suits the exploration needs of the home‑page scenario. Although PPO is fundamentally an on‑policy method, its clipped importance‑sampling objective allows each batch of collected trajectories to be reused for several gradient updates, improving sample efficiency.
The PPO workflow consists of eight steps: (1) collect (state, action, reward) tuples; (2) compute discounted returns; (3) evaluate advantages; (4) update the critic; (5) compute importance weights between old and new policies; (6) update the actor; (7) periodically copy the new actor to the old actor; (8) repeat.
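Steps (2) and (5)–(6) above can be sketched in a few lines; this is a minimal illustration of the discounted‑return computation and the clipped surrogate objective, not Zhuanzhuan's production code, and all names are illustrative:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Step 2: compute discounted returns G_t = r_t + gamma * G_{t+1}."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return np.array(out[::-1])

def ppo_clip_objective(new_probs, old_probs, advantages, eps=0.2):
    """Steps 5-6: importance ratio between the new and old policies,
    fed into PPO's clipped surrogate objective (to be maximized)."""
    ratio = new_probs / old_probs
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()
```

The `np.minimum` of the clipped and unclipped terms is what keeps each update close to the old policy, which is why step (7) only needs to sync the old actor periodically rather than after every gradient step.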
3.3 DPP Algorithm Principle
DPP (Determinantal Point Process) transforms the combinatorial probability of a subset into the determinant of a kernel matrix L, enabling efficient joint modeling of relevance (user‑item match) and diversity (item‑item dissimilarity). The kernel L is positive semi‑definite and can be factorized as L = B·Bᵀ, where each column of Bᵀ is an item's relevance score multiplied by its (normalized) embedding.
The MAP objective maximizes log‑det(L_S) for a selected subset S, which is a submodular maximization problem solved efficiently by a greedy algorithm with Cholesky updates.
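The greedy procedure can be sketched as follows. This is a simplified rendition of fast greedy MAP inference with incremental Cholesky‑style updates, for illustration only:

```python
import numpy as np

def dpp_greedy_map(L, k):
    """Greedy MAP inference for a DPP kernel L (N x N, PSD):
    repeatedly add the item with the largest marginal gain in
    log det(L_S), maintained via incremental Cholesky-style updates."""
    N = L.shape[0]
    c = np.zeros((k, N))                   # growing rows of the Cholesky factor
    d2 = np.diag(L).astype(float).copy()   # squared marginal gains d_i^2
    selected = [int(np.argmax(d2))]
    for t in range(k - 1):
        j = selected[-1]
        # incremental update: e_i = (L[j, i] - <c_j, c_i>) / sqrt(d2[j])
        e = (L[j] - c[:t].T @ c[:t, j]) / np.sqrt(d2[j])
        c[t] = e
        d2 = d2 - e ** 2
        d2[selected] = -np.inf             # never reselect chosen items
        selected.append(int(np.argmax(d2)))
    return selected
```

On a toy kernel where items 0 and 1 are identical and item 2 is orthogonal, the greedy step picks item 0 first and then jumps to item 2, since adding the duplicate would drive det(L_S) to zero.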
3.4 Detailed Implementation
3.4.1 Stage 1 – Re‑ranking with Reinforcement Learning
The recommendation request is modeled as a constrained Markov Decision Process (CMDP) with:
State S: 128‑dimensional vector (user features, item features, context).
Action A: 10 × (1+5+N) dimensional weight vector adjusting category and item preferences.
Reward R: vector of click‑based rewards.
Discount factor γ = 0.99.
Constraints C: diversity requirements from the DPP component.
The objective is to maximize cumulative reward.
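The state/action interface above can be made concrete with a minimal actor stub. The dimensions follow the spec; everything else (the linear form, the value of N) is an illustrative assumption, not the production model:

```python
import numpy as np

STATE_DIM = 128           # user + item + context features, per the spec
N_OTHER = 4               # illustrative value for N in the (1+5+N) split
N_CATS = 1 + 5 + N_OTHER  # major category slots
ACTION_DIM = 10 * N_CATS  # 10 x (1+5+N) weight vector

def act(theta, state):
    """Linear actor stub: map the 128-d state to a 10 x (1+5+N)
    weight matrix that adjusts category and item preference scores."""
    assert state.shape == (STATE_DIM,)
    return (theta @ state).reshape(10, N_CATS)
```

In the CMDP framing, the critic would score this state, and the DPP diversity constraints C would bound which weight adjustments are admissible.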
3.4.2 Stage 2 – Traffic‑Pool with DPP Components
Each major category (1+5+N) deploys an independent DPP module. The kernel matrix L is simplified using one‑hot vectors for sub‑categories, yielding a block‑diagonal structure that reduces MAP inference complexity from O(N³) to O(NM).
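Under the one‑hot simplification, items in different sub‑categories have zero kernel similarity, so L is block‑diagonal and MAP inference decomposes into independent per‑block selections. A hedged sketch of that decomposition (the per‑block quota rule here is illustrative, not the production logic):

```python
def blockwise_select(scores, subcats, quota):
    """Block-diagonal DPP selection sketch: because cross-category
    kernel entries are zero, the global MAP problem splits into one
    small problem per sub-category; each block here simply keeps its
    top-`quota` items by relevance score."""
    chosen = []
    for cat in dict.fromkeys(subcats):  # iterate sub-categories in first-seen order
        block = [i for i, c in enumerate(subcats) if c == cat]
        block.sort(key=lambda i: scores[i], reverse=True)
        chosen.extend(block[:quota])
    return sorted(chosen)
```

Because each block is handled independently, the cost scales with the number of items per block times the selection size, which is the intuition behind the O(N³) → O(NM) reduction stated above.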
3.5 Offline Simulation and Online Deployment
Logs from the coarse‑ranking, traffic‑pool, and fine‑ranking stages are stored offline, and the online re‑ranking rules are reproduced in the offline environment. One week of exposure logs, sampled at 15‑minute intervals, drives a simulator that executes 96 decision steps per day. Evaluation metrics include click‑through rate (CTR) and cumulative reward.
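The 96‑step figure follows directly from the sampling interval: 24 h / 15 min = 96. A tiny sketch of one simulated day's decision schedule (function name illustrative):

```python
from datetime import datetime, timedelta

def daily_decision_steps(day_start):
    """One simulated day: a decision step every 15 minutes, 96 in total."""
    return [day_start + timedelta(minutes=15 * i) for i in range(96)]
```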
4. Summary
The two‑stage solution optimizes relevance and diversity: Stage 1 uses PPO to align category‑level pCTR scores, while Stage 2 leverages DPP to enforce diversity in the traffic‑pool and coarse‑ranking stages. The approach demonstrates how reinforcement learning and determinantal point processes can be combined to improve large‑scale e‑commerce recommendation systems.
Zhuanzhuan Tech
A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.