
Design and Implementation of a Home‑Page Recommendation System Using Reinforcement Learning and DPP

This article presents the design of Zhuanzhuan's home‑page recommendation pipeline. It covers the system architecture, the challenges of traffic efficiency and diversity, and a two‑stage solution: Proximal Policy Optimization (PPO) reinforcement learning in the re‑ranking module, and Determinantal Point Process (DPP) optimization in the coarse‑ranking and traffic‑pool stages. Offline simulation, online deployment, and evaluation metrics are also discussed.

Zhuanzhuan Tech

1. Business Introduction

The e‑commerce platform Zhuanzhuan follows the "more, faster, better, cheaper" principle, emphasizing cost‑effectiveness and official verification to build reputation and drive sales. Home‑page recommendation is the key scenario for delivering diverse, high‑quality items to users.

1.1 Business Background

Zhuanzhuan aims to provide abundant inventory for a green circular economy, requiring a recommendation system that balances quantity, quality, and cost.

1.2 System Architecture

The recommendation pipeline consists of a trigger module, recall module, coarse‑ranking module, fine‑ranking module, and re‑ranking module. The ZZ‑Rerank (re‑ranking) and ZZ‑Rank (coarse‑ranking and traffic‑pool) components are highlighted.

2. Challenges in the Home‑Page Scenario

The home page receives the highest traffic on the platform, so impressions must be captured efficiently and distributed fairly across heterogeneous categories. Challenges include:

Different residual values and material counts across categories (e.g., second‑hand phones vs. other items).

Official verification constraints limit material growth for some categories.

Large variance in user actions (exposure, click, order) across categories, affecting downstream metrics.

Maintaining diversity under strict official‑verification image standards.

3. Solution Overview

3.1 Overall Design

A two‑stage approach is proposed: Stage 1 applies reinforcement learning in the re‑ranking module; Stage 2 adopts a DPP‑based strategy in the coarse‑ranking/traffic‑pool module. Both stages aim to improve comparability of traffic efficiency while preserving relevance and diversity.

3.2 Choice of Reinforcement Learning Method

Proximal Policy Optimization (PPO) is selected because it adapts well to dynamic environments and suits the exploration needs of home‑page recommendation. Although PPO is fundamentally on‑policy, its importance‑sampling ratio between the old and new policies lets each batch of collected samples be reused for several update epochs, improving data efficiency while the clipped objective keeps updates stable.

The PPO workflow consists of eight steps: (1) collect (state, action, reward) tuples; (2) compute discounted returns; (3) evaluate advantages; (4) update the critic; (5) compute importance weights between old and new policies; (6) update the actor; (7) periodically copy the new actor to the old actor; (8) repeat.
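The numerical core of these steps (discounted returns, advantages, and the clipped importance‑weighted objective) can be sketched as follows. The trajectory, critic baseline, and log‑probabilities are toy values, not the production model:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # Step 2: G_t = r_t + gamma * G_{t+1}, computed backwards over the episode
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return np.array(out[::-1])

def ppo_clipped_objective(old_logp, new_logp, advantages, eps=0.2):
    # Step 5: importance weight between old and new policies
    ratio = np.exp(new_logp - old_logp)
    # Step 6: clipped surrogate objective used to update the actor
    return np.mean(np.minimum(ratio * advantages,
                              np.clip(ratio, 1 - eps, 1 + eps) * advantages))

# Step 1: a toy trajectory; only the reward component is shown here
rewards = [1.0, 0.0, 1.0]
returns = discounted_returns(rewards)
# Step 3: advantages = returns minus a critic baseline (fixed toy values;
# step 4 would fit the critic to the returns)
baseline = np.array([0.5, 0.5, 0.5])
adv = returns - baseline
obj = ppo_clipped_objective(old_logp=np.log([0.5, 0.5, 0.5]),
                            new_logp=np.log([0.6, 0.4, 0.7]),
                            advantages=adv)
# Steps 7-8: periodically copy the new actor to the old actor and repeat
```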

3.3 DPP Algorithm Principle

DPP (Determinantal Point Process) transforms the combinatorial probability of a subset into a determinant of a kernel matrix L, enabling efficient modeling of relevance (user‑item match) and diversity (item‑item dissimilarity). The kernel L is positive‑semi‑definite and can be factorized as L = B·Bᵀ, where each column combines relevance scores and item embeddings.
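A minimal sketch of this factorization, with randomly generated relevance scores and unit embeddings standing in for real model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
relevance = rng.uniform(0.5, 1.0, size=n)           # user-item match scores
emb = rng.normal(size=(n, d))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # unit item embeddings

B = relevance[:, None] * emb   # each row combines relevance and embedding
L = B @ B.T                    # positive-semi-definite kernel: L = B B^T

# the probability of selecting subset S is proportional to det(L_S)
S = [0, 2, 4]
score = np.linalg.det(L[np.ix_(S, S)])
```

Similar items have highly correlated rows of B, which shrinks det(L_S) toward zero, so the determinant jointly rewards relevance and penalizes redundancy.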

The MAP objective maximizes log‑det(L_S) for a selected subset S, which is a submodular maximization problem solved efficiently by a greedy algorithm with Cholesky updates.
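The greedy selection with incremental Cholesky‑style updates can be sketched as follows; the kernel is supplied by the caller, and the function names are illustrative:

```python
import numpy as np

def dpp_greedy_map(L, k):
    """Greedily pick up to k items maximizing log det(L_S), using
    incremental Cholesky-style updates instead of recomputing determinants."""
    n = L.shape[0]
    c = np.zeros((n, k))               # per-item update vectors
    d2 = np.diag(L).astype(float)      # marginal gain of adding item i
    selected = []
    for t in range(k):
        avail = [i for i in range(n) if i not in selected]
        j = max(avail, key=lambda i: d2[i])
        if d2[j] <= 1e-12:             # no remaining item adds volume
            break
        selected.append(j)
        dj = np.sqrt(d2[j])
        for i in avail:                # update the marginal gains in O(t) each
            if i == j:
                continue
            e = (L[j, i] - c[i, :t] @ c[j, :t]) / dj
            c[i, t] = e
            d2[i] -= e * e
    return selected
```

On a kernel with two duplicate items, selecting one drives the duplicate's marginal gain to zero, so the greedy picks the orthogonal item next.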

3.4 Detailed Implementation

3.4.1 Stage 1 – Re‑ranking with Reinforcement Learning

The recommendation request is modeled as a constrained Markov Decision Process (CMDP) with:

State S: 128‑dimensional vector (user features, item features, context).

Action A: 10 × (1+5+N) dimensional weight vector adjusting category and item preferences.

Reward R: vector of click‑based rewards.

Discount factor γ = 0.99.

Constraints C: diversity requirements from the DPP component.

The objective is to maximize cumulative reward.
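The CMDP ingredients above can be sketched as plain data containers; `N` here is a hypothetical number of long‑tail categories, and the reward aggregation is illustrative rather than the production definition:

```python
import numpy as np
from dataclasses import dataclass

N = 4                 # hypothetical count of long-tail categories
N_CATS = 1 + 5 + N    # the "1+5+N" category layout from the design

@dataclass
class Step:
    state: np.ndarray    # 128-dim vector: user, item, and context features
    action: np.ndarray   # (10, N_CATS) weights adjusting category/item preferences
    reward: np.ndarray   # click-based reward vector

GAMMA = 0.99

def cumulative_reward(steps):
    # the CMDP objective: maximize the discounted sum of rewards,
    # subject to the DPP diversity constraints handled in Stage 2
    return sum(GAMMA ** t * s.reward.sum() for t, s in enumerate(steps))
```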

3.4.2 Stage 2 – Traffic‑Pool with DPP Components

Each major category (1+5+N) deploys an independent DPP module. The kernel matrix L is simplified using one‑hot vectors for sub‑categories, yielding a block‑diagonal structure that reduces MAP inference complexity from O(N³) to O(NM).
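A small sketch of why one‑hot sub‑category features decouple the problem; the relevance scores and sub‑category ids below are made up. With pure one‑hot features, items from different sub‑categories have zero kernel entries, so the kernel is block‑diagonal and MAP inference runs independently per block; within a block the kernel is rank‑1, so selection reduces to taking the top‑relevance item:

```python
import numpy as np

# toy items: relevance scores and sub-category ids (one-hot diversity features)
relevance = np.array([0.9, 0.8, 0.7, 0.95, 0.6])
subcat    = np.array([0,   0,   1,   1,    2])

onehot = np.eye(subcat.max() + 1)[subcat]
B = relevance[:, None] * onehot
L = B @ B.T   # block-diagonal once items are grouped by sub-category

def select_per_block(relevance, subcat, per_block=1):
    # per-block MAP: pick the highest-relevance item(s) in each sub-category
    chosen = []
    for c in np.unique(subcat):
        idx = np.where(subcat == c)[0]
        chosen.extend(idx[np.argsort(-relevance[idx])][:per_block].tolist())
    return sorted(chosen)
```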

3.5 Offline Simulation and Online Deployment

Offline logs from the coarse‑ranking, traffic‑pool, and fine‑ranking stages are stored, and the online re‑ranking rules are reproduced offline. One week of exposure logs, bucketed into 15‑minute intervals, drives a simulator that executes 96 decision steps per day. Metrics include click‑through rate (CTR) and cumulative reward.
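The replay loop can be sketched as follows. The log sampling is replaced here by synthetic draws, and the way a policy weight nudges CTR is a made‑up stand‑in for the reproduced re‑ranking rules:

```python
import numpy as np

rng = np.random.default_rng(42)
STEPS_PER_DAY = 96   # one decision step per 15-minute bucket

def simulate_day(policy_weight):
    """Replay one simulated day of exposure logs under a candidate policy,
    returning the day's CTR and cumulative (click-based) reward."""
    clicks = impressions = 0
    cum_reward = 0.0
    for _ in range(STEPS_PER_DAY):
        n_impr = int(rng.integers(50, 100))          # impressions this bucket
        # hypothetical response model: policy weight scales the base CTR
        ctr = float(np.clip(0.03 * policy_weight, 0.0, 1.0))
        n_click = rng.binomial(n_impr, ctr)
        impressions += n_impr
        clicks += n_click
        cum_reward += n_click                        # reward = clicks here
    return clicks / impressions, cum_reward
```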

4. Summary

The two‑stage solution optimizes relevance and diversity: Stage 1 uses PPO to align category‑level pCTR scores, while Stage 2 leverages DPP to enforce diversity in the traffic‑pool and coarse‑ranking stages. The approach demonstrates how reinforcement learning and determinantal point processes can be combined to improve large‑scale e‑commerce recommendation systems.


Tags: e-commerce, machine learning, recommendation, ranking, reinforcement learning, DPP
Written by

Zhuanzhuan Tech

A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.
