ResAct: A Reinforcement Learning Approach for Long-Term User Retention in Sequential Recommendation
The paper introduces ResAct, a reinforcement-learning framework for improving long-term user retention in sequential recommendation. ResAct constrains the policy search space to the vicinity of the online-serving policy and combines a conditional variational auto-encoder, a residual actor, and a state-action value network, achieving significant gains over existing methods on a large-scale short-video dataset.
Optimizing user retention is crucial for sequential recommendation systems, especially in short‑video platforms where it directly impacts core metrics such as daily active users (DAU) and dwell time. Existing recommendation algorithms mainly target short‑term objectives like click‑through rate, leaving long‑term retention insufficiently addressed.
Reinforcement learning (RL) is well‑suited for optimizing long‑term rewards, but traditional RL methods (e.g., DDPG, TD3) face challenges in this domain: user habit changes occur over long periods, exploration is costly and may degrade user experience, and retention‑related feedback is sparse, appearing only after a session ends.
To overcome these issues, the authors propose ResAct, an innovative RL method that restricts the policy search space to the vicinity of the online‑serving policy, thereby reducing training difficulty. ResAct consists of three modules—reconstruction, prediction, and selection—each implemented with neural networks.
In the reconstruction module, a conditional variational auto‑encoder (CVAE) encodes an action (video feature vector) into a latent space and decodes it back, enabling action reconstruction via latent sampling. The prediction module employs a residual actor composed of high‑level and low‑level state encoders and a residual sub‑actor that takes the reconstructed action and state features to output an action residual. The selection module uses a state‑action value network (a multilayer perceptron) to evaluate the expected return of each candidate action and selects the one with the highest value.
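A minimal NumPy sketch of this reconstruct-predict-select flow may help. All dimensions, weights, and the single linear "networks" below are illustrative stand-ins, not the paper's architectures; only the three-stage shape (decode latent samples into candidate actions, add a residual correction, pick the highest-valued candidate) mirrors ResAct:

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_DIM, LATENT_DIM, STATE_DIM, N_CANDIDATES = 8, 4, 6, 5

# Reconstruction: a CVAE-style decoder maps latent samples (conditioned
# on the state) to candidate actions near the online-serving policy.
W_dec = rng.normal(size=(LATENT_DIM + STATE_DIM, ACTION_DIM))

def reconstruct(state, n):
    z = rng.normal(size=(n, LATENT_DIM))                    # sample latents
    cond = np.concatenate([z, np.tile(state, (n, 1))], axis=1)
    return np.tanh(cond @ W_dec)                            # n candidate actions

# Prediction: a residual actor outputs a small correction to each
# reconstructed action, given state features.
W_res = rng.normal(size=(STATE_DIM + ACTION_DIM, ACTION_DIM)) * 0.1

def add_residual(state, actions):
    feats = np.concatenate([np.tile(state, (len(actions), 1)), actions], axis=1)
    return actions + np.tanh(feats @ W_res)                 # action + residual

# Selection: a state-action value network scores each candidate; take the argmax.
W_q = rng.normal(size=(STATE_DIM + ACTION_DIM,))

def select(state, actions):
    q = np.concatenate([np.tile(state, (len(actions), 1)), actions], axis=1) @ W_q
    return actions[np.argmax(q)], q

state = rng.normal(size=STATE_DIM)
candidates = add_residual(state, reconstruct(state, N_CANDIDATES))
best_action, q_values = select(state, candidates)
```

Sampling several latents per request is what lets the selection stage matter: each sample yields a slightly different candidate near the online policy, and the value network arbitrates among them.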
The sequential recommendation problem is modeled as a Markov decision process (MDP) with states (user features and session information), actions (item embeddings), a reward function based on return time and session length, and transition dynamics. States are split into high‑level and low‑level components to capture hierarchical user behavior.
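One plausible shape for a reward built from return time and session length is sketched below. The exact functional form and weights are the paper's design choices; the linear combination and the coefficients here are purely illustrative:

```python
def retention_reward(return_time_hours: float, session_length: float,
                     alpha: float = 1.0, beta: float = 0.1) -> float:
    """Illustrative retention reward: shorter return time and longer
    sessions yield higher reward. alpha and beta are hypothetical
    trade-off weights, not values from the paper."""
    return -alpha * return_time_hours + beta * session_length

# Returning sooner is rewarded; so is staying longer.
r_fast = retention_reward(return_time_hours=5.0, session_length=20)
r_slow = retention_reward(return_time_hours=24.0, session_length=20)
```

Because the return-time term only materializes after a session ends, this is exactly the sparse, delayed signal that motivates an RL formulation over per-click objectives.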
Training follows an actor‑critic paradigm: the CVAE is trained with reconstruction loss, while the residual actor and value network are updated via policy‑gradient and temporal‑difference errors, respectively. Mutual‑information‑based regularization encourages the high‑level state encoder to capture information relevant to long‑term rewards while keeping the representation compact.
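The actor-critic update can be sketched with linear function approximators. This is a generic deterministic actor-critic step, not ResAct's full training loop (it omits the CVAE reconstruction loss and the mutual-information regularizers); all dimensions and learning rates are illustrative:

```python
import numpy as np

STATE_DIM, ACTION_DIM, GAMMA, LR = 4, 2, 0.9, 0.01

w_q = np.zeros(STATE_DIM + ACTION_DIM)    # linear critic: Q(s, a) = w · [s; a]
W_pi = np.zeros((STATE_DIM, ACTION_DIM))  # linear actor:  a = s @ W_pi

def q(s, a):
    return np.concatenate([s, a]) @ w_q

def pi(s):
    return s @ W_pi

def actor_critic_step(s, a, r, s_next):
    global w_q, W_pi
    # Critic update: move Q(s, a) toward the bootstrapped TD target.
    td_err = r + GAMMA * q(s_next, pi(s_next)) - q(s, a)
    w_q += LR * td_err * np.concatenate([s, a])
    # Actor update: deterministic policy gradient, dQ/da chained through the actor.
    W_pi += LR * np.outer(s, w_q[STATE_DIM:])
    return td_err

# Replay one fixed transition; the TD error shrinks as the critic fits it.
s, s_next, r = np.full(STATE_DIM, 0.5), np.zeros(STATE_DIM), 1.0
errors = [actor_critic_step(s, pi(s), r, s_next) for _ in range(100)]
```

In ResAct the critic plays a double role: it drives the residual actor's gradient during training and scores reconstructed candidates at serving time.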
Experiments on a massive Kuaishou recommendation dataset (millions of sessions, tens of millions of requests) demonstrate that ResAct consistently outperforms strong baselines—including DDPG, TD3, offline deep‑learning methods (TD3_BC, IQL), and imitation‑learning approaches—by at least 10% across metrics such as return time and session length.
The authors acknowledge challenges such as the computational overhead of the three‑stage pipeline and the reliance on offline data; future work will explore online learning integration and latency‑aware optimizations to make ResAct viable in real‑time recommendation scenarios.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.