
Reinforcement Learning for User Retention (RLUR) in Short Video Recommendation Systems

This paper presents RLUR, a reinforcement‑learning algorithm that models user‑retention optimization as an infinite‑horizon request‑based Markov Decision Process, addressing uncertainty, bias, and delayed reward challenges to directly improve retention, DAU, and engagement in short‑video recommendation platforms.


The core goal of short‑video recommendation systems is to increase user retention and drive DAU growth, but retention is a long‑term feedback signal that cannot be directly optimized by traditional point‑wise or list‑wise models.

To address this, the authors formulate retention optimization as an infinite‑horizon request‑based Markov Decision Process (MDP), where the recommender acts as the agent and each user request is a step. At each step, the policy emits an action vector that aggregates multiple short‑term feedback predictions (watch time, likes, follows, comments, shares) into a single score used to rank candidate videos.
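As a concrete (and deliberately simplified) illustration of the ranking step, a linear weighted sum is one natural reading of "aggregates multiple short‑term feedback predictions"; the paper does not spell out the exact aggregation function, so the form below and all numbers are assumptions:

```python
import numpy as np

def rank_videos(action, feedback_preds):
    """Score candidates by aggregating short-term feedback predictions
    with the policy's action (weight) vector, then rank.

    action: (K,) weights over K feedback signals (assumed linear form).
    feedback_preds: (N, K) predicted watch time, like, follow, comment,
        and share signals for N candidate videos.
    Returns candidate indices sorted by descending aggregated score.
    """
    scores = feedback_preds @ action        # one scalar score per video
    return np.argsort(-scores)

# Toy example: 3 candidates, 5 feedback signals.
preds = np.array([
    [0.9, 0.1, 0.0, 0.2, 0.1],  # long watch time, few interactions
    [0.3, 0.8, 0.4, 0.5, 0.3],  # shorter watch, rich interactions
    [0.1, 0.1, 0.1, 0.1, 0.1],
])
action = np.array([1.0, 0.5, 0.5, 0.5, 0.5])  # emitted by the actor
order = rank_videos(action, preds)            # → [1, 0, 2]
```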

Applying reinforcement learning to retention faces three main challenges: (1) uncertainty due to external factors, (2) bias across user activity levels, and (3) instability because retention rewards are delayed by hours or days.

The proposed Reinforcement Learning for User Retention (RLUR) algorithm tackles these challenges. RLUR estimates cumulative returning time using a DDPG‑style temporal‑difference learner, introduces heuristic rewards (short‑term feedback and intrinsic rewards from a Random Network Distillation network), and employs a separate critic to combine them.
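The temporal‑difference step behind the returning‑time critic can be sketched as follows. The linear critic, dimensions, learning rate, and reward value are illustrative stand‑ins for the paper's neural networks, not its actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 8, 5                    # assumed state / action dimensions
w = np.zeros(S + A)            # critic weights, Q(s, a) = w . [s; a]
w_tgt = w.copy()               # DDPG-style target-network copy
gamma, lr = 0.99, 0.01

def q(weights, s, a):
    return weights @ np.concatenate([s, a])

def td_update(s, a, r, s_next, a_next, done):
    """Move Q(s, a) toward the one-step TD target
    y = r + gamma * Q_tgt(s', a'), where r encodes returning time."""
    global w
    y = r + (0.0 if done else gamma * q(w_tgt, s_next, a_next))
    delta = q(w, s, a) - y
    w -= lr * delta * np.concatenate([s, a])  # grad of 0.5 * delta^2
    return delta

s, a = rng.standard_normal(S), rng.standard_normal(A)
s2, a2 = rng.standard_normal(S), rng.standard_normal(A)
err = td_update(s, a, r=-3.0, s_next=s2, a_next=a2, done=False)
```

Since the critic starts at zero, the first TD error equals the negated reward, and the update moves Q(s, a) toward it.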

To mitigate the uncertainty of the returning‑time reward, a regularization method uses a classification model to predict whether a user's returning time falls below a threshold, derives a lower bound on the expected returning time via Markov's inequality, and normalizes the reward by scaling it with the ratio of the actual returning time to this estimated lower bound.
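In code, this normalization might look like the sketch below. The classifier output, threshold, and exact scaling are assumptions; only the Markov‑inequality bound itself is standard:

```python
def regularized_reward(returning_time, p_below, threshold):
    """Normalize the returning-time signal by a per-user lower bound.

    A classifier supplies p_below = P(T < threshold). For nonnegative T,
    Markov's inequality gives E[T] >= threshold * P(T >= threshold),
    i.e. the lower bound lb = threshold * (1 - p_below); the observed
    returning time is then scaled by this bound. The exact reward
    shaping in the paper may differ; this is a sketch of the idea.
    """
    lower_bound = threshold * (1.0 - p_below)
    return returning_time / max(lower_bound, 1e-6)

# A user expected to return quickly (p_below = 0.9 within 24h) has a
# small lower bound, so an actual 12h return scores 12 / 2.4 = 5.
r = regularized_reward(returning_time=12.0, p_below=0.9, threshold=24.0)
```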

To reduce bias across user activity levels, separate policies are trained for high‑activity and low‑activity user groups, each on its own data stream; within each group, the actor learns to minimize returning time while maximizing the auxiliary rewards.
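Operationally, the split amounts to routing each request to its group's policy. The group names, activity threshold, and stand‑in actors below are purely illustrative:

```python
def make_policy(bias):
    """Stand-in for a trained actor that maps a state to an action."""
    return lambda state: [bias + s for s in state]

policies = {
    "high": make_policy(0.5),   # trained only on high-activity streams
    "low":  make_policy(-0.5),  # trained only on low-activity streams
}

def route(weekly_sessions, state, threshold=7):
    """Dispatch a request to the policy for the user's activity group."""
    group = "high" if weekly_sessions >= threshold else "low"
    return policies[group](state)

action = route(weekly_sessions=10, state=[0.0, 1.0])  # high-activity user
```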

To handle delayed rewards and training instability, a soft regularization coefficient multiplies the actor loss, acting as a brake that damps updates whenever the learned policy deviates strongly from the behavior (sample) policy.
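One way such a soft brake could be realized is an exponentially decaying coefficient on the actor loss; the exponential form and the temperature `tau` are assumptions, since the paper's exact coefficient is not reproduced here:

```python
import math

def braked_actor_loss(actor_loss, policy_action, logged_action, tau=1.0):
    """Scale the actor loss by a soft coefficient that shrinks as the
    policy's action drifts from the logged (sample) action, braking
    updates when the deviation is large. Exponential form and tau are
    illustrative; the paper's coefficient may be defined differently.
    """
    deviation = sum((p - l) ** 2
                    for p, l in zip(policy_action, logged_action))
    coeff = math.exp(-deviation / tau)  # 1 when on-policy, -> 0 far off
    return coeff * actor_loss

on_policy = braked_actor_loss(2.0, [0.1, 0.1], [0.1, 0.1])  # full loss
far_off   = braked_actor_loss(2.0, [3.0, 3.0], [0.0, 0.0])  # heavily braked
```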

Offline experiments on the public KuaiRand dataset compare RLUR with state‑of‑the‑art RL methods (TD3) and a black‑box optimizer (Cross‑Entropy Method). RLUR significantly outperforms baselines on returning time and secondary retention metrics, and ablation studies confirm the effectiveness of each component.

Online A/B tests in Kuaishou’s short‑video platform show that RLUR yields measurable gains in app‑open frequency, DAU, secondary retention, and 7‑day retention, with statistically significant improvements.

The paper concludes that RLUR successfully leverages reinforcement learning to directly optimize user retention, and suggests future work on offline RL and Decision Transformers for further gains.

Tags: user retention, recommendation system, reinforcement learning, short video, Kuaishou, RLUR
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
