Two-Stage Constrained Actor‑Critic (TSCAC) for Short‑Video Recommendation
The paper models short‑video recommendation as a constrained Markov decision process and introduces a two‑stage constrained actor‑critic algorithm that jointly maximizes watch time while satisfying multiple interaction constraints, demonstrating superior offline and online performance on the KuaiRand dataset and Kuaishou app.
In short‑video recommendation, users interact by scrolling through videos and provide two types of feedback: watch time and interaction signals (likes, follows, comments, shares, etc.). Because watch time is closely linked to retention and daily active users (DAU), the primary optimization goal is to increase total watch time, while interaction metrics reflect user satisfaction and should be kept within constraints.
The authors formulate the recommendation problem as a Constrained Markov Decision Process (CMDP) that maximizes watch time under interaction constraints. Existing constrained RL methods are ill‑suited here for two reasons: with a single shared critic, the dense watch‑time signal tends to dominate the value estimates for the sparser interaction signals, and searching Lagrangian multipliers for multiple constraints simultaneously is prohibitively expensive.
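In generic CMDP notation (the symbols below are illustrative, not necessarily the paper's exact notation), the formulation can be written as:

```latex
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{\text{main}}(s_t, a_t)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{i}(s_t, a_t)\right] \ge C_i, \quad i = 1, \dots, m
```

where \(r_{\text{main}}\) is the watch‑time reward, \(r_i\) are the auxiliary interaction rewards (like, follow, comment, …), and \(C_i\) are the constraint thresholds.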
To address these challenges, they propose the Two‑Stage Constrained Actor‑Critic (TSCAC) algorithm. In Stage 1, separate policies are learned for each auxiliary interaction signal using distinct critics. In Stage 2, a policy is learned to maximize watch time while enforcing a distance constraint to the Stage 1 policies.
Stage 1 employs Temporal‑Difference (TD) loss for the critics and an advantage‑based loss for the actors.
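A minimal sketch of the Stage‑1 losses, in plain Python for clarity (function names, the scalar one‑step form, and the default discount factor are illustrative assumptions, not the paper's implementation):

```python
def td_critic_loss(v, v_next, reward, gamma=0.99):
    """Squared one-step TD error for one auxiliary critic.

    v, v_next: critic estimates V(s_t) and V(s_{t+1});
    reward: the auxiliary interaction signal observed at step t.
    """
    td_target = reward + gamma * v_next
    return (td_target - v) ** 2


def advantage_actor_loss(log_prob, v, v_next, reward, gamma=0.99):
    """Advantage-weighted policy-gradient loss for one auxiliary actor.

    The one-step advantage reuses the TD error; in practice it is treated
    as a constant (no gradient flows through the critic) when updating
    the actor, so a larger advantage upweights the chosen action.
    """
    advantage = reward + gamma * v_next - v
    return -advantage * log_prob
```

Each auxiliary signal gets its own critic/actor pair trained with these two losses, so no single critic has to trade off watch time against interactions.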
Stage 2 maximizes the main watch‑time reward while softly constraining the policy to stay close to the auxiliary policies. The optimal solution of the dual problem leads to a KL‑divergence regularization term.
The resulting loss minimizes the KL distance between the learned policy and the optimal solution, effectively weighting actions according to the auxiliary policies’ assessment of their importance.
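The weighting idea can be sketched as follows; this is a hypothetical simplification (the closed‑form target, the exponential advantage weight, and the per‑constraint exponents are assumptions for illustration, not the paper's exact loss):

```python
import math

def stage2_actor_loss(log_prob_main, advantage_main, aux_probs, lambdas):
    """Hypothetical sketch of the Stage-2 update.

    Assumes the optimal dual solution is proportional to
    exp(main advantage) times each auxiliary policy's probability for
    the action, raised to its constraint weight lambda_i. Minimizing KL
    to that target reduces to a weighted negative log-likelihood, so
    actions that the auxiliary (Stage-1) policies consider important
    receive larger gradient weight.
    """
    weight = math.exp(advantage_main)
    for p, lam in zip(aux_probs, lambdas):
        weight *= p ** lam  # auxiliary policies vote via their probabilities
    return -weight * log_prob_main
```

With a zero main advantage and auxiliary probabilities of 1, the loss degenerates to plain negative log‑likelihood; lowering an auxiliary probability shrinks the update for that action.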
Offline experiments on the public KuaiRand dataset compare TSCAC with Behavior Cloning Wide&Deep, DeepFM, the state‑of‑the‑art constrained RL method RCPO, and Pareto‑optimal recommendation. TSCAC achieves the highest watch‑time improvement and outperforms baselines on click, like, and comment metrics.
Online A/B tests on the Kuaishou short‑video platform use a Learning‑to‑Rank baseline. TSCAC shows statistically significant watch‑time gains (≈0.1%) over RCPO and improves all auxiliary interaction metrics, matching or surpassing an Interaction‑only Actor‑Critic.
The authors conclude that modeling recommendation as a constrained RL problem and applying the two‑stage TSCAC algorithm effectively balances the primary watch‑time objective with multiple interaction constraints, and they suggest extending TSCAC to other recommendation systems and deterministic policies as future work.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.