Two-Stage Constrained Actor‑Critic (TSCAC) for Short‑Video Recommendation
The paper models short‑video recommendation as a constrained Markov decision process and introduces a two‑stage constrained actor‑critic algorithm that jointly maximizes watch time while satisfying multiple interaction constraints, demonstrating superior offline and online performance on the KuaiRand dataset and Kuaishou app.
In short‑video recommendation, users interact by scrolling through videos and provide two types of feedback: watch time and interaction signals (likes, follows, comments, shares, etc.). Because watch time is closely linked to retention and daily active users (DAU), the primary optimization goal is to increase total watch time, while interaction metrics reflect user satisfaction and should be kept within constraints.
The authors formulate the recommendation problem as a Constrained Markov Decision Process (CMDP) that maximizes watch time under interaction constraints. Existing constrained RL methods are ill‑suited here for two reasons: with a single shared critic, the dense watch‑time signal tends to dominate the value estimates for the sparser interaction signals, and searching Lagrangian multipliers for multiple constraints simultaneously is prohibitively expensive.
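In generic CMDP notation (the symbols below are illustrative, not necessarily the paper's exact notation), the formulation can be written as:

```latex
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{\text{main}}(s_t, a_t)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{i}(s_t, a_t)\right] \ge C_i, \quad i = 1, \dots, m
```

where \(r_{\text{main}}\) is the watch‑time reward, \(r_i\) are the auxiliary interaction rewards (like, follow, comment, …), and \(C_i\) are the constraint thresholds.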
To address these challenges, they propose the Two‑Stage Constrained Actor‑Critic (TSCAC) algorithm. In Stage 1, separate policies are learned for each auxiliary interaction signal using distinct critics. In Stage 2, a policy is learned to maximize watch time while enforcing a distance constraint to the Stage 1 policies.
Stage 1 employs Temporal‑Difference (TD) loss for the critics and an advantage‑based loss for the actors.
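A minimal sketch of the Stage‑1 losses, in plain Python for clarity (function names, the scalar one‑step form, and the default discount factor are illustrative assumptions, not the paper's implementation):

```python
def td_critic_loss(v, v_next, reward, gamma=0.99):
    """Squared one-step TD error for one auxiliary critic.

    v, v_next: critic estimates V(s_t) and V(s_{t+1});
    reward: the auxiliary interaction signal observed at step t.
    """
    td_target = reward + gamma * v_next
    return (td_target - v) ** 2


def advantage_actor_loss(log_prob, v, v_next, reward, gamma=0.99):
    """Advantage-weighted policy-gradient loss for one auxiliary actor.

    The one-step advantage reuses the TD error; in practice it is treated
    as a constant (no gradient flows through the critic) when updating
    the actor, so a larger advantage upweights the chosen action.
    """
    advantage = reward + gamma * v_next - v
    return -advantage * log_prob
```

Each auxiliary signal gets its own critic/actor pair trained with these two losses, so no single critic has to trade off watch time against interactions.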
Stage 2 maximizes the main watch‑time reward while softly constraining the policy to stay close to the auxiliary policies. The optimal solution of the dual problem leads to a KL‑divergence regularization term.
The resulting loss minimizes the KL distance between the learned policy and the optimal solution, effectively weighting actions according to the auxiliary policies’ assessment of their importance.
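The weighting idea can be sketched as follows; this is a hypothetical simplification (the closed‑form target, the exponential advantage weight, and the per‑constraint exponents are assumptions for illustration, not the paper's exact loss):

```python
import math

def stage2_actor_loss(log_prob_main, advantage_main, aux_probs, lambdas):
    """Hypothetical sketch of the Stage-2 update.

    Assumes the optimal dual solution is proportional to
    exp(main advantage) times each auxiliary policy's probability for
    the action, raised to its constraint weight lambda_i. Minimizing KL
    to that target reduces to a weighted negative log-likelihood, so
    actions that the auxiliary (Stage-1) policies consider important
    receive larger gradient weight.
    """
    weight = math.exp(advantage_main)
    for p, lam in zip(aux_probs, lambdas):
        weight *= p ** lam  # auxiliary policies vote via their probabilities
    return -weight * log_prob_main
```

With a zero main advantage and auxiliary probabilities of 1, the loss degenerates to plain negative log‑likelihood; lowering an auxiliary probability shrinks the update for that action.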
Offline experiments on the public KuaiRand dataset compare TSCAC with Behavior Cloning Wide&Deep, DeepFM, the state‑of‑the‑art constrained RL method RCPO, and Pareto‑optimal recommendation. TSCAC achieves the highest watch‑time improvement and outperforms baselines on click, like, and comment metrics.
Online A/B tests on the Kuaishou short‑video platform use a Learning‑to‑Rank baseline. TSCAC shows statistically significant watch‑time gains (≈0.1%) over RCPO and improves all auxiliary interaction metrics, matching or surpassing an Interaction‑only Actor‑Critic.
The authors conclude that modeling recommendation as a constrained RL problem and applying the two‑stage TSCAC algorithm effectively balances the primary watch‑time objective with multiple interaction constraints, and they suggest extending TSCAC to other recommendation systems and deterministic policies as future work.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.