
Two‑Stage Constrained Actor‑Critic Reinforcement Learning for Short‑Video Recommendation and a Multi‑Task RL Framework

This article presents a two‑stage constrained actor‑critic reinforcement learning algorithm for short‑video recommendation. It models the problem as a constrained MDP, details the two stages of the algorithm, and reports offline and online experiments showing gains in watch time and interaction metrics, before introducing a reinforcement‑learning‑based multi‑task recommendation framework and its evaluation.

DataFunTalk

The presentation introduces three main parts: (1) a two‑stage constrained reinforcement learning algorithm for short‑video recommendation, (2) a reinforcement‑learning‑based multi‑task recommendation framework, and (3) a Q&A session.

Problem Modeling – Short‑video recommendation is cast as a constrained Markov Decision Process (CMDP) in which the agent (the recommender) interacts with users (the environment) across sessions. The primary objective is to maximize total watch time while satisfying interaction‑based constraints (likes, comments, shares, etc.). Existing constrained RL methods are ill‑suited here: with a single critic, the dense watch‑time signal dominates value estimation and drowns out the sparse interaction signals, and searching Lagrange multipliers for multiple constraints is prohibitively expensive.
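In symbols, the constrained objective takes the form below (generic notation for illustration, not necessarily the paper's exact symbols): the main reward is per‑step watch time, and each auxiliary reward is an interaction signal with a lower bound.

```latex
\max_{\pi}\;\mathbb{E}_{\tau \sim \pi}\Big[\sum_{t} r^{\text{main}}(s_t, a_t)\Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\tau \sim \pi}\Big[\sum_{t} r^{\text{aux}}_i(s_t, a_t)\Big] \ge C_i,
\qquad i = 1, \dots, m
```

Here \(r^{\text{main}}\) is watch time, \(r^{\text{aux}}_i\) are the interaction signals (like, comment, share, …), and \(C_i\) are the constraint thresholds.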

Two‑Stage Constrained Actor‑Critic (TSCAC) – Stage 1 learns a separate policy for each auxiliary interaction signal, each with its own critic. Stage 2 learns the main policy, maximizing watch time subject to KL‑divergence constraints that keep it close to the auxiliary policies. Solving the dual of this constrained problem yields a closed‑form optimal policy, and the training loss minimizes the KL divergence between the learned policy and that optimum. Experiments on the KuaiRand dataset and online A/B tests in the Kuaishou app show that TSCAC outperforms Pareto optimization, RCPO, and Learning‑to‑Rank baselines on both the primary and the auxiliary metrics.
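The Stage 2 trade‑off can be illustrated with a toy discrete‑action sketch. Everything here (`stage2_loss`, the three‑video example, the single "like" policy) is hypothetical; the actual TSCAC trains neural actor‑critic networks and uses the paper's closed‑form dual solution rather than this simplified penalty form.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def stage2_loss(main_logits, q_main, aux_policies, lambdas):
    """Toy Stage-2 objective: maximize the main (watch-time) critic's
    value while penalizing KL divergence from each Stage-1 policy."""
    pi = softmax(main_logits)
    # expected watch-time value under the main policy
    value = np.dot(pi, q_main)
    # sum of lambda_i * KL(pi || pi_aux_i) over auxiliary policies
    kl_penalty = sum(
        lam * np.sum(pi * np.log(pi / aux))
        for lam, aux in zip(lambdas, aux_policies)
    )
    return -value + kl_penalty  # loss to minimize

# toy data: 3 candidate videos, one auxiliary ("like") Stage-1 policy
q_main = np.array([1.0, 0.5, 0.2])        # watch-time Q-values
aux = softmax(np.array([0.2, 1.0, 0.1]))  # Stage-1 "like" policy
loss = stage2_loss(np.zeros(3), q_main, [aux], lambdas=[0.5])
```

Raising a `lambda_i` pulls the main policy harder toward that auxiliary policy, which is how the method trades watch time against interaction signals.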

Multi‑Task Reinforcement Learning (RMTL) – To address multi‑task recommendation, a session‑level MDP is built. The state representation network converts user‑item features into states; an actor network (any MTL model) outputs actions (CTR/CTCVR predictions); and a multi‑critic network provides adaptive loss weights. The framework is compatible with existing MTL models and is evaluated on KuaiRand, RetailRocket, and other public datasets, achieving higher AUC, lower log‑loss, and improved CTR/CTCVR performance compared with state‑of‑the‑art baselines.
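The critic‑weighted loss idea can be sketched as follows. This is a simplified illustration: `rmtl_style_loss` and the particular weighting `w_k = 1 - q_k` (down‑weighting tasks the critic already values highly) are assumptions made for the sketch, not the paper's exact formulation.

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    """Element-wise binary cross-entropy."""
    p = np.clip(y_pred, eps, 1 - eps)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def rmtl_style_loss(preds, labels, critic_q):
    """Multi-task loss with per-task weights derived from critic values.
    Weight w_k = 1 - q_k: the higher the critic scores task k's
    state-action pair, the less its loss contributes."""
    total = 0.0
    for pred_k, label_k, q_k in zip(preds, labels, critic_q):
        w = 1.0 - q_k
        total += w * bce(label_k, pred_k).mean()
    return total

# toy example: two tasks (CTR, CTCVR), two samples each
preds  = [np.array([0.9, 0.2]), np.array([0.3, 0.7])]
labels = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
loss = rmtl_style_loss(preds, labels, critic_q=[0.4, 0.6])
```

Because the weights come from critics rather than fixed hyperparameters, the same wrapper can sit on top of any MTL backbone that produces per‑task predictions, which is the compatibility the framework claims.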

Experiments – Offline experiments compare TSCAC and RMTL against Behavior Cloning, Wide&Deep, DeepFM, RCPO, and Pareto methods, showing significant gains in watch‑time, click, like, and comment metrics. Online A/B tests confirm a statistically significant watch‑time improvement of about 0.1%. RMTL's transferability is validated by pre‑training critics with different MTL backbone models, which consistently boosts AUC and reduces log‑loss.

Conclusions – Reinforcement learning with constrained optimization and multi‑task learning provides an effective solution for long‑term optimization of recommendation systems. Properly balancing primary and auxiliary objectives, handling sparse signals, and leveraging rich user features are critical for stable convergence and improved user experience.

Tags: multi-task learning, recommendation systems, reinforcement learning, short video, constrained optimization
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
