Artificial Intelligence

Two‑Stage Constrained Actor‑Critic for Short‑Video Recommendation and a Reinforcement‑Learning Multi‑Task Recommendation Framework

The presentation introduces a two‑stage constrained actor‑critic algorithm that learns auxiliary policies for interaction signals before optimizing watch‑time under KL constraints, and a reinforcement‑learning multi‑task learning framework that models session‑level dynamics with adaptive multi‑critic weighting, both achieving significant offline and online gains in short‑video recommendation.

Sohu Tech Products

Today’s presentation is organized into three parts: (1) a two‑stage constrained reinforcement‑learning algorithm for short‑video recommendation, (2) a reinforcement‑learning based multi‑task recommendation framework, and (3) a Q&A session.

1. Two‑Stage Constrained Actor‑Critic (TSCAC) Algorithm

The problem is modeled as a constrained Markov Decision Process (CMDP) whose primary objective is to maximize total watch time while satisfying interaction‑based constraints (likes, comments, shares, etc.). Existing constrained RL methods are unsuitable here for two reasons: with a single critic, the dense watch‑time signal dominates the sparse interaction signals, and handling multiple constraints via Lagrangian multipliers requires a costly hyper‑parameter search over the multiplier values.
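To make the CMDP objective concrete, here is a minimal toy sketch of evaluating a session trajectory: the main return is discounted watch time, and each auxiliary interaction signal accumulates its own return that must meet a threshold. All names, the trajectory format, and the threshold convention are illustrative assumptions, not details from the talk.

```python
def cmdp_return(trajectory, gamma=0.99):
    """Discounted main return (watch time) and per-signal auxiliary returns.

    `trajectory` is a list of (watch_time, interactions) steps, where
    `interactions` maps a signal name (like/comment/share) to its reward.
    """
    main = 0.0
    aux = {}
    for t, (watch_time, interactions) in enumerate(trajectory):
        main += (gamma ** t) * watch_time
        for name, r in interactions.items():
            aux[name] = aux.get(name, 0.0) + (gamma ** t) * r
    return main, aux

def satisfies_constraints(aux_returns, thresholds):
    """A policy is feasible iff every constrained auxiliary return
    meets its threshold (the CMDP feasibility condition)."""
    return all(aux_returns.get(k, 0.0) >= v for k, v in thresholds.items())
```

The point of the sketch is the structure of the problem: one unconstrained objective (`main`) plus a feasibility check over several auxiliary returns, which is exactly what makes a single shared critic awkward.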

The proposed TSCAC consists of:

Stage One: For each auxiliary interaction signal, learn a separate policy that optimizes that signal using its own critic.

Stage Two: Learn a main policy that maximizes watch time while staying close (KL‑constrained) to the auxiliary policies learned in Stage One. The optimal solution of the dual problem is derived, leading to a loss that minimizes the KL divergence to the optimal policy.
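As a toy illustration of the Stage‑Two idea over a discrete candidate set: the main policy chases the watch‑time critic while a KL term keeps it near each Stage‑One auxiliary policy. The exponential‑tilting target below is a common closed form for such KL‑regularized objectives; the exact form derived in the talk may differ, and all parameter names here are assumptions.

```python
import numpy as np

def stage_two_target(q_main, aux_policies, betas, temperature=1.0):
    """Closed-form target over a discrete action set:
    target(a) proportional to exp(q_main(a)/temperature)
    times the product of aux_i(a)**beta_i over auxiliary policies."""
    logits = np.asarray(q_main, dtype=float) / temperature
    for pi_aux, beta in zip(aux_policies, betas):
        logits += beta * np.log(np.asarray(pi_aux, dtype=float) + 1e-12)
    logits -= logits.max()            # numerical stability before exp
    p = np.exp(logits)
    return p / p.sum()

def kl_loss(policy, target):
    """KL(target || policy): the divergence the main policy minimizes."""
    policy = np.asarray(policy, dtype=float) + 1e-12
    target = np.asarray(target, dtype=float) + 1e-12
    return float(np.sum(target * (np.log(target) - np.log(policy))))
```

Minimizing `kl_loss` drives the main policy toward a target that trades off the watch‑time Q‑values against closeness to the auxiliary policies, which is the role the KL constraints play in Stage Two.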

Offline evaluation on the KuaiRand dataset shows that TSCAC outperforms Pareto‑optimal methods, RCPO, and standard actor‑critic baselines on watch time and all interaction metrics. Online A/B tests in the Kuaishou app confirm statistically significant improvements over the production Learning‑to‑Rank baseline.

2. Reinforcement‑Learning Multi‑Task Learning (RMTL) Framework

RMTL addresses the limitation of existing multi‑task learning (MTL) recommendation models that ignore session‑level dynamics. It builds a session‑based MDP where states consist of user‑item features, actions are predicted CTR/CTCVR values, and rewards are negative binary cross‑entropy losses.
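A minimal sketch of one transition in this session‑level MDP, under the definitions above: the state is a user‑item feature vector, the action is the model's predicted CTR/CTCVR pair, and each task's reward is the negative binary cross‑entropy against the observed label. Field names are illustrative assumptions.

```python
import math
from dataclasses import dataclass

def task_reward(pred, label, eps=1e-12):
    """Negative binary cross-entropy: closer to 0 when `pred` fits `label`."""
    pred = min(max(pred, eps), 1.0 - eps)   # clamp to avoid log(0)
    return label * math.log(pred) + (1.0 - label) * math.log(1.0 - pred)

@dataclass
class Transition:
    state: list        # user-item features at this session step
    action: tuple      # (predicted_ctr, predicted_ctcvr)
    rewards: tuple     # per-task negative-BCE rewards
    next_state: list   # features after the user's next interaction
```

Casting the prediction losses as rewards is what lets a critic score how well the model serves each task across the whole session rather than per impression.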

The architecture includes:

A state‑representation network (embedding + MLP) that converts raw features into a state vector.

An actor network that can be any base MTL model, outputting a task‑specific action vector.

A multi‑critic network (parallel MLPs) that estimates Q‑values for each task and provides adaptive loss weights.
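One plausible instantiation of the adaptive weighting, sketched below: each task's critic Q‑value gauges how well the current policy serves that task, and tasks with lower Q‑values receive a larger share of the total loss via a softmax over negative Q‑values. This scheme and the temperature parameter are assumptions for illustration, not necessarily the talk's exact rule.

```python
import numpy as np

def adaptive_weights(q_values, temperature=1.0):
    """Loss weights that sum to 1, larger for lower-Q (worse-served) tasks."""
    z = -np.asarray(q_values, dtype=float) / temperature
    z -= z.max()                      # numerical stability before exp
    w = np.exp(z)
    return w / w.sum()

def weighted_mtl_loss(task_losses, q_values):
    """Total multi-task loss with critic-derived adaptive weights."""
    w = adaptive_weights(q_values)
    return float(np.dot(w, np.asarray(task_losses, dtype=float)))
```

The design intent is the same as in the framework above: instead of hand‑tuned static task weights, the critics reallocate gradient budget toward tasks the policy currently handles poorly.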

Experiments on public datasets (RetailRocket and KuaiRand) demonstrate that RMTL consistently improves AUC, log‑loss, and s‑log‑loss over state‑of‑the‑art MTL baselines. Transferability studies show that pre‑trained critics boost the performance of various MTL backbones, and ablation experiments confirm the effectiveness of the adaptive weighting scheme.

3. Q&A Highlights

Q1: What loss functions are used for watch‑time and interaction signals? A: Watch‑time is treated as a regression task after a coarse classification of video length; interaction signals use standard classification losses. Offline evaluation focuses on AUC/GAUC, while online metrics monitor watch‑time directly.

Q2: How are extremely sparse signals (e.g., retention) handled? A: Correlated real‑time signals (such as immediate watch time) serve as proxies; optimizing them indirectly improves the sparse long‑term metric.

Q3: Does the RL model use fine‑grained features like user ID? A: Besides user ID, many statistical and contextual features are incorporated; the RL component operates at later ranking stages, so it does not rely solely on raw IDs.

Overall, the talk illustrates how constrained RL and session‑level MDP modeling can effectively address multi‑objective optimization in large‑scale short‑video recommendation systems.

Tags: multi-task learning, recommendation systems, reinforcement learning, actor-critic, short video, constrained optimization
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
