How Supervised Learning‑Enhanced Multi‑Group Actor‑Critic Boosts Live Stream Allocation in Short‑Video Feeds

This article presents SL-MGAC, a supervised-learning-enhanced multi-group Actor-Critic framework that improves live-stream insertion decisions in mixed short-video and live-stream recommendation feeds. The method trains more stably than standard RL baselines and improves long-term user engagement while satisfying platform constraints, as validated by extensive offline and online experiments.


Research Background

In mixed short‑video and live‑stream recommendation scenarios, the live‑stream recommendation system must decide whether to insert a live stream for each user request. Poor allocation harms long‑term user experience, reducing session length and retention. Traditional reinforcement learning (RL) methods suffer from poor convergence and instability, especially at industrial scale.

Problem Definition

The goal is a dual‑objective optimization: maximize live‑stream viewing time while respecting platform constraints that prevent overall app usage decline. This is modeled as a constrained Markov Decision Process where the reward is the posterior live‑stream watch time and the constraint measures the difference between short‑video and live‑stream durations.
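A common way to handle such a constrained MDP is Lagrangian relaxation: fold the constraint into the reward as a penalty term. The sketch below illustrates this idea under stated assumptions; the function and variable names (`lagrangian_reward`, `lam`, the absolute-gap cost) are illustrative, not taken from the paper.

```python
# Hypothetical sketch of scalarizing the constrained objective described
# above via Lagrangian relaxation. All names are illustrative assumptions.

def lagrangian_reward(live_watch_time: float,
                      short_video_time: float,
                      live_stream_time: float,
                      lam: float) -> float:
    """Reward = posterior live-stream watch time, minus a penalty
    proportional to the short-video vs. live-stream duration gap."""
    constraint_cost = abs(short_video_time - live_stream_time)
    return live_watch_time - lam * constraint_cost
```

The Lagrange multiplier `lam` trades off live-stream watch time against the duration-gap constraint; in practice it can be tuned or updated by dual ascent.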

Proposed Method (SL‑MGAC)

The Kuaishou team introduces a Supervised Learning‑enhanced Multi‑Group Actor‑Critic (SL‑MGAC) algorithm. It consists of three modules:

User & Live-stream Feature Extraction: merges static, ID, and sequential features via target attention to produce a fused embedding.

Multi-Group State Decomposition (MGSD): partitions the state space into user-activity groups identified by prior analysis, enabling differentiated state representations for the Actor-Critic network.

Supervised Learning-enhanced Actor-Critic: splits Q-value estimation into a reward component and a residual component, and applies bucketed duration discretization, sigmoid normalization, and variance-reduction techniques to stabilize training.
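The Q-decomposition idea can be sketched as a critic with two heads: a supervised head that classifies the immediate watch time into duration buckets, and a residual head whose output is squashed through a sigmoid so the long-term remainder stays bounded. This is a minimal illustration, not the exact SL-MGAC architecture; layer sizes, the bucket count, and head names are assumptions.

```python
import torch
import torch.nn as nn

class DecomposedCritic(nn.Module):
    """Sketch: Q(s) = expected immediate reward + bounded residual value."""

    def __init__(self, state_dim: int, num_buckets: int = 10):
        super().__init__()
        # Supervised reward head: distribution over discretized
        # watch-time buckets (trainable with a classification loss).
        self.reward_head = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, num_buckets),
        )
        # Residual head: estimates the remaining long-term value.
        self.residual_head = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )
        # Representative value of each bucket (assumed uniform in [0, 1]).
        self.register_buffer(
            "bucket_values", torch.linspace(0.0, 1.0, num_buckets))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Expected immediate reward under the predicted bucket distribution.
        probs = torch.softmax(self.reward_head(state), dim=-1)
        reward = (probs * self.bucket_values).sum(-1, keepdim=True)
        # Sigmoid normalization keeps the residual in (0, 1),
        # which helps stabilize critic training.
        residual = torch.sigmoid(self.residual_head(state))
        return reward + residual
```

Bucketing turns a heavy-tailed regression target (raw watch time) into a classification problem, which is one of the variance-reduction tricks the section above alludes to.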

The overall loss combines the supervised reward loss, residual Q‑loss, and Lagrangian penalty for the constraint.
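A hedged sketch of that combined objective is below. The loss weights, tensor shapes, and function name are assumptions for illustration; only the three-term structure (supervised reward loss + residual Q-loss + Lagrangian penalty) comes from the description above.

```python
import torch
import torch.nn.functional as F

def sl_mgac_loss(reward_logits: torch.Tensor,
                 reward_bucket_labels: torch.Tensor,
                 q_pred: torch.Tensor,
                 q_target: torch.Tensor,
                 constraint_cost: torch.Tensor,
                 lam: float,
                 w_sup: float = 1.0,
                 w_q: float = 1.0) -> torch.Tensor:
    # Supervised reward loss: cross-entropy over discretized
    # watch-time buckets.
    sup_loss = F.cross_entropy(reward_logits, reward_bucket_labels)
    # Residual Q-loss: TD-style regression toward the target Q-value.
    q_loss = F.mse_loss(q_pred, q_target)
    # Lagrangian penalty enforcing the platform constraint.
    penalty = lam * constraint_cost.mean()
    return w_sup * sup_loss + w_q * q_loss + penalty
```

In practice the target Q-value would come from a frozen target network, and the multiplier `lam` could be fixed or updated online.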

Experimental Results

Offline evaluations on a Kuaishou overseas dataset and extensive online A/B tests demonstrate that SL‑MGAC outperforms baseline methods, including standard Learning‑to‑Rank and single‑step RL models, in both effectiveness and stability. The model also shows reduced daily variance in live‑stream exposure compared to the SAC baseline.

Conclusion and Outlook

SL‑MGAC successfully brings stable RL techniques to the final decision layer of a large‑scale recommendation system, offering a solution applicable to other domains such as advertising, e‑commerce, and user growth push. The framework has been deployed in the Kuaishou overseas recommendation pipeline.

Figure: System Overview
Figure: SL‑MGAC Architecture
Tags: reinforcement learning, actor-critic, supervised learning, KDD 2025, live stream recommendation
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
