Hierarchical Semantic RL Tackles Dynamic Action Spaces in Recommendations

Researchers from Kuaishou, Fudan and Tianjin University introduce the Hierarchical Semantic Reinforcement Learning (HSRL) framework, which maps high-dimensional, dynamic item spaces into a fixed-size semantic action space via semantic IDs, employs a hierarchical policy network and multi-level critics, and demonstrates 13.4% and 6.7% total-reward gains on two public benchmarks and an 18.4% ad-spend lift in billion-scale online tests.


In modern recommendation systems, the action space can explode because each candidate item is treated as an independent action, making policy learning unstable and inefficient, especially when the catalog updates continuously. Reinforcement learning (RL) promises to optimize long‑term user value, but the massive, dynamic action space remains a fundamental obstacle.

Hierarchical Semantic Reinforcement Learning (HSRL) Framework

The authors propose HSRL, which constructs a fixed‑dimensional semantic action space by assigning each item a hierarchical semantic ID. This transforms the original high‑dimensional, dynamic item space into a discrete, stable token set, enabling RL agents to operate on a compact, semantically structured action space.

Key Technical Components

Semantic Action Space (SAS): Items are encoded into a sequence of semantic IDs using a hierarchical clustering method (e.g., RQ-Kmeans). The resulting token set has a constant dimension regardless of catalog size, allowing zero-shot generalization to new or long-tail items.
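The residual-quantization idea behind RQ-Kmeans can be sketched in a few lines: cluster the item embeddings, subtract each item's assigned centroid, and cluster the leftover residuals at the next level. The snippet below is a minimal illustration, not the paper's implementation; the function names, the plain NumPy k-means, and the default codebook sizes are all assumptions for the sketch.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain k-means in NumPy: returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # distance of every point to every centroid, then nearest assignment
        dist = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = x[assign == j].mean(axis=0)
    return centroids, assign

def rq_kmeans_ids(item_embeddings, levels=3, codebook_size=8, seed=0):
    """Assign each item a hierarchical semantic ID (one token per level)
    by clustering the residuals left over from the previous level."""
    residual = np.asarray(item_embeddings, dtype=np.float64).copy()
    ids = []
    for level in range(levels):
        centroids, codes = kmeans(residual, codebook_size, seed=seed + level)
        residual = residual - centroids[codes]   # pass residual to the next level
        ids.append(codes)
    return np.stack(ids, axis=1)                 # shape: (num_items, levels)
```

Because the output always has `levels` tokens drawn from fixed codebooks, the action space stays the same size no matter how many items the catalog contains.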

Hierarchical Policy Network (HPN): A strictly autoregressive policy generates tokens level by level. After producing a higher-level token, a residual-state module subtracts the already-decided semantic information from the global state, yielding a residual state for the next level. This coarse-to-fine decision process narrows each layer's focus and improves precision.
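The coarse-to-fine loop can be sketched as follows. This is a toy illustration under stated assumptions: the random linear heads, the subtraction-based residual-state module, and the class name are all placeholders for what would be learned networks in the actual HPN.

```python
import numpy as np

class HierarchicalPolicySketch:
    """Toy coarse-to-fine policy: sample one token per level, then subtract
    the chosen token's embedding from the state (residual-state module) so
    deeper levels only see what remains to be decided."""

    def __init__(self, state_dim, levels, vocab, seed=0):
        rng = np.random.default_rng(seed)
        # per-level output head and per-token embedding in state space
        self.heads = [rng.normal(scale=0.1, size=(state_dim, vocab))
                      for _ in range(levels)]
        self.token_emb = [rng.normal(scale=0.1, size=(vocab, state_dim))
                          for _ in range(levels)]

    def act(self, state, rng):
        tokens = []
        s = np.asarray(state, dtype=np.float64).copy()
        for head, emb in zip(self.heads, self.token_emb):
            logits = s @ head
            p = np.exp(logits - logits.max())
            p /= p.sum()
            t = int(rng.choice(len(p), p=p))  # sample this level's token
            s = s - emb[t]                    # residual state for the next level
            tokens.append(t)
        return tokens
```

The key design point the sketch preserves is that each level conditions on a state from which the previous decisions have already been removed, rather than on the raw global state.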

Multi-Level Critic Network (MLC): Each intermediate token receives a local value estimate. A learnable aggregation mechanism distributes the sparse final reward back to every decision step, mitigating the credit-assignment problem in long sequences and stabilizing gradient estimation.
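A minimal sketch of the reward-redistribution idea: split the episode-end reward across decision levels with softmax weights, then give each level a reward-to-go target. The fixed logits and function names here are illustrative assumptions; in the paper's MLC the aggregation weights are learned.

```python
import numpy as np

def redistribute_reward(final_reward, level_logits):
    """Split a sparse episode-end reward across decision levels with
    softmax weights (learnable in an MLC; fixed logits in this sketch),
    so every intermediate token gets a dense training signal."""
    w = np.exp(level_logits - np.max(level_logits))
    w /= w.sum()
    return final_reward * w          # one reward share per level

def critic_targets(final_reward, level_logits):
    """Per-level value targets: each level's share plus all later shares
    (a reward-to-go under the redistribution)."""
    shares = redistribute_reward(final_reward, level_logits)
    return shares[::-1].cumsum()[::-1]
```

Since the shares sum to the original reward, the redistribution densifies the signal without changing the total return being optimized.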

Experimental Evaluation

HSRL was evaluated on two public sequential recommendation benchmarks (RL4RS and ML1M) and on a billion‑scale industrial dataset from Kuaishou’s short‑video ad platform.

Offline Results

On RL4RS, HSRL achieved a total reward of 12.013, a 13.4% improvement over the strongest baseline (HAC).

On ML1M, HSRL reached a total reward of 18.773, surpassing the best baseline (CHIRP) by 6.7%.

Ablation Study

Removing entropy regularization reduced total reward by 18.7%.

Removing the hierarchical policy (HPN) caused a 15.3% drop.

Removing the multi‑level critic (MLC) decreased reward by 10.6%.

Omitting behavior‑cloning loss led to a slight performance decline, confirming the benefit of supervised signals in early training.

Online A/B Test

HSRL was deployed in the precision‑ranking stage of Kuaishou’s ad platform. In a 7‑day A/B experiment covering 15% of traffic, the framework increased expected ad spend by 18.4%, demonstrating both engineering robustness at billion‑user scale and a substantial boost to long‑term business value.

Conclusion and Outlook

The HSRL framework shows that constructing a fixed, semantically structured action space, combined with hierarchical decision making and fine‑grained credit assignment, can overcome the long‑standing dynamic action‑space bottleneck in industrial‑scale recommendation. The work suggests that semantic‑aware sequential decision making will be a key direction for next‑generation, scalable, and sustainable intelligent recommendation systems.

Paper title: Hierarchical Semantic RL: Tackling the Problem of Dynamic Action Space for RL‑based Recommendations

Paper link: https://arxiv.org/abs/2510.09167

Code repository: https://github.com/MinmaoWang/HSRL


Tags: large-scale deployment, hierarchical policy, industrial experiments, multi-level critic, semantic action space
Written by Kuaishou Tech, the official Kuaishou tech account providing real-time updates on the latest Kuaishou technology practices.