How Hyper‑Actor Critic Redefines Reinforcement Learning for Recommendation Systems

This article presents the Hyper‑Actor Critic (HAC) framework, which decomposes a reinforcement‑learning policy into a continuous hyper‑action and the effective action it determines (the recommendation list). HAC adds an action‑alignment loss and a supervised loss, and it outperforms existing RL and supervised methods on an online simulator.

Kuaishou Tech

Problem and Challenges

Reinforcement learning (RL) can optimize long‑term metrics in recommendation systems, but each action is typically a whole recommendation list, so the action space is huge, discrete, and dynamic, which makes exploration and optimization difficult.

Proposed Solution: Hyper‑Actor Critic (HAC)

HAC decomposes the policy into two stages:

Generate a continuous hyper‑action (HA) vector using a policy network.

Transform the HA through a deterministic function that scores all candidate items, producing the effective action (EA) – the final recommendation list.

The HA acts as the parameters of the scoring function. This allows efficient DDPG‑style exploration in the continuous HA space, while each HA still determines a unique EA when the list is selected deterministically (for example, by top‑k).
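The two stages can be illustrated with a minimal sketch. It assumes plain Python lists for the HA vector and item embeddings and uses deterministic top‑k selection; the function names `score_items` and `top_k` are hypothetical and are not taken from the HAC implementation.

```python
from typing import List, Sequence


def score_items(hyper_action: Sequence[float],
                item_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Score every candidate item by its dot product with the hyper-action."""
    return [sum(h * e for h, e in zip(hyper_action, emb))
            for emb in item_embeddings]


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring items (the effective action)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy example: a 2-dimensional hyper-action and three candidate items.
hyper_action = [1.0, 0.0]
item_embeddings = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
scores = score_items(hyper_action, item_embeddings)
effective_action = top_k(scores, k=2)
```

Because the candidate set enters only through `item_embeddings`, the same hyper‑action can score a pool whose contents change over time, which is one way to read the claim that the HA space sidesteps the dynamic action space.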

Model Architecture

A Transformer encodes the user's static features and interaction history into the current state. An MLP maps this state to the HA vector, which serves as the weight vector in a dot product with each candidate item's embedding (produced by an item encoder) to score that item. During inference, the HA is sampled from a Gaussian distribution, and the EA is obtained from the item scores either by softmax‑based categorical sampling or by top‑k selection.

A critic network (MLP) evaluates the long‑term value of EA, guiding the hyper‑actor’s learning. An inverse‑transform function, implemented as the average of item embeddings, maps EA back to HA for alignment.
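The inverse transform described above can be sketched as mean pooling over the embeddings of the items in the EA. This is a minimal illustration under the assumption that the EA is a list of item indices and that the embeddings share the HA's dimensionality; `inverse_transform` is a hypothetical name.

```python
from typing import List, Sequence


def inverse_transform(effective_action: Sequence[int],
                      item_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Map an effective action (item indices) back to the hyper-action space
    by averaging the embeddings of the selected items."""
    if not effective_action:
        raise ValueError("effective_action must contain at least one item")
    dim = len(item_embeddings[0])
    k = len(effective_action)
    return [sum(item_embeddings[i][d] for i in effective_action) / k
            for d in range(dim)]


# Toy example: items 0 and 2 were recommended.
item_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
recovered_ha = inverse_transform([0, 2], item_embeddings)
```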

Learning Objectives

The overall loss comprises four components:

TD loss for the critic, using EA as the target action.

Policy loss that updates only the hyper‑actor, guided by the critic's value estimate for the HA.

Action‑alignment loss that forces the HA generated by the policy to match the HA recovered from EA via the inverse transform.

Supervised BCE loss that directly predicts items in EA, providing stable point‑wise supervision.
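Three of these terms can be written down for a single transition as a rough sketch. The article does not give the exact formulas, so the squared TD error, the mean squared distance for alignment, the discount `gamma`, and the use of binary user feedback as BCE labels are all assumptions made for illustration. The policy loss is omitted because it needs gradients through a critic network.

```python
import math
from typing import Sequence


def td_loss(q_value: float, reward: float, next_q_value: float,
            gamma: float = 0.9, done: bool = False) -> float:
    """Squared one-step TD error for the critic (assumed form)."""
    target = reward + (0.0 if done else gamma * next_q_value)
    return (q_value - target) ** 2


def alignment_loss(hyper_action: Sequence[float],
                   recovered: Sequence[float]) -> float:
    """Mean squared distance between the policy's HA and the HA recovered
    from the EA by the inverse transform (assumed distance measure)."""
    return sum((z - r) ** 2
               for z, r in zip(hyper_action, recovered)) / len(hyper_action)


def bce_loss(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Point-wise binary cross-entropy on raw item scores (logits).
    Uses the numerically stable form max(s, 0) - s*y + log(1 + exp(-|s|))."""
    total = 0.0
    for s, y in zip(scores, labels):
        total += max(s, 0.0) - s * y + math.log1p(math.exp(-abs(s)))
    return total / len(scores)
```

How the four terms are weighted against each other is not stated in the article, so any combined objective built from these functions would need its own tuning.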

Online Simulator Experiments

HAC was evaluated on a public‑dataset‑based online simulator and outperformed both supervised baselines and major RL frameworks. Compared with DDPG‑RA, which also splits HA and EA but aligns actions in the EA space, HAC aligns in the HA space, yielding higher rewards and lower variance.

Ablation Study

Policy decomposition reduces performance relative to a single‑action DDPG, highlighting the instability introduced by the two‑stage design.

Adding supervised learning stabilizes training, reduces reward variance, and improves results over plain DDPG.

Action alignment further boosts performance but may slow convergence.

Action Exploration Insights

HA exploration is controlled by the variance of the Gaussian from which the HA is sampled: too small a variance hampers exploration efficiency, while too large a variance harms accuracy. For EA exploration, categorical sampling was compared with top‑k selection, and sampling consistently performed better.
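Both kinds of exploration can be sketched in a few lines. The sketch assumes isotropic Gaussian noise with standard deviation `sigma` around the policy output, and it draws the list by repeated softmax sampling without replacement; the article does not say whether items are drawn with or without replacement, so that detail and all function names are illustrative.

```python
import math
import random
from typing import List, Sequence


def sample_hyper_action(mean: Sequence[float], sigma: float,
                        rng: random.Random) -> List[float]:
    """HA exploration: perturb the policy output with Gaussian noise."""
    return [rng.gauss(m, sigma) for m in mean]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over item scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_list(scores: Sequence[float], k: int,
                rng: random.Random) -> List[int]:
    """EA exploration: draw k distinct items, one at a time, with
    probability proportional to the softmax of the remaining scores."""
    remaining = list(range(len(scores)))
    chosen: List[int] = []
    for _ in range(min(k, len(remaining))):
        probs = softmax([scores[i] for i in remaining])
        pick = rng.choices(remaining, weights=probs, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


rng = random.Random(0)  # fixed seed so the sketch is reproducible
noisy_ha = sample_hyper_action([0.5, -1.0], sigma=0.1, rng=rng)
sampled_list = sample_list([2.0, 1.0, 0.0, -1.0], k=3, rng=rng)
```

Setting `sigma` to zero removes HA exploration entirely, and replacing `sample_list` with a deterministic top‑k removes EA exploration, which makes the two knobs easy to study separately.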

Conclusion

The Hyper‑Actor Critic framework addresses the large, discrete, and dynamic action space of RL‑based recommendation systems by decomposing the policy into a hyper‑action generator and an effective‑action generator and by jointly optimizing exploration in both spaces. Experiments on an online simulator validate its effectiveness and its compatibility with existing RL solutions. The hyper‑action concept originates from hyper‑networks and can be extended to more complex scoring functions.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
