How Hyper‑Actor Critic Redefines Reinforcement Learning for Recommendation Systems
This article presents the Hyper‑Actor Critic (HAC) framework, which splits a reinforcement‑learning policy into a continuous hyper‑action and an effective action (the recommendation list), adds alignment and supervised losses, and outperforms existing RL and supervised methods on an online simulator.
Problem and Challenges
Reinforcement learning (RL) can optimize long‑term metrics in recommendation systems, but the action space—typically a recommendation list—is huge, discrete, and dynamic, making exploration and optimization difficult.
Proposed Solution: Hyper‑Actor Critic (HAC)
HAC decomposes the policy into two stages:
Generate a continuous hyper‑action (HA) vector using a policy network.
Transform the HA through a deterministic function that scores all candidate items, producing the effective action (EA) – the final recommendation list.
The HA acts as the parameters of a scoring function, enabling efficient exploration with DDPG in the continuous HA space while each HA uniquely determines an EA.
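The two stages can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names `score_items` and `effective_action`, the toy embeddings, and the list size `k` are assumptions chosen for the example, and the second stage uses top‑k selection only.

```python
from typing import List, Sequence


def score_items(hyper_action: Sequence[float],
                item_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Score every candidate item by its dot product with the hyper-action."""
    return [sum(h * e for h, e in zip(hyper_action, emb))
            for emb in item_embeddings]


def effective_action(hyper_action: Sequence[float],
                     item_embeddings: Sequence[Sequence[float]],
                     k: int) -> List[int]:
    """Deterministically map a hyper-action to a top-k list of item indices."""
    scores = score_items(hyper_action, item_embeddings)
    # Rank indices by descending score; ties keep the lower index first.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return ranked[:k]


# Hypothetical catalogue of four items with 2-dimensional embeddings.
items = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
ha = [0.9, 0.1]                        # stage 1: output of the policy network
ea = effective_action(ha, items, k=2)  # stage 2: deterministic scoring + top-k
print(ea)  # [0, 2]
```

Because the mapping is deterministic, exploring in the low‑dimensional HA space is enough to move between recommendation lists without ever enumerating them.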
Model Architecture
User static features and interaction history are encoded by a Transformer into the current state. An MLP maps this state to the HA vector, which serves as the weight vector for dot‑product scoring of item embeddings produced by an encoder. During inference, HA is sampled from a Gaussian distribution, while EA is obtained by softmax‑based categorical sampling or top‑k selection.
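The two sampling steps mentioned above can be sketched as follows. This is an assumption‑laden sketch: the helper names (`sample_hyper_action`, `sample_list`), the isotropic noise with a single `sigma`, and sampling without replacement from a softmax renormalized over the remaining items are illustrative choices, not details confirmed by the source.

```python
import math
import random
from typing import List, Sequence


def sample_hyper_action(mean: Sequence[float], sigma: float,
                        rng: random.Random) -> List[float]:
    """Draw a hyper-action from a Gaussian centred on the policy output."""
    return [rng.gauss(m, sigma) for m in mean]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over item scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_list(scores: Sequence[float], k: int,
                rng: random.Random) -> List[int]:
    """Draw k distinct items, each from the softmax over the remaining items."""
    remaining = list(range(len(scores)))
    chosen: List[int] = []
    for _ in range(min(k, len(remaining))):
        probs = softmax([scores[i] for i in remaining])
        pick = rng.choices(remaining, weights=probs, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


rng = random.Random(0)  # fixed seed so the sketch is reproducible
items = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]  # hypothetical embeddings
noisy_ha = sample_hyper_action([0.9, 0.1], sigma=0.1, rng=rng)
scores = [sum(h * e for h, e in zip(noisy_ha, emb)) for emb in items]
sampled_ea = sample_list(scores, k=2, rng=rng)
```

Replacing `sample_list` with a plain sort of `scores` gives the top‑k variant of the same step.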
A critic network (MLP) evaluates the long‑term value of EA, guiding the hyper‑actor’s learning. An inverse‑transform function, implemented as the average of item embeddings, maps EA back to HA for alignment.
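The inverse transform described above is simple enough to write out directly. A minimal sketch, assuming EA is a list of item indices and item embeddings are plain lists of floats (the function name `inverse_transform` and the toy values are hypothetical):

```python
from typing import List, Sequence


def inverse_transform(effective_action: Sequence[int],
                      item_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Map a recommendation list back to HA space as the mean item embedding."""
    if not effective_action:
        raise ValueError("effective_action must contain at least one item")
    dim = len(item_embeddings[0])
    n = len(effective_action)
    return [sum(item_embeddings[i][d] for i in effective_action) / n
            for d in range(dim)]


# Hypothetical catalogue; the list recommends items 0 and 2.
items = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
recovered_ha = inverse_transform([0, 2], items)  # approximately [0.85, 0.35]
```

The recovered vector lives in the same space as the policy's HA, which is what makes a distance‑based alignment between the two possible.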
Learning Objectives
The overall loss comprises four components:
TD loss for the critic, using EA as the target action.
Policy loss that updates only the hyper‑actor, guided by the HA‑critic.
Action‑alignment loss that forces the HA generated by the policy to match the HA recovered from EA via the inverse transform.
Supervised BCE loss that directly predicts items in EA, providing stable point‑wise supervision.
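To make the four terms concrete, the sketch below computes each one on scalar or list inputs without any autograd. Everything here is an illustrative assumption: the function names, the discount `gamma`, the use of a mean squared distance for alignment, and the choice of per‑item user feedback (1 = positive) as BCE labels. In a real training loop the critic and the hyper‑actor would be updated by gradient steps on these quantities.

```python
import math
from typing import Sequence


def td_loss(q_value: float, reward: float, next_q_value: float,
            gamma: float = 0.9) -> float:
    """Squared one-step temporal-difference error for the critic."""
    target = reward + gamma * next_q_value
    return (q_value - target) ** 2


def policy_loss(q_of_hyper_action: float) -> float:
    """DDPG-style actor loss: minimising it maximises the critic's value."""
    return -q_of_hyper_action


def alignment_loss(hyper_action: Sequence[float],
                   recovered_hyper_action: Sequence[float]) -> float:
    """Mean squared distance between the policy's HA and the HA recovered from EA."""
    diffs = [(a - b) ** 2 for a, b in zip(hyper_action, recovered_hyper_action)]
    return sum(diffs) / len(diffs)


def bce_loss(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Point-wise binary cross-entropy on raw item scores (logits)."""
    total = 0.0
    for s, y in zip(scores, labels):
        # Stable form of -[y*log(sigmoid(s)) + (1-y)*log(1-sigmoid(s))].
        total += max(s, 0.0) - s * y + math.log1p(math.exp(-abs(s)))
    return total / len(scores)


# Hypothetical values for one transition and a two-item list.
losses = {
    "td": td_loss(q_value=0.5, reward=1.0, next_q_value=2.0, gamma=0.5),
    "policy": policy_loss(2.0),
    "alignment": alignment_loss([0.0, 0.0], [3.0, 4.0]),
    "bce": bce_loss([0.0, 0.0], [1, 0]),
}
```

Note that only the TD term trains the critic, while the other three act on the hyper‑actor side, so they are kept separate here rather than summed with made‑up weights.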
Online Simulator Experiments
HAC was evaluated on a public‑dataset‑based online simulator and outperformed both supervised baselines and major RL frameworks. Compared with DDPG‑RA, which also splits HA and EA but aligns actions in the EA space, HAC aligns in the HA space, yielding higher rewards and lower variance.
Ablation Study
Policy decomposition reduces performance relative to a single‑action DDPG, highlighting the instability introduced by the two‑stage design.
Adding supervised learning stabilizes training, reduces reward variance, and improves results over plain DDPG.
Action alignment further boosts performance but may slow convergence.
Action Exploration Insights
HA exploration is controlled by the variance of the Gaussian from which HA is sampled: too small a variance makes exploration inefficient, while too large a variance harms accuracy. For EA exploration, categorical sampling was compared with top‑k selection, and sampling consistently performed better.
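One way to build intuition for the variance trade‑off is to measure how much of the noise‑free top‑k list survives once Gaussian noise is added to HA. The sketch below is a toy experiment on random embeddings, not a reproduction of the article's results; the catalogue, the `mean_overlap` helper, and the `sigma` values are all assumptions.

```python
import random
from typing import List, Sequence


def top_k(hyper_action: Sequence[float],
          item_embeddings: Sequence[Sequence[float]], k: int) -> List[int]:
    """Indices of the k items with the highest dot-product score."""
    scores = [sum(h * e for h, e in zip(hyper_action, emb))
              for emb in item_embeddings]
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def mean_overlap(mean_ha: Sequence[float],
                 item_embeddings: Sequence[Sequence[float]],
                 k: int, sigma: float, trials: int, seed: int) -> float:
    """Average fraction of the noise-free top-k list kept after noising HA."""
    rng = random.Random(seed)
    base = set(top_k(mean_ha, item_embeddings, k))
    kept = 0.0
    for _ in range(trials):
        noisy = [rng.gauss(m, sigma) for m in mean_ha]
        kept += len(base & set(top_k(noisy, item_embeddings, k))) / k
    return kept / trials


# Hypothetical catalogue: 50 random items with 4-dimensional embeddings.
gen = random.Random(0)
catalogue = [[gen.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(50)]
mean_ha = [0.5, -0.2, 0.1, 0.8]
for sigma in (0.0, 0.1, 1.0):
    overlap = mean_overlap(mean_ha, catalogue, k=5, sigma=sigma,
                           trials=200, seed=1)
    print(f"sigma={sigma}: overlap={overlap:.3f}")
```

With `sigma=0.0` the list never changes, so nothing new is explored. Larger values typically push the list further from the policy's own ranking, which is the accuracy cost mentioned above.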
Conclusion
The Hyper‑Actor Critic framework addresses the large, discrete, and dynamic action space of RL‑based recommendation systems by decomposing the policy into a hyper‑action generator and an effective‑action generator and jointly optimizing exploration in both spaces. Experiments on an online simulator validate its effectiveness and its compatibility with existing RL solutions. The hyper‑action concept originates from hyper‑networks and can be extended to more complex scoring functions.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
