Exact‑K Recommendation: Graph Attention Networks and RL from Demonstrations Explained
This article introduces the Exact‑K recommendation problem, highlights its differences from traditional Top‑K approaches, and presents a novel solution combining Graph Attention Networks (GAttN) with Reinforcement Learning from Demonstrations (RLfD), backed by extensive experiments showing superior performance on real-world datasets.
1. Background
Traditional recommendation models usually follow a Top‑K paradigm: each item’s click‑through rate (CTR) is estimated, items are sorted, and the top K are shown to the user. This simple approach ignores the intrinsic relationships among recommended items.
The goal of this work is to address a novel yet widespread problem called Exact‑K recommendation, which focuses on a constrained combinatorial optimization to select a set of K items that maximizes the probability of user clicks while satisfying constraints such as diversity.
2. Problem Definition
2.1 Exact‑K Recommendation
Given a candidate set S containing N items, the objective is to choose a subset of K items and present them as a single screen to the user, maximizing the overall click probability. Pairwise constraints between items may be required, which can be expressed as a graph where nodes represent items and edges exist only when the constraint is satisfied. The Exact‑K problem thus reduces to finding a K‑clique (a complete subgraph of K nodes) with maximum weight.
3. Method
3.1 Graph Attention Networks (GAttN)
The proposed model, GAttN, adopts an encoder‑decoder architecture. The encoder uses a Transformer‑style multi‑head self‑attention mechanism (instead of RNNs) because the items are unordered but still influence each other. The encoder consists of three sub‑layers: input embedding, multi‑head self‑attention, and a feed‑forward network.
The decoder generates the K‑item set A, which corresponds to a K‑clique in the graph. It employs a recurrent neural network and a pointer‑network style attention to sequentially select items, ensuring that each newly selected item is compatible with previously chosen items via a masking operation.
3.2 Reinforcement Learning from Demonstrations (RLfD)
Supervised learning alone (cross‑entropy loss) does not directly optimize the screen‑level click probability, creating a gap between training objective and real goal. To bridge this gap, the method first pre‑trains the network with demonstrations (Learning from Demonstrations) and then fine‑tunes it using policy‑gradient reinforcement learning (Learning from Rewards).
During supervised pre‑training, historical logs provide real item sequences, and the decoder is trained to predict the next item given the previous ground‑truth item, mitigating exposure bias.
For reinforcement learning, the state is the encoder output (item embeddings), the action is the selected K‑item set, and the policy is the joint probability distribution over items. A reward estimator, trained as a binary classifier on historical data, predicts the reward for any (user, item‑set) pair. The policy is updated with REINFORCE, using a mask to prevent invalid selections and a hill‑climbing strategy that keeps the best of five sampled candidates.
The final loss combines supervised and reinforcement components, weighted by a hyper‑parameter λ that gradually shifts emphasis from supervision to reinforcement during training.
4. Experiments
4.1 Evaluation Metrics
Standard ranking metrics (NDCG, MAP) are unsuitable for Exact‑K. Two offline metrics are defined: Precision@K (accuracy‑based) and Hit Ratio@K (coverage‑based), measuring how many of the K recommended items match the user’s true preferences.
4.2 Baselines
Three typical ranking learning methods are compared: Pointwise, Pairwise, and Listwise. Listwise (e.g., DLCM) aligns best with Exact‑K because it models contextual relationships among items. The Listwise baseline uses GRU; replacing GRU with Transformer further improves performance.
4.3 Datasets
Public MovieLens datasets are constructed with settings (K=4, N=20) and (K=10, N=50). Additionally, a real‑world Taobao dataset (K=4, N=50) is used.
4.4 Results
Across all datasets, the proposed GAttN + RLfD method achieves the highest Precision@K and Hit Ratio@K, confirming the advantage of Listwise modeling with multi‑head self‑attention. Ablation studies show that a moderate λ (e.g., 0.5) yields the best trade‑off between supervision efficiency and reinforcement sufficiency.
Overall, the combination of Graph Attention Networks and Reinforcement Learning from Demonstrations provides an effective and efficient solution to the Exact‑K recommendation problem.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
