How Non‑Autoregressive Generative Models Transform Recommendation Reranking
This article presents a solution accepted at KDD 2024 that replaces autoregressive generators with a non-autoregressive model for video recommendation reranking, detailing the challenges, the model architecture, a novel loss function, extensive offline and online experiments, and practical Q&A from the conference.
Background and Motivation
Reranking is the final stage of a recommendation pipeline, directly shaping the ordered list of videos shown to users. Unlike earlier ranking stages that score items independently, reranking must consider interactions among videos to maximize overall user utility.
Challenges of Reranking
Given n candidates from the coarse ranker, the reranker must select m videos and order them, leading to an astronomically large candidate sequence space. Exhaustive search is infeasible, and only one sequence is exposed to the user, causing severe sample sparsity. Moreover, online systems demand low latency under strict resource constraints.
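To make the scale concrete: with the production setting described later in this article (60 candidates, 6 display slots), the number of ordered sequences P(n, m) = n!/(n − m)! is already in the tens of billions. A quick check in Python:

```python
from math import perm

# Number of ways to choose and order m videos out of n candidates:
# P(n, m) = n! / (n - m)!
n, m = 60, 6  # the Kuaishou setting cited in the experiments below
print(perm(n, m))  # 36045979200, i.e. ~3.6e10 ordered sequences
```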
Existing Approaches
One-stage methods treat reranking as a retrieval task, selecting the top-k items by pointwise scores. This creates a logical inconsistency: each score is computed in the context of the pre-reranking order, but reranking then changes that order, so the context the scores assumed no longer matches the sequence actually shown, and the scores lose their meaning.
Two-stage methods adopt a generator-evaluator framework: the generator produces multiple plausible sequences, and the evaluator selects the best one. Generators are either heuristic rules or autoregressive generative models; the latter suffer from high inference latency and error accumulation, since each decoding step conditions on possibly erroneous earlier outputs.
Proposed Non‑Autoregressive Model (NAR4Rec)
We introduce a non‑autoregressive generative model that produces the entire recommendation sequence in a single inference pass, dramatically reducing latency.
Matching Model for Variable Vocabulary
To address the sparsity of recommendation sequences and the variable candidate set, we design a matching model consisting of a candidate encoder and a position encoder. Position embeddings are shared across samples, enabling efficient training on sparse data.
The inner product between candidate and position embeddings yields a probability matrix P(i, j), where P(i, j) denotes the probability of placing candidate i at position j.
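A minimal sketch of this matching step, assuming both encoders have already produced embeddings; all names are illustrative, and the softmax normalization over candidates is my assumption about how the scores become probabilities:

```python
import torch
import torch.nn.functional as F

n, m, d = 60, 6, 64                      # candidates, positions, embedding size
candidate_emb = torch.randn(n, d)        # output of the candidate encoder
position_emb = torch.randn(m, d)         # learned, shared across samples

# Inner products give an (n, m) score matrix; normalizing each column
# over the candidate axis turns column j into a distribution over
# which candidate fills position j, i.e. P(i, j).
scores = candidate_emb @ position_emb.T  # (n, m)
prob_matrix = F.softmax(scores, dim=0)   # each column sums to 1
```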
Decoding Strategies
Autoregressive decoding factorizes the sequence probability into a chain of conditionals, each position depending on the items already generated. Our non-autoregressive approach instead assumes conditional independence across positions, so every position is predicted in parallel from the candidate set alone. The price is the multi-modality problem: when several distinct sequences are plausible, each position independently favors its own marginal mode, and the assembled output can mix fragments of different sequences.
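Written out, the two factorizations differ only in what each position conditions on:

```latex
% Autoregressive: position t conditions on all earlier picks.
P(Y \mid X) = \prod_{t=1}^{m} P(y_t \mid y_1, \dots, y_{t-1}, X)

% Non-autoregressive: positions are conditionally independent given X,
% so all m distributions are computed in one parallel pass.
P(Y \mid X) = \prod_{t=1}^{m} P(y_t \mid X)
```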
Sequence Non‑Likelihood Loss
Standard maximum‑likelihood training aims to increase the probability of observed sequences, but in recommendation the exposed sequence may be sub‑optimal. We therefore propose a sequence non‑likelihood loss that simultaneously maximizes the likelihood of high‑utility sequences and minimizes the likelihood of low‑utility ones.
Given a candidate set X and a low-utility (negative) exposed sequence Y, the non-likelihood term is defined as L = -log(1 - P(Y|X)): the smaller P(Y|X) becomes, the smaller the loss, so training actively pushes probability mass away from poor sequences instead of merely withholding reward.
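A minimal sketch of the combined objective on top of the probability matrix above (function and variable names are mine, not from the paper's code):

```python
import torch

def sequence_loss(prob_matrix, pos_seq=None, neg_seq=None, eps=1e-8):
    """prob_matrix is the (n, m) matrix P(i, j); a sequence is a list of
    m candidate indices, one per position. Under the conditional-
    independence assumption, P(Y|X) is the product of the per-position
    entries."""
    loss = prob_matrix.new_zeros(())
    if pos_seq is not None:
        p_pos = torch.stack([prob_matrix[i, j]
                             for j, i in enumerate(pos_seq)]).prod()
        loss = loss - torch.log(p_pos + eps)        # raise P(Y+|X)
    if neg_seq is not None:
        p_neg = torch.stack([prob_matrix[i, j]
                             for j, i in enumerate(neg_seq)]).prod()
        loss = loss - torch.log(1.0 - p_neg + eps)  # suppress P(Y-|X)
    return loss
```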
Experiments
Offline evaluation on the Avito dataset (5 candidates, 5 generated items) uses AUC and NDCG, while on a proprietary Kuaishou dataset (60 candidates, 6 displayed items) we report Recall@6 and Recall@10. NAR4Rec achieves state‑of‑the‑art performance across all metrics and matches the inference speed of pointwise ranking models.
Online A/B testing confirms significant uplift in user engagement, with the non‑autoregressive model delivering comparable or better gains than autoregressive baselines while consuming far less inference time.
Conference Q&A Highlights
Handling variable sequence length: Position embeddings are fixed-length; to support variable lengths, one can predict the target length as an auxiliary task, similar to NLP approaches (see the sketch below).
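As an illustration of that NLP-style auxiliary task, a hypothetical length-prediction head (not part of NAR4Rec) might pool the candidate encodings and classify the target length:

```python
import torch
import torch.nn as nn

class LengthHead(nn.Module):
    """Hypothetical auxiliary head: mean-pool the candidate encodings
    and predict the target sequence length as a classification task."""
    def __init__(self, d, max_len):
        super().__init__()
        self.proj = nn.Linear(d, max_len)

    def forward(self, candidate_emb):       # (n, d)
        pooled = candidate_emb.mean(dim=0)  # (d,)
        return self.proj(pooled)            # logits over lengths 1..max_len
```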
Effect of swapping two items in the label sequence: Swapping changes the probability matrix and thus the loss, because each position's embedding interacts differently with the candidate set.
Contrastive decoding: The current implementation greedily selects the highest-scoring candidate per position, but it can be extended to beam search or top-k sampling to produce multiple candidate sequences (sketched below).
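A sketch of that greedy per-position selection; masking already-placed candidates so no video appears twice is my assumption about how duplicates are avoided:

```python
import torch

def greedy_decode(prob_matrix):
    """Fill positions left to right, taking the most probable remaining
    candidate at each position in a single non-autoregressive pass."""
    n, m = prob_matrix.shape
    probs = prob_matrix.clone()
    chosen = []
    for j in range(m):
        i = int(torch.argmax(probs[:, j]))
        chosen.append(i)
        probs[i, :] = float("-inf")  # each candidate is placed at most once
    return chosen  # m candidate indices, one per display slot
```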
Distinguishing positive and negative sequences: We use a post-hoc interaction label that aggregates pointwise item labels; sequences above a threshold are treated as positive, below as negative.