How Non‑Autoregressive Generative Models Transform Reranking in Recommendation Systems

This article presents the Kuaishou team's non‑autoregressive generative approach to recommendation reranking, detailing its architecture, loss design, experimental validation on the Avito and Kuaishou datasets, and the online A/B results that earned the work acceptance at KDD 2024.


The team at Kuaishou applied a non‑autoregressive generative model to the reranking stage of video recommendation, achieving notable gains and earning acceptance to KDD 2024.

Reranking Challenges

Reranking is the final step of the recommendation pipeline, determining the ordered list shown to users. It must consider interactions among videos and maximize overall user benefit, but the combinatorial number of possible sequences makes exhaustive search infeasible, and only one sequence is exposed per impression, leading to severe sample sparsity under strict latency constraints.

Existing Approaches

One‑stage methods treat reranking as a retrieval task: each candidate is scored in the context of the initial list, and the top‑k are selected by score. This creates a logical inconsistency, because reordering by those scores changes the very arrangement under which the scores were computed, so the context‑dependent scores no longer describe the sequence actually shown.

Two‑stage generator‑evaluator frameworks split reranking into a generator that proposes multiple plausible sequences and an evaluator that selects the best one. Generators are either heuristic or autoregressive generative models.

Non‑Autoregressive Model (NAR4Rec)

To overcome the latency of autoregressive inference, the authors propose a non‑autoregressive model that produces the entire recommendation sequence in a single forward pass, dramatically reducing inference time.

The matching model consists of a candidate encoder and a position encoder. The candidate encoder learns a representation for each video, while the position encoder captures position‑specific information through position embeddings shared across requests; this sharing addresses the fact that, unlike a fixed NLP vocabulary, the candidate set changes with every request and exposure data is sparse.

For a candidate video i and a position j, the inner product of their embeddings yields a probability matrix; the entry at (i, j) represents the probability of placing video i at position j.
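A minimal PyTorch sketch of that scoring step; the softmax over candidates (so that each position's column forms a distribution) and all names here are illustrative assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def placement_probabilities(cand_emb: torch.Tensor, pos_emb: torch.Tensor) -> torch.Tensor:
    """Score every (candidate, position) pair by inner product.

    cand_emb: (num_candidates, dim) output of the candidate encoder
    pos_emb:  (num_positions, dim)  shared position embeddings
    Returns a (num_candidates, num_positions) matrix whose (i, j) entry is
    the probability of placing video i at position j.
    """
    logits = cand_emb @ pos_emb.T        # raw compatibility scores
    return F.softmax(logits, dim=0)      # normalize over candidates, per position

# Toy usage: 20 candidate videos competing for 6 slots in a 64-d space.
probs = placement_probabilities(torch.randn(20, 64), torch.randn(6, 64))
print(probs.shape)  # torch.Size([20, 6]); each column sums to 1
```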

Decoding Comparison

Autoregressive models factorize the sequence probability as a product of conditional probabilities, while the non‑autoregressive approach assumes conditional independence across positions. This independence assumption can cause a multi‑modal distribution problem, producing implausible sequences when high‑probability tokens from different modes are combined.
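In symbols, with x = (x_1, …, x_L) the recommended sequence (these are the standard factorizations, not notation taken from the paper):

```latex
% Autoregressive: each position conditions on everything chosen before it,
% forcing L sequential decoding steps.
P_{\text{AR}}(x) = \prod_{t=1}^{L} P(x_t \mid x_{<t})

% Non-autoregressive: positions are assumed conditionally independent,
% so all L choices can be made in one parallel forward pass.
P_{\text{NAR}}(x) = \prod_{t=1}^{L} P(x_t)
```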

To mitigate this, a regularization term based on the maximum embedding similarity between the currently selected video and previously chosen videos approximates the conditional transition probability.
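A hedged sketch of greedy decoding with such a term; the additive blend and the weight alpha are assumptions for illustration, not the paper's exact combination rule:

```python
import torch
import torch.nn.functional as F

def decode_with_similarity_term(probs, cand_emb, num_slots, alpha=0.5):
    """Greedy position-by-position decoding with a similarity-based term.

    probs:    (num_candidates, num_positions) placement probabilities
    cand_emb: (num_candidates, dim) candidate embeddings
    alpha:    assumed weight blending the similarity proxy into the score
    """
    emb = F.normalize(cand_emb, dim=-1)  # cosine-similarity space
    chosen: list[int] = []
    for j in range(num_slots):
        scores = probs[:, j].clone()
        if chosen:
            # Max similarity to any previously chosen video stands in for the
            # conditional transition probability the NAR model cannot express.
            sim = (emb @ emb[chosen].T).max(dim=1).values
            scores = scores + alpha * sim
            scores[chosen] = float("-inf")  # never repeat an item
        chosen.append(int(scores.argmax()))
    return chosen
```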

Sequence Unlikelihood Loss

Standard maximum‑likelihood training maximizes the probability of the exposed sequence, which may include low‑utility items. The authors introduce a sequence unlikelihood loss that simultaneously maximizes the likelihood of high‑utility sequences and minimizes the likelihood of low‑utility ones, better aligning the training objective with the reranking goal of maximizing overall user benefit.
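A minimal sketch of the combined objective, assuming each training sequence carries a binary utility label and a sequence‑level model log‑probability; the exact weighting in NAR4Rec may differ:

```python
import torch

def sequence_unlikelihood_loss(log_probs, is_positive):
    """Likelihood on high-utility sequences, unlikelihood on low-utility ones.

    log_probs:   (batch,) log-probability of each exposed sequence under the model
    is_positive: (batch,) bool, True if the sequence's utility exceeds the threshold
    """
    seq_prob = log_probs.exp().clamp(max=1.0 - 1e-6)     # avoid log(0) below
    pos_loss = -log_probs[is_positive]                   # push P(good sequence) up
    neg_loss = -torch.log1p(-seq_prob[~is_positive])     # push P(bad sequence) down
    return torch.cat([pos_loss, neg_loss]).mean()

# Toy usage: three sequences, the first two labeled positive.
lp = torch.log(torch.tensor([0.30, 0.20, 0.25]))
mask = torch.tensor([True, True, False])
print(sequence_unlikelihood_loss(lp, mask))
```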

Experiments

Offline experiments on the Avito dataset (using AUC and NDCG) and on a Kuaishou offline dataset (using recall@6 and recall@10) show that NAR4Rec outperforms existing baselines across all metrics while achieving training and inference speeds comparable to point‑wise ranking models.

Online A/B testing on Kuaishou’s production system demonstrates a significant uplift in user engagement metrics.

Q&A from the KDD 2024 Session

Handling variable sequence length: The current deployment uses a fixed length; variable lengths could be addressed by adding a length‑prediction task, as explored in NLP literature.

Effect of swapping two items: Swapping changes the probability matrix and thus the loss, but because the position encoder does not encode explicit order information, the raw input order does not affect the final prediction.

Contrastive decoding: The method generates a single greedy sequence, but it can be extended with beam search or top‑k sampling to produce multiple candidates.

Defining positive vs. negative sequences: Positive sequences are those whose summed item‑level interaction labels exceed a threshold; sequences below the threshold are treated as negative.
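A minimal illustration of that labeling rule; the threshold value and the form of the per‑item labels are assumptions:

```python
def label_sequence(item_labels, threshold=1):
    """Label one exposed list by summing its item-level interaction labels.

    item_labels: per-item feedback signals (e.g., clicks) for the list
    threshold:   assumed cutoff separating high- from low-utility sequences
    """
    return sum(item_labels) > threshold  # True -> positive (high-utility) sequence

# Example: a 6-slot list with two positive interactions and threshold 1.
print(label_sequence([1, 0, 0, 1, 0, 0]))  # True
```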
