How RecGPT Leverages ChatGPT‑Style Prompt Tuning for Better Sequential Recommendation
RecGPT applies a ChatGPT‑like pre‑training and personalized prompt‑tuning paradigm to sequential recommendation, introducing a two‑stage recall mechanism that improves offline HR/NDCG metrics and yields modest online interaction gains in a real‑world short‑video platform.
Highlights
Deployed in Kuaishou's recall scenario; the paper was submitted to arXiv in April 2024.
Treats item IDs as tokens and builds a ChatGPT‑style pre‑training‑fine‑tuning pipeline.
After auto‑regressive pre‑training, it expands user sequences for personalized fine‑tuning and merges the results of two retrieval passes via a two‑stage recall.
On a small offline dataset with short sequences, achieves higher Hit Rate (HR) and NDCG than baselines and yields modest interaction gains online.
Ablation studies confirm the contributions of personalized fine‑tuning and the two‑stage recall.
Experimental details such as side‑information, negative‑sampling strategy, and exact baselines are not fully disclosed.
Problem
The objective is to apply ChatGPT‑like techniques to sequential recommendation, formally maximizing the probability of the next item given a user's historical interaction sequence.
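In symbols (the notation here is assumed rather than quoted from the paper), with a user's chronologically ordered interactions $v_1, \dots, v_n$ and model parameters $\Theta$, training maximizes

$$\prod_{t=1}^{n-1} P\left(v_{t+1} \mid v_1, \dots, v_t;\, \Theta\right).$$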
Related Work
Sequential recommendation adapts sequence models from NLP, using RNN and Transformer architectures such as GRU4Rec, SASRec, and BERT4Rec.
LLM4Rec reformulates recommendation as a natural‑language task and leverages large language models (e.g., Chat‑REC, GPT4Rec). Most existing LLM4Rec approaches rely on OpenAI APIs and cannot directly operate on raw item IDs.
Approach
Pre‑training: Auto‑regressive Generative Model
A vanilla auto‑regressive sequential recommender built on Transformer blocks predicts the next item by attending to all preceding items.
For each positive interaction, a set of negative items is sampled; binary cross‑entropy loss is used to train the model.
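A minimal PyTorch sketch of this stage is below; the class name AutoRegressiveRec, the dimensions, and the negative-sampling shapes are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class AutoRegressiveRec(nn.Module):
    """Minimal sketch of the pre-training stage. Names and sizes here
    (d_model, layer counts, helper shapes) are illustrative assumptions,
    not the paper's exact architecture."""

    def __init__(self, num_items: int, d_model: int = 64,
                 n_layers: int = 2, n_heads: int = 2, max_len: int = 50):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 1, d_model, padding_idx=0)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # Causal mask: position t attends only to items 1..t.
        L = seq.size(1)
        mask = torch.triu(
            torch.ones(L, L, dtype=torch.bool, device=seq.device), diagonal=1)
        pos = torch.arange(L, device=seq.device)
        h = self.item_emb(seq) + self.pos_emb(pos)
        return self.encoder(h, mask=mask)           # (batch, L, d_model)

def bce_loss(model: AutoRegressiveRec, seq, pos_items, neg_items):
    """One positive plus sampled negatives per step, scored by dot product
    with the shared item embeddings; padding handling omitted for brevity."""
    h = model(seq)                                              # (B, L, D)
    pos_logit = (h * model.item_emb(pos_items)).sum(-1)         # (B, L)
    neg_logit = (h.unsqueeze(2)
                 * model.item_emb(neg_items)).sum(-1)           # (B, L, K)
    bce = nn.functional.binary_cross_entropy_with_logits
    return (bce(pos_logit, torch.ones_like(pos_logit))
            + bce(neg_logit, torch.zeros_like(neg_logit)))
```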
Personalized Prompt‑tuning
To enrich each user’s sequence, a special prompt is generated during fine‑tuning.
Using the pre‑trained model, the top‑K most likely next items are inferred at each position and inserted between adjacent items, forming an expanded sequence.
The expanded sequence is then used to predict the actual next item, optimizing negative log‑likelihood.
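A minimal sketch of the expansion step, reusing the AutoRegressiveRec interface from the pre-training sketch above (expand_sequence and its exact insertion rule are assumptions, not the paper's procedure):

```python
import torch

@torch.no_grad()
def expand_sequence(model, seq: list[int], k: int = 2) -> list[int]:
    """Insert the pre-trained model's top-k predicted next items between
    adjacent observed items; the last item is left unexpanded (Stage 1 of
    the recall relies on this)."""
    expanded = []
    for t, item in enumerate(seq):
        expanded.append(item)
        if t == len(seq) - 1:
            break                                 # no expansion after the last item
        prefix = torch.tensor([seq[: t + 1]])     # items 1..t as context
        h = model(prefix)[:, -1, :]               # hidden state at step t
        scores = h @ model.item_emb.weight.T      # score every item id
        scores[:, 0] = float('-inf')              # mask the padding id
        expanded.extend(scores.topk(k).indices[0].tolist())
    return expanded
```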
Two‑Stage Recall
The original top‑K recall is split into two stages; each stage produces a user embedding, and together they retrieve N+M=K items.
Stage 1 retrieves N items from the original sequence without expanding the last item.
Stage 2 appends the N retrieved items to the sequence as extensions and retrieves an additional M items.
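Put together, the two stages might look like the following sketch, again reusing the interface of the pre-training model above (two_stage_recall and the duplicate-masking details are assumptions):

```python
import torch

@torch.no_grad()
def two_stage_recall(model, seq: list[int], n: int = 8, m: int = 2) -> list[int]:
    """Stage 1: N items from the original (unexpanded) sequence.
    Stage 2: append those N items as an extension, then retrieve M more."""
    def retrieve(s: list[int], topk: int, exclude: list[int]) -> list[int]:
        h = model(torch.tensor([s]))[:, -1, :]    # user embedding
        scores = (h @ model.item_emb.weight.T)[0]
        scores[0] = float('-inf')                 # mask the padding id
        if exclude:
            scores[torch.tensor(exclude)] = float('-inf')  # avoid duplicates
        return scores.topk(topk).indices.tolist()

    stage1 = retrieve(seq, n, exclude=list(seq))           # N items, no expansion
    stage2 = retrieve(seq + stage1, m, exclude=list(seq) + stage1)
    return stage1 + stage2                                 # N + M = K candidates
```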
Experiments
Offline Evaluation
Datasets: Amazon Sports, Beauty, Toys, and Yelp (≈30 k users, 20 k items, average sequence length 10, ≈300 k interactions).
Baselines: Popularity, SASRec, BERT4Rec, and other representative sequential recommenders.
Metrics: Hit Rate (HR) and NDCG at cutoffs 5 and 10 (a minimal implementation follows after this list).
Results: The pre‑training stage alone matches SASRec; adding personalized prompt‑tuning and the two‑stage recall yields roughly 3‑5 % improvements over SASRec in HR and NDCG.
Ablation findings:
A prompt length of 1–3 tokens provides the best trade‑off; longer prompts introduce noise and degrade performance.
With a fixed total of 10 returned items, setting N=8 or 9 (and M=2 or 1) gives the highest two‑stage recall performance.
Enabling the two‑stage recall without personalized fine‑tuning harms results, indicating the two components are complementary.
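For reference, a minimal implementation of the two metrics under the standard leave-one-out protocol (the protocol itself is an assumption; the paper's exact evaluation setup is not disclosed):

```python
import math

def hr_ndcg_at_k(ranked: list[int], target: int, k: int) -> tuple[float, float]:
    """HR@k / NDCG@k for one user under leave-one-out evaluation: with a
    single relevant item, NDCG reduces to 1 / log2(rank + 1)."""
    if target in ranked[:k]:
        rank = ranked.index(target) + 1           # 1-based position of the hit
        return 1.0, 1.0 / math.log2(rank + 1)
    return 0.0, 0.0
```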
Online Evaluation
In Kuaishou's production environment, RecGPT replaced the existing ComiRec model for five days. The online A/B test reported +0.77 % comments, +0.33 % shares, and +0.15 % video plays, while click‑through rate and watch time remained stable.
Open Questions
Side‑information beyond item IDs (e.g., timestamps, user attributes) is not described.
The exact negative‑sampling strategy used during pre‑training is omitted.
Different loss functions are employed for pre‑training (binary cross‑entropy) and fine‑tuning (negative log‑likelihood); the rationale is unclear.
It is not specified whether any model parameters are frozen during fine‑tuning.
The relationship between the two‑stage recall parameters (N, M) and the prompt length K is not analyzed.
Details of the specific Kuaishou scenario (e.g., feed recommendation, video recommendation) where the online experiment was conducted are missing.
The choice of HR@5/10 as evaluation metrics, rather than deeper recall depths (e.g., @20, @50) used by baseline systems, is not justified.
Additional public benchmarks such as Amazon Books, Taobao, or newer Kuaishou datasets (KuaiRec, KuaiRand) were not evaluated.
References
https://arxiv.org/pdf/2404.08675 (arXiv pre‑print, submitted 2024‑04‑06)
https://zhuanlan.zhihu.com/p/694700684 (Zhihu column summarizing the work)