Overview of Sequential Recommendation Models
The article surveys sequential recommendation models from early non-deep approaches like FPMC, through RNN-based GRU4Rec and CNN-based Caser, to Transformer-based methods such as SASRec, BERT4Rec, TiSASRec, and recent contrastive-learning techniques, recommending SASRec or its variants for production use.
Recommendation systems are widely used on online platforms. Users' behaviors naturally follow a temporal order, and their interests evolve over time; modeling these sequential dependencies and interest drift can greatly improve recommendation quality. This article reviews the development of sequential recommendation models in chronological order, covering non‑deep methods, early deep learning approaches, Transformer‑based models, and recent contrastive‑learning attempts.
Early non‑deep learning methods
FPMC (Factorizing Personalized Markov Chains) was introduced at WWW 2010 for next‑basket recommendation. It combines matrix factorization with a first‑order Markov chain, treating the items a user interacts with at each time step as a basket and scoring a candidate next item as the sum of a user‑specific MF score and a basket‑to‑item transition score. Training uses S‑BPR, a sequential extension of BPR that ranks the observed next basket above unobserved ones.
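The additive scoring above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the four factor matrices are random stand-ins for parameters that FPMC would learn via S‑BPR, and the sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 5, 10, 8

# Hypothetical learned factor matrices (FPMC trains these with S-BPR).
V_ui = rng.normal(size=(n_users, d))   # user factors for the MF term
V_iu = rng.normal(size=(n_items, d))   # item factors for the MF term
V_il = rng.normal(size=(n_items, d))   # "next item" factors, transition term
V_li = rng.normal(size=(n_items, d))   # "last item" factors, transition term

def fpmc_score(u, last_item):
    """Score every candidate next item: MF term + Markov transition term."""
    return V_iu @ V_ui[u] + V_il @ V_li[last_item]

scores = fpmc_score(u=2, last_item=7)
top_next = int(np.argmax(scores))      # recommended next item id
```

The MF term captures the user's general taste; the transition term captures which items tend to follow the last interacted item.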
Deep learning methods before Transformers
GRU4Rec (ICLR 2016) applies recurrent neural networks to session‑based recommendation. Item embeddings of the session are fed into a GRU, whose hidden state is used to score candidate items. To exploit GPU parallelism, sessions are arranged into session‑parallel mini‑batches: each batch lane carries one session, and the corresponding hidden state is reset whenever a session ends and a new one takes its place. Training uses BPR or the proposed TOP1 loss.
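The core recurrence can be sketched with a single NumPy GRU cell: each item embedding updates a hidden state, and the final state is matched against all item embeddings to produce scores. All weights here are random placeholders for what GRU4Rec would learn, and weight tying (scoring against the input embedding table) is one common simplification, not necessarily the paper's exact output layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, d = 6, 4

E = rng.normal(scale=0.1, size=(n_items, d))        # item embedding table
Wz, Uz = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Wr, Ur = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Wh, Uh = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x):
    z = sigmoid(x @ Wz + h @ Uz)            # update gate
    r = sigmoid(x @ Wr + h @ Ur)            # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)
    return (1 - z) * h + z * h_tilde

h = np.zeros(d)                             # reset at session start
for item in [0, 3, 5]:                      # one toy session
    h = gru_step(h, E[item])

scores = E @ h                              # score all candidate next items
```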
GRU4Rec+ (CIKM 2018) improves on GRU4Rec with better negative sampling (supplementing in‑batch negatives with samples drawn from a predefined distribution), new ranking losses (TOP1‑max and BPR‑max), and weight tying between the input and output item embeddings.
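BPR‑max weights each negative's BPR term by a softmax over the negative scores, so hard negatives dominate the gradient. A minimal sketch, assuming per-example scalar scores and the paper's score-regularization term (the `reg` weight is a toy value):

```python
import numpy as np

def bpr_max_loss(pos_score, neg_scores, reg=1.0):
    """BPR-max (GRU4Rec+): softmax-weighted BPR over sampled negatives."""
    w = np.exp(neg_scores - neg_scores.max())
    w /= w.sum()                                  # softmax weights over negatives
    sig = 1.0 / (1.0 + np.exp(-(pos_score - neg_scores)))
    bpr = -np.log((w * sig).sum() + 1e-12)        # weighted BPR term
    return bpr + reg * (w * neg_scores ** 2).sum()  # regularize negative scores
```

Pushing the positive score above the negatives lowers the loss, as a quick check confirms.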
Caser (WSDM 2018) treats the embedding sequence as a 2‑D image and applies convolutional filters along the temporal and embedding dimensions. Two types of filters capture horizontal (time‑wise) and vertical (embedding‑wise) patterns, followed by max‑pooling and a fully‑connected layer.
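The two filter types can be sketched directly on an L×d embedding "image". Filters here are random stand-ins for learned parameters, and only one filter of each type is shown (Caser uses many, followed by concatenation and a fully‑connected layer).

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 5, 8                          # sequence length, embedding dim
E = rng.normal(size=(L, d))          # last L item embeddings as an L x d image

# Horizontal filter: height h over time steps, spanning the full embedding width.
h = 3
Fh = rng.normal(size=(h, d))
conv_h = np.array([np.sum(E[t:t + h] * Fh) for t in range(L - h + 1)])
feat_h = conv_h.max()                # max-pool over the time axis -> one scalar

# Vertical filter: spans all L time steps, width 1 over embedding dimensions,
# i.e. a learned weighted sum of the L embeddings.
Fv = rng.normal(size=(L, 1))
feat_v = (E * Fv).sum(axis=0)        # one value per embedding dimension
```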
Transformer‑based methods
SASRec (ICDM 2018) adopts the decoder part of the Transformer with a causal self‑attention mechanism. Input consists of recent item embeddings plus learnable positional encodings; the model predicts the next item using a binary cross‑entropy loss with one negative sample per positive.
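The causal self-attention at the heart of SASRec can be sketched in NumPy: an upper-triangular mask prevents each position from attending to future items, and the per-position BCE pairs each positive with one sampled negative. Weights and scores are random toy values, not learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 4, 8
X = rng.normal(size=(L, d))            # item embeddings + positional encodings

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Causal mask: position t may only attend to positions <= t.
scores = Q @ K.T / np.sqrt(d)
mask = np.triu(np.ones((L, L), dtype=bool), k=1)
scores[mask] = -np.inf
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)      # row-wise softmax over visible positions
H = A @ V                              # causal self-attention output

def bce(pos_logit, neg_logit):
    """Per-position loss: one positive vs. one sampled negative."""
    s = lambda x: 1.0 / (1.0 + np.exp(-x))
    return -np.log(s(pos_logit) + 1e-12) - np.log(1 - s(neg_logit) + 1e-12)
```

The first row of `A` attends only to itself, confirming no information leaks from future positions.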
BERT4Rec (CIKM 2019) formulates sequential recommendation as a masked item prediction (Cloze) task, using a bidirectional Transformer encoder. During training, a subset of items is masked and the model learns to reconstruct them; during inference, a [Mask] token is appended to the sequence to predict the next item. Training uses categorical cross‑entropy with all non‑target items as negatives.
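The Cloze setup and the inference-time trick can be sketched as follows. The `MASK` id, masking probability, and sequence are illustrative assumptions; BERT4Rec's actual masking schedule and special-token handling follow the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
MASK = 0                                   # hypothetical id reserved for [Mask]
seq = [5, 2, 7, 3, 9]                      # one user's interaction sequence

def mask_sequence(seq, p=0.2, rng=rng):
    """Training: randomly mask positions (Cloze task); labels mark the targets."""
    masked, labels = list(seq), [None] * len(seq)
    for i in range(len(seq)):
        if rng.random() < p:
            labels[i] = masked[i]          # remember the original item
            masked[i] = MASK               # replace it with [Mask]
    return masked, labels

# Inference: append [Mask] and predict the item at the final position.
infer_input = seq + [MASK]
```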
TiSASRec (WSDM 2020) incorporates relative positional encoding and explicit time‑interval embeddings into the self‑attention layer, allowing the model to capture both order and temporal gaps between interactions. Time intervals are discretized and embedded separately for Key and Value vectors.
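The interval discretization and the separate Key/Value embedding tables can be sketched as below. Scaling by the user's minimum gap and clipping to a maximum interval follow the paper's recipe; the clip threshold and dimensions are toy values, and the embedding tables are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k_max = 8, 5                         # embedding dim, max (clipped) interval

# Separate learned interval embeddings for Keys and Values (TiSASRec style).
RK = rng.normal(size=(k_max + 1, d))
RV = rng.normal(size=(k_max + 1, d))

timestamps = np.array([0, 3, 4, 10])    # interaction times for one user
# Pairwise gaps, scaled by the user's minimum positive gap, clipped to k_max.
gaps = np.abs(timestamps[:, None] - timestamps[None, :])
min_gap = gaps[gaps > 0].min()
rel = np.clip(gaps // min_gap, 0, k_max).astype(int)

key_bias = RK[rel]                      # (L, L, d): added to Keys in attention
val_bias = RV[rel]                      # (L, L, d): added to Values
```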
FMLP‑Rec (WWW 2022) replaces the self‑attention layer with a filter layer that applies a discrete Fourier transform (DFT) to the embedding sequence, learns a frequency‑domain filter, multiplies element‑wise, and transforms back via inverse DFT. This reduces complexity to O(N log N) while offering interpretability of frequency‑wise filtering.
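The filter layer is a few lines with a real FFT: transform along the time axis, multiply by a learnable complex filter, and transform back. Here the filter is random rather than learned, and the dimensions are toy values; a sanity check shows an identity (all-ones) filter reproduces the input.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 8, 4
X = rng.normal(size=(L, d))              # item embedding sequence

# Learnable complex filter, one coefficient per frequency bin and channel.
W = rng.normal(size=(L // 2 + 1, d)) + 1j * rng.normal(size=(L // 2 + 1, d))

Xf = np.fft.rfft(X, axis=0)              # DFT along the time axis: O(N log N)
Y = np.fft.irfft(Xf * W, n=L, axis=0)    # element-wise filter, then inverse DFT
```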
Contrastive‑learning methods
CL4SRec (ICDE 2022) adds a contrastive learning task to the next‑item prediction objective. For each user sequence, three augmentations (item crop, item mask, item reorder) generate positive pairs, while the augmented views of other sequences in the batch serve as negatives. The contrastive loss is a cross‑entropy over 2N−1 candidates per anchor: its positive view plus the 2(N−1) in‑batch negatives, for batch size N.
Experiments show that item‑crop consistently improves performance, while mask and reorder may have mixed effects depending on the dataset.
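The three augmentations can be sketched as simple sequence transforms. The ratios and the padding id are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)
PAD = 0   # hypothetical id standing in for masked-out items

def item_crop(seq, ratio=0.6, rng=rng):
    """Keep a random contiguous subsequence."""
    n = max(1, int(len(seq) * ratio))
    start = rng.integers(0, len(seq) - n + 1)
    return seq[start:start + n]

def item_mask(seq, ratio=0.3, rng=rng):
    """Replace a random subset of items with PAD."""
    seq = list(seq)
    idx = rng.choice(len(seq), size=max(1, int(len(seq) * ratio)), replace=False)
    for i in idx:
        seq[i] = PAD
    return seq

def item_reorder(seq, ratio=0.3, rng=rng):
    """Shuffle a random contiguous subsequence in place."""
    seq = list(seq)
    n = max(1, int(len(seq) * ratio))
    start = rng.integers(0, len(seq) - n + 1)
    sub = seq[start:start + n]
    rng.shuffle(sub)
    seq[start:start + n] = sub
    return seq
```

Two independently augmented views of the same sequence form a positive pair for the contrastive loss.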
Conclusion and practical advice
The survey starts from the matrix‑factorization based FPMC, moves through RNN‑based and CNN‑based models, and then to Transformer‑based and contrastive‑learning approaches. Transformer‑based models, especially SASRec and its variants, have become the dominant paradigm. For production systems, SASRec (or its improved versions such as FMLP‑Rec) is recommended; the authors report successful deployment of SASRec and FMLP‑Rec for video recommendation in NetEase News.
NetEase Media Technology Team