Overview of Sequential Recommendation Models

The article surveys sequential recommendation models from early non-deep approaches like FPMC, through RNN-based GRU4Rec and CNN-based Caser, to Transformer-based methods such as SASRec, BERT4Rec, TiSASRec, and recent contrastive-learning techniques, recommending SASRec or its variants for production use.

NetEase Media Technology Team

Recommendation systems are widely used on online platforms. User behavior has a natural temporal order, and user interests evolve over time, so modeling these sequential dependencies and the resulting interest drift can greatly improve recommendation quality. This article reviews the development of sequential recommendation models in chronological order, covering non‑deep methods, early deep learning approaches, Transformer‑based models, and recent contrastive‑learning attempts.

Early non‑deep learning methods

FPMC (Factorizing Personalized Markov Chains) was introduced in WWW 2010 for next‑basket recommendation. It combines matrix factorization with a Markov chain, treating each interaction time as a basket and predicting the next item by summing a user‑specific MF score and a transition‑matrix score. Training uses an S‑BPR loss that extends BPR by considering the current item.

FPMC item transition matrix
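The scoring rule and pairwise loss described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the factor matrices are random stand-ins for learned parameters, and all names (`user_item`, `item_next`, `fpmc_scores`, `sbpr_loss`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 4, 10, 8

# Hypothetical factor matrices (random here; learned in practice).
user_item = rng.normal(size=(n_users, dim))  # user factors for the MF term
item_user = rng.normal(size=(n_items, dim))  # item factors for the MF term
item_next = rng.normal(size=(n_items, dim))  # factors of the candidate next item
item_prev = rng.normal(size=(n_items, dim))  # factors of items in the previous basket

def fpmc_scores(user, prev_basket):
    """Score every item as the next item for `user`, given the previous basket."""
    mf_term = item_user @ user_item[user]                      # user preference
    mc_term = item_next @ item_prev[prev_basket].mean(axis=0)  # averaged transitions
    return mf_term + mc_term

def sbpr_loss(scores, pos_item, neg_item):
    """Pairwise loss -log(sigmoid(score_pos - score_neg)) for one sampled pair."""
    diff = scores[pos_item] - scores[neg_item]
    return np.log1p(np.exp(-diff))

scores = fpmc_scores(user=1, prev_basket=[2, 5])
loss = sbpr_loss(scores, pos_item=3, neg_item=7)
```

Factorizing the transition term keeps the parameter count linear in the number of items instead of storing a full item-to-item transition matrix per user.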

Deep learning methods before Transformers

GRU4Rec (ICLR 2016) applies recurrent neural networks to session‑based recommendation. The items of an interaction sequence are embedded and fed into a GRU, and the model outputs scores for candidate items. To exploit GPU parallelism, sessions are processed in session‑parallel mini‑batches: each batch lane steps through one session, and when that session ends the next session takes its place and the lane's hidden state is reset. Training uses BPR or the proposed TOP1 loss.

GRU4Rec architecture
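The batching scheme can be illustrated with a small generator. This is a sketch under simplifying assumptions (every session has at least two items, there are at least `batch_size` sessions, and iteration stops as soon as a lane cannot be refilled); the function name and the `reset` flag are hypothetical. In a real model, the `reset` flags would be used to zero the corresponding rows of the GRU hidden state before the step.

```python
def session_parallel_batches(sessions, batch_size):
    """Yield (inputs, targets, reset) per step.

    reset[k] is True when lane k starts a new session, i.e. when the
    hidden state of that lane should be cleared before this step.
    """
    lanes = list(range(batch_size))  # session index currently held by each lane
    pos = [0] * batch_size           # position inside that session
    next_session = batch_size        # next unused session
    reset = [True] * batch_size
    while True:
        inputs = [sessions[s][p] for s, p in zip(lanes, pos)]
        targets = [sessions[s][p + 1] for s, p in zip(lanes, pos)]
        yield inputs, targets, reset
        reset = [False] * batch_size
        for k in range(batch_size):
            pos[k] += 1
            if pos[k] + 1 >= len(sessions[lanes[k]]):  # no target left in this session
                if next_session >= len(sessions):
                    return                             # nothing to refill the lane with
                lanes[k], pos[k] = next_session, 0
                next_session += 1
                reset[k] = True

sessions = [[1, 2, 3], [4, 5], [6, 7, 8], [9, 10]]
batches = list(session_parallel_batches(sessions, batch_size=2))
# batches[1] is ([2, 6], [3, 7], [False, True]): lane 1 switched to a new session
```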

GRU4Rec+ (CIKM 2018) improves negative sampling (adding samples from a predefined distribution) and loss functions (TOP1‑max, BPR‑max) and adopts weight‑tying between input and output embeddings.

Effect of loss functions and extra negatives in GRU4Rec+
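The two "max" losses can be sketched as follows. The formulas follow the commonly cited form of these losses (negatives weighted by a softmax over their scores, so the hardest negatives dominate); treat them as an assumption to check against the paper. The function names and the `reg` parameter are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bpr_max(pos_score, neg_scores, reg=0.0):
    """-log(sum_j w_j * sigmoid(r_pos - r_j)) plus a score penalty on negatives."""
    w = softmax(neg_scores)  # high-scoring negatives get more weight
    ranking = -np.log(np.sum(w * sigmoid(pos_score - neg_scores)))
    penalty = reg * np.sum(w * neg_scores ** 2)
    return ranking + penalty

def top1_max(pos_score, neg_scores):
    """sum_j w_j * (sigmoid(r_j - r_pos) + sigmoid(r_j ** 2))."""
    w = softmax(neg_scores)
    return np.sum(w * (sigmoid(neg_scores - pos_score) + sigmoid(neg_scores ** 2)))

negatives = np.array([0.1, -0.3, 2.0])
loss_when_ranked_high = bpr_max(5.0, negatives)
loss_when_ranked_low = bpr_max(-5.0, negatives)
```

Because of the softmax weighting, adding many easy negatives does not dilute the gradient, which is what makes the extra sampled negatives useful.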

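Caser (WSDM 2018) treats the embedding matrix of the most recent items as a 2‑D image and applies convolutional filters to it. Horizontal filters span the full embedding width and slide along the time axis, followed by max‑pooling over time, to capture patterns across consecutive items; vertical filters span the full sequence length and slide along the embedding axis, producing a weighted sum of the item embeddings. The outputs are concatenated and passed through a fully‑connected layer.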

Caser architecture
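The two filter types can be written out directly in NumPy. This is a simplified sketch with random weights: the filter counts, the ReLU activation, and the names `horizontal_conv`, `vertical_conv`, and `W_fc` are illustrative assumptions, and the user embedding that the full model also uses is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 5, 4
E = rng.normal(size=(seq_len, dim))  # one row per recent item: the "image"

def horizontal_conv(E, filt):
    """Slide an (h, dim) filter over time, apply ReLU, then max-pool over time."""
    h = filt.shape[0]
    acts = [max(0.0, float(np.sum(E[t:t + h] * filt)))
            for t in range(E.shape[0] - h + 1)]
    return max(acts)

def vertical_conv(E, weights):
    """A (seq_len,) filter: a weighted sum of the item embeddings, shape (dim,)."""
    return weights @ E

h_filters = [rng.normal(size=(h, dim)) for h in (1, 2, 3)]  # different heights
v_filters = [rng.normal(size=seq_len) for _ in range(2)]

features = np.concatenate(
    [np.array([horizontal_conv(E, f) for f in h_filters])]
    + [vertical_conv(E, w) for w in v_filters]
)
W_fc = rng.normal(size=(features.size, dim))
z = np.maximum(0.0, features @ W_fc)  # sequence representation from the FC layer
```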

Transformer‑based methods

SASRec (ICDM 2018) adopts the decoder part of the Transformer with a causal self‑attention mechanism. Input consists of recent item embeddings plus learnable positional encodings; the model predicts the next item using a binary cross‑entropy loss with one negative sample per positive.

SASRec architecture
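A single-head version of the causal attention step and the binary cross-entropy objective might look like the sketch below. It assumes random weights and hand-picked negatives, and it omits the residual connections, layer normalization, feed-forward sublayer, and padding handling of the full model; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, max_len, d = 20, 5, 8
item_emb = rng.normal(scale=0.1, size=(n_items + 1, d))  # index 0 reserved for padding
pos_emb = rng.normal(scale=0.1, size=(max_len, d))       # learnable positions
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))

def causal_self_attention(x):
    """Single-head attention in which step t only attends to steps <= t."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    logits = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones_like(logits, dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)            # hide future items
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, weights

seq = np.array([3, 7, 1, 12, 5])   # input items
pos = np.array([7, 1, 12, 5, 9])   # the true next item at each step
neg = np.array([2, 4, 6, 8, 10])   # one negative per step (fixed here for illustration)

x = item_emb[seq] + pos_emb
h, attn = causal_self_attention(x)
pos_logit = np.sum(h * item_emb[pos], axis=1)
neg_logit = np.sum(h * item_emb[neg], axis=1)
# -log(sigmoid(pos)) - log(1 - sigmoid(neg)), averaged over steps
loss = np.mean(np.log1p(np.exp(-pos_logit)) + np.log1p(np.exp(neg_logit)))
```

The causal mask is what lets every position in the sequence serve as a training example at once without leaking the answer.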

BERT4Rec (CIKM 2019) formulates sequential recommendation as a masked item prediction (Cloze) task, using a bidirectional Transformer encoder. During training, a subset of items is masked and the model learns to reconstruct them; during inference, a [Mask] token is appended to the sequence to predict the next item. Training uses categorical cross‑entropy with all non‑target items as negatives.

BERT4Rec architecture
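The difference between training-time and inference-time inputs can be shown with two small helpers. This is a sketch of the data preparation only: the mask id `MASK = 0`, the masking probability, and the function names are assumptions, and the encoder itself is not shown.

```python
import random

MASK = 0  # hypothetical id reserved for the mask token

def mask_for_training(seq, mask_prob, rng):
    """Randomly replace items with MASK; labels hold the hidden items, else None."""
    inputs, labels = [], []
    for item in seq:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(item)   # the model must reconstruct this item
        else:
            inputs.append(item)
            labels.append(None)   # position does not contribute to the loss
    return inputs, labels

def prepare_for_inference(seq, max_len):
    """Append MASK at the end and keep the most recent max_len positions."""
    return (seq + [MASK])[-max_len:]

history = [11, 42, 7, 19, 3]
train_inputs, train_labels = mask_for_training(history, 0.3, random.Random(0))
infer_inputs = prepare_for_inference(history, max_len=4)
# infer_inputs == [7, 19, 3, 0]; the output at the last position ranks next items
```

Since random masking rarely hides only the final item, training and inference inputs differ in distribution, which is one reason the masking setup deserves attention when reproducing results.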

TiSASRec (WSDM 2020) incorporates relative positional encoding and explicit time‑interval embeddings into the self‑attention layer, allowing the model to capture both order and temporal gaps between interactions. Time intervals are discretized and embedded separately for Key and Value vectors.

TiSASRec time‑interval modeling
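One way to picture the time-interval modeling is the sketch below: pairwise gaps are scaled by the smallest non-zero gap in the sequence, clipped to a maximum, and used to look up separate key and value embeddings inside attention. The scaling rule, the random weights, and all names are assumptions for illustration; positional embeddings and the rest of the Transformer block are omitted.

```python
import numpy as np

def relation_matrix(timestamps, max_interval):
    """Pairwise time gaps, scaled by the smallest non-zero gap and clipped."""
    t = np.asarray(timestamps, dtype=np.int64)
    gaps = np.abs(t[:, None] - t[None, :])
    nonzero = gaps[gaps > 0]
    unit = nonzero.min() if nonzero.size else 1
    return np.minimum(gaps // unit, max_interval)

rng = np.random.default_rng(0)
n, d, max_interval = 4, 8, 5
x = rng.normal(size=(n, d))  # item embeddings of one sequence
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
interval_key_emb = rng.normal(scale=0.1, size=(max_interval + 1, d))
interval_val_emb = rng.normal(scale=0.1, size=(max_interval + 1, d))

R = relation_matrix([0, 10, 30, 1000], max_interval)
q, key, val = x @ Wq, x @ Wk, x @ Wv
keys = key[None, :, :] + interval_key_emb[R]  # (n, n, d): key j as seen from step i
vals = val[None, :, :] + interval_val_emb[R]  # (n, n, d)

logits = np.einsum("id,ijd->ij", q, keys) / np.sqrt(d)
logits = np.where(np.triu(np.ones((n, n), dtype=bool), k=1), -np.inf, logits)
weights = np.exp(logits - logits.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
out = np.einsum("ij,ijd->id", weights, vals)
```

Clipping bounds the size of the interval embedding tables and treats all very long gaps alike.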

FMLP‑Rec (WWW 2022) replaces the self‑attention layer with a filter layer that applies a discrete Fourier transform (DFT) to the embedding sequence, learns a frequency‑domain filter, multiplies element‑wise, and transforms back via inverse DFT. This reduces complexity to O(N log N) while offering interpretability of frequency‑wise filtering.

FMLP‑Rec architecture
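The filter layer itself is only a few lines with NumPy's FFT routines. This is a minimal sketch: the filter `W` is random rather than learned, the names are hypothetical, and the dropout, residual connection, and layer normalization that surround the layer in the full model are left out.

```python
import numpy as np

def filter_layer(X, W):
    """Filter a (seq_len, dim) embedding sequence in the frequency domain.

    W has shape (seq_len // 2 + 1, dim): one complex weight per frequency
    bin and embedding dimension.
    """
    n = X.shape[0]
    Xf = np.fft.rfft(X, axis=0)               # DFT along the time axis
    return np.fft.irfft(Xf * W, n=n, axis=0)  # element-wise filter, inverse DFT

rng = np.random.default_rng(0)
seq_len, dim = 6, 4
X = rng.normal(size=(seq_len, dim))
n_bins = seq_len // 2 + 1

W = rng.normal(size=(n_bins, dim)) + 1j * rng.normal(size=(n_bins, dim))
Y = filter_layer(X, W)

# A filter that keeps only the zero-frequency bin acts as a pure low-pass
# filter: every step is replaced by the per-dimension mean of the sequence.
low_pass = np.zeros((n_bins, dim), dtype=complex)
low_pass[0] = 1.0
smoothed = filter_layer(X, low_pass)
```

The low-pass case illustrates the interpretability claim: the learned weights show directly which frequencies of the behavior sequence are kept and which are suppressed.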

Contrastive‑learning methods

CL4SRec (SIGIR 2021) adds a contrastive learning task to the next‑item prediction objective. For each user sequence, two views produced by three augmentations (item crop, item mask, item reorder) form a positive pair, while the augmented views of the other sequences in the batch serve as negatives. For a batch of N sequences, the contrastive loss is a cross‑entropy over the 2N‑1 other views, exactly one of which is the positive.

CL4SRec joint training
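The three augmentations and the in-batch contrastive loss described above can be sketched as follows. The ratios, the mask id, the dot-product similarity, the temperature, and all function names are assumptions made for illustration; the sequence encoder that produces the representations is not shown.

```python
import random
import numpy as np

MASK = 0  # hypothetical id reserved for masked items

def item_crop(seq, ratio, rng):
    """Keep a random contiguous sub-sequence."""
    length = max(1, int(len(seq) * ratio))
    start = rng.randint(0, len(seq) - length)
    return seq[start:start + length]

def item_mask(seq, ratio, rng):
    """Replace a random subset of items with MASK."""
    hidden = set(rng.sample(range(len(seq)), int(len(seq) * ratio)))
    return [MASK if i in hidden else item for i, item in enumerate(seq)]

def item_reorder(seq, ratio, rng):
    """Shuffle a random contiguous window and leave the rest in place."""
    length = int(len(seq) * ratio)
    start = rng.randint(0, len(seq) - length)
    window = seq[start:start + length]
    rng.shuffle(window)
    return seq[:start] + window + seq[start + length:]

def contrastive_loss(z1, z2, temperature=1.0):
    """z1[i] and z2[i] are the two views of sequence i; all other views are negatives."""
    n = len(z1)
    z = np.concatenate([z1, z2])      # 2N views
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)    # exclude self: 2N-1 candidates per view
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))
    return -np.mean(log_prob[np.arange(2 * n), targets])

rng = random.Random(0)
seq = [5, 8, 2, 9, 4, 7, 1, 3, 6, 10]
view_a = item_crop(seq, 0.6, rng)
view_b = item_reorder(seq, 0.5, rng)
```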

Experiments show that item‑crop consistently improves performance, while mask and reorder may have mixed effects depending on the dataset.

Conclusion and practical advice

The survey starts from the matrix‑factorization‑based FPMC, moves through RNN‑based and CNN‑based models, and then covers Transformer‑based and contrastive‑learning approaches. Transformer‑based models, especially SASRec and its variants, have become the dominant paradigm. For production systems, SASRec (or an improved version such as FMLP‑Rec) is recommended; the authors report successful deployment of SASRec and FMLP‑Rec for video recommendation in NetEase News.
