Attention Mechanisms in Deep Learning Recommendation Models: A Survey

This article surveys how attention mechanisms are used in deep learning recommendation systems. It reviews AFM, DIN, DIEN, DSIN, Behavior Sequence Transformer, Deep Spatio‑Temporal Neural Networks, and ATRank, and discusses their architectures, attention types, advantages, and limitations.

DataFunTalk

Attention mechanisms have been widely adopted in image processing, natural language processing, reinforcement learning, and recommendation systems. In recommendation models they increase expressiveness by assigning learned importance weights to inputs such as feature interactions or historical user behaviors.

1. AFM: Attentional Factorization Machines

AFM builds on Factorization Machines by inserting a small attention network that learns a weight for each pair‑wise feature interaction, turning the original unweighted sum of second‑order interactions into a weighted sum.

AFM's output is obtained by applying the learned attention weights to the interaction terms. Because the weighted interactions feed the prediction directly, with no deeper hidden layers on top, AFM has less capacity than more expressive DNN‑based models.
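The weighted sum described above can be sketched in NumPy. This is a minimal illustration, not the reference implementation: it assumes a one-hidden-layer ReLU attention network, and the parameter names `W`, `b`, `h`, and `p` are chosen here for illustration.

```python
import numpy as np

def afm_interaction(emb, W, b, h, p):
    """Attention-weighted pairwise interaction term of an AFM-style model.

    emb: (m, k) embeddings of the m active features (already scaled by feature value)
    W: (t, k), b: (t,), h: (t,)  attention-network parameters (illustrative names)
    p: (k,)  projects the pooled interaction vector onto a scalar
    """
    m = emb.shape[0]
    rows, cols = np.triu_indices(m, k=1)        # all feature pairs with i < j
    pairs = emb[rows] * emb[cols]               # (n_pairs, k) element-wise products
    hidden = np.maximum(pairs @ W.T + b, 0.0)   # ReLU layer of the attention network
    scores = hidden @ h                         # (n_pairs,) unnormalized scores
    exp = np.exp(scores - scores.max())         # numerically stable softmax
    weights = exp / exp.sum()
    pooled = weights @ pairs                    # (k,) weighted sum of interactions
    return float(pooled @ p), weights

rng = np.random.default_rng(0)
m, k, t = 4, 8, 5
emb = rng.normal(size=(m, k))
W, b = rng.normal(size=(t, k)), np.zeros(t)
h, p = rng.normal(size=t), np.ones(k)
term, weights = afm_interaction(emb, W, b, h, p)
```

With `h` set to zeros every pair receives the same weight, and the term reduces to the plain FM pairwise interaction scaled by the number of pairs, which is a convenient sanity check.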

2. DIN: Deep Interest Network

DIN introduces an attention module that evaluates the relevance of each historical user behavior to the target item, allowing the model to focus on important behaviors while ignoring irrelevant ones.

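The attention weight for a behavior is computed by feeding the historical behavior embedding, the target item embedding, and their difference into an MLP that outputs an importance score. Unlike standard soft attention, the DIN paper does not apply softmax normalization to these scores, so that the intensity of the user's interest is preserved.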
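A minimal NumPy sketch of such an attention unit follows. It scores each historical behavior embedding against the target item with a two-layer MLP over the concatenation of the behavior, the target, and their difference. The ReLU activation, the layer sizes, and all parameter names are assumptions for illustration, and normalization is left as an optional flag.

```python
import numpy as np

def din_attention(behaviors, target, W1, b1, w2, b2, normalize=False):
    """Score each historical behavior against the target item.

    behaviors: (n, d) behavior embeddings, target: (d,) target item embedding
    W1: (3d, hdim), b1: (hdim,), w2: (hdim,), b2: scalar  (illustrative MLP parameters)
    """
    tgt = np.broadcast_to(target, behaviors.shape)
    # Concatenate behavior, target, and their difference for every behavior
    feats = np.concatenate([behaviors, tgt, behaviors - tgt], axis=1)  # (n, 3d)
    hidden = np.maximum(feats @ W1 + b1, 0.0)                          # (n, hdim)
    scores = hidden @ w2 + b2                                          # (n,)
    if normalize:
        exp = np.exp(scores - scores.max())
        scores = exp / exp.sum()
    interest = scores @ behaviors   # (d,) weighted sum pooling of behaviors
    return interest, scores

rng = np.random.default_rng(0)
n, d, hdim = 5, 4, 8
behaviors, target = rng.normal(size=(n, d)), rng.normal(size=d)
W1, b1 = rng.normal(size=(3 * d, hdim)), np.zeros(hdim)
w2, b2 = rng.normal(size=hdim), 0.0
interest, scores = din_attention(behaviors, target, W1, b1, w2, b2)
```

The pooled `interest` vector changes with the target item, which is what lets the same behavior history produce different user representations for different candidates.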

3. DIEN: Deep Interest Evolution Network

DIEN first extracts user interests from the behavior sequence using a GRU, then applies a GRU with an attentional update gate (AUGRU) to emphasize the interests that are most related to the target item.

The attention score is computed as a bilinear similarity u_iᵀWv between each historical behavior representation u_i and the target item embedding v, followed by a softmax over the sequence to obtain normalized weights.
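The bilinear score and one way of injecting it into a GRU step can be sketched as below. The function names are illustrative, and `augru_update` assumes the update gate and candidate state have already been computed by an ordinary GRU cell; it only shows how an attention weight can scale the update gate.

```python
import numpy as np

def bilinear_attention(states, target, W):
    """Softmax over u_i^T W v for each row u_i of `states`.

    states: (n, d_u), target: (d_v,), W: (d_u, d_v)
    """
    scores = states @ W @ target            # (n,)
    exp = np.exp(scores - scores.max())     # numerically stable softmax
    return exp / exp.sum()

def augru_update(h_prev, h_cand, update_gate, a):
    """Blend previous and candidate hidden states with an attention-scaled update gate."""
    scaled_gate = a * update_gate           # low attention means little change to the state
    return (1.0 - scaled_gate) * h_prev + scaled_gate * h_cand

states = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
target = np.array([2.0, 0.0])
weights = bilinear_attention(states, target, np.eye(2))
```

In this toy input the first state points in the same direction as the target, so it receives the largest weight, while a state with attention weight 0 would leave the hidden state unchanged in `augru_update`.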

4. DSIN: Deep Session Interest Network

DSIN processes each user session with a Transformer to obtain a session representation, then feeds the sequence of session vectors into a bidirectional LSTM. Two attention layers are used: one self‑attention on the latent semantic space and another attention that incorporates the target item.

The final representation concatenates user features, target item features, session‑interest vectors, and context‑aware session vectors before passing them to a fully connected layer.

5. Behavior Sequence Transformer

This model embeds the user’s behavior sequence and then applies a standard Transformer layer to capture long‑range dependencies before prediction.
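The core of a Transformer layer is scaled dot-product self-attention over the embedded sequence. The sketch below shows a single head only, omitting positional information, the feed-forward sublayer, residual connections, and layer normalization; the projection matrices `Wq`, `Wk`, and `Wv` are illustrative names.

```python
import numpy as np

def softmax(x, axis=-1):
    exp = np.exp(x - x.max(axis=axis, keepdims=True))
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (n, d) embedded behavior sequence; Wq, Wk: (d, d_k); Wv: (d, d_v)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n) pairwise compatibilities
    A = softmax(scores, axis=-1)              # each row is a distribution over positions
    return A @ V, A

rng = np.random.default_rng(0)
n, d, d_k = 6, 8, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d_k)) for _ in range(3))
out, A = self_attention(X, Wq, Wk, Wv)
```

Because every position attends to every other position in one step, dependencies between distant behaviors do not have to pass through a recurrent state.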

6. Deep Spatio‑Temporal Neural Networks

The model takes target ad features, contextual ad features, clicked ad features, and non‑clicked exposure features, embeds them, and then applies two types of attention: a self‑attention over context items and an interaction‑based attention that also incorporates the target ad embedding.

7. ATRank: An Attention‑Based User Behavior Modeling Framework

ATRank divides the model into raw feature space, behavior embedding space, latent semantic space, behavior interaction layers, and downstream application layers. After projecting behavior vectors into multiple semantic spaces, a self‑attention mechanism aggregates them.
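The idea of projecting behaviors into several semantic spaces and applying self-attention in each can be sketched as follows. This is a simplified illustration: it assumes plain scaled dot-product scoring inside each space and concatenation of the per-space outputs, and the function and argument names are hypothetical.

```python
import numpy as np

def multi_space_self_attention(X, projections):
    """Self-attention applied independently in several projected spaces.

    X: (n, d) behavior vectors; projections: list of (d, d_h) matrices, one per space
    Returns an (n, len(projections) * d_h) array of concatenated per-space outputs.
    """
    outputs = []
    for P in projections:
        S = X @ P                                   # (n, d_h) view of behaviors in this space
        scores = S @ S.T / np.sqrt(S.shape[1])      # (n, n) behavior-to-behavior scores
        exp = np.exp(scores - scores.max(axis=1, keepdims=True))
        A = exp / exp.sum(axis=1, keepdims=True)    # row-wise softmax
        outputs.append(A @ S)                       # aggregate behaviors within this space
    return np.concatenate(outputs, axis=1)

rng = np.random.default_rng(0)
n, d, d_h, n_spaces = 5, 8, 3, 4
X = rng.normal(size=(n, d))
projections = [rng.normal(size=(d, d_h)) for _ in range(n_spaces)]
Z = multi_space_self_attention(X, projections)
```

Each space can weight behavior pairs differently, so the concatenated output carries several complementary views of how the behaviors relate to one another.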

Overall, attention in recommendation models falls into two groups: self‑attention (e.g., ATRank and the Transformer layers in Behavior Sequence Transformer and DSIN) and traditional soft attention against the target item (e.g., DIN, DIEN, DSIN). The latter typically computes a u_iᵀWv similarity between each historical behavior embedding u_i and the target item embedding v. DIN is the exception: it feeds u_i, v, and their difference into an MLP to obtain attention weights.
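For reference, the scoring forms mentioned in this survey can be written side by side. The symbols u_i, v, and W follow the text above; Q, K, V, and d_k are the usual Transformer notation and are an assumption here, as is the concatenation notation [· ; ·].

```latex
% Bilinear soft attention against the target item
a_i = \frac{\exp\left(u_i^{\top} W v\right)}{\sum_{j} \exp\left(u_j^{\top} W v\right)}

% MLP-based scoring on behavior, target, and their difference (DIN-style)
a_i = \mathrm{MLP}\left(\left[\, u_i \,;\, v \,;\, u_i - v \,\right]\right)

% Scaled dot-product self-attention over the behavior sequence
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```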
