Demystifying Attention: A Clear Guide to Its History, Types, and Why It Works

This article systematically reviews the evolution of attention mechanisms—from early additive and multiplicative forms to self‑attention and multi‑head variants—explaining their core three‑step framework, key differences, and why they have become essential across NLP, vision, and broader AI applications.

Alibaba Cloud Developer

Jul 9, 2019

Introduction

Attention for neural machine translation was first proposed by Bahdanau et al. (arXiv preprint 2014, published at ICLR 2015) and quickly became a standard component in both natural language processing (NLP) and computer vision. It lets a model focus on the most relevant parts of its input, which improves representation learning. The emergence of self‑attention in 2017 ushered in the transformer era and substantially improved results on many NLP tasks.

Two‑Part Overview

The article is organized into two main parts: (1) an introductory survey of the history of attention and a unified framework that answers the fundamental question "What is attention?"; (2) a detailed enumeration of the many attention variants, illustrating their relationships and differences.

History of Attention

Key milestones include:

2014‑2015: Bahdanau et al. introduced additive (Bahdanau) attention for neural machine translation (arXiv preprint 2014, published at ICLR 2015).

2015: Luong et al. presented multiplicative (Luong) attention.

2015: "Show, Attend and Tell" introduced hard/soft visual attention for image captioning.

2016‑2017: Various extensions such as hierarchical attention, attention‑over‑attention, and multi‑step attention.

2017: Vaswani et al. released "Attention Is All You Need", proposing the transformer architecture with self‑attention and multi‑head attention.

What Is Attention?

Attention can be abstracted into three functional components:

Score function – measures the similarity between the context (or query) vector and each input vector.

Alignment function – normalizes scores (typically with softmax) to produce attention weights.

Context‑vector generation – aggregates input vectors weighted by the attention weights.

Two high‑level modeling perspectives are presented:

Alignment‑Based

In this view, the model receives a context c and an input y and produces an output z. The three steps above are applied directly to compute z.

Using Bahdanau attention as an example, the three steps are:

Score function – computes similarity between the context and each input token.

Alignment function – applies softmax to obtain attention weights.

Generate context vector – produces a weighted sum of input vectors.
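The three steps above can be sketched in a few lines of dependency‑free Python. This is a minimal illustration, not the authors' implementation: the names W_c, W_y and v, the toy dimensions, and the hand‑picked parameter values are all assumptions made for the example, and a real model would learn these parameters.

```python
import math

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(c, ys, W_c, W_y, v):
    """Alignment-based attention: context c attends over the inputs ys."""
    proj_c = matvec(W_c, c)
    # Step 1, score function: e_i = v . tanh(W_c c + W_y y_i)
    scores = []
    for y in ys:
        proj_y = matvec(W_y, y)
        hidden = [math.tanh(a + b) for a, b in zip(proj_c, proj_y)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Step 2, alignment function: softmax turns scores into weights
    weights = softmax(scores)
    # Step 3, context vector: weighted sum of the input vectors
    dim = len(ys[0])
    z = [sum(w * y[d] for w, y in zip(weights, ys)) for d in range(dim)]
    return z, weights

# Toy, hand-picked parameters (hypothetical; a real model learns them).
W_c = [[1.0, 0.0], [0.0, 1.0]]
W_y = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
c = [1.0, 0.0]
ys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

z, weights = additive_attention(c, ys, W_c, W_y, v)
```

The weights always sum to 1, so z is a convex combination of the input vectors.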

Memory‑Based (QKV Model)

Here the input is split into a query q and a memory consisting of key‑value pairs (k, v). This formulation underlies the transformer:

Address memory (score function) – score every key in memory against the query.

Normalize (alignment function) – apply softmax to the scores.

Read content (generate context vector) – combine the corresponding values using the attention weights.
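The memory‑based view maps naturally onto a lookup over key‑value pairs. The sketch below assumes a plain dot product as the score function and uses made‑up keys and values; the function name memory_attention is hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def memory_attention(q, keys, values):
    """Soft lookup: query q reads from a memory of (key, value) pairs."""
    # Address memory: score every key against the query
    scores = [dot(q, k) for k in keys]
    # Normalize: softmax over the scores
    weights = softmax(scores)
    # Read content: weighted sum of the values
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights

# Toy memory with three entries (illustrative values only).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
q = [0.0, 5.0]

out, weights = memory_attention(q, keys, values)
```

Because the query matches the second and third keys equally well, the result is close to the average of their values, with almost no contribution from the first entry.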

Attention in Detail

All attention variants can be understood through the three‑step framework. Differences mainly appear in the score‑function and the way the context vector is generated.

Hard vs. Soft Attention

Hard attention samples a single input vector based on the attention distribution, while soft attention computes a weighted sum of all input vectors. Soft attention is differentiable and therefore more widely used.
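The difference shows up only in the last step, once the attention weights exist. A small sketch, assuming the weights have already been computed; the function names and toy values are illustrative.

```python
import random

def soft_attention(weights, inputs):
    """Weighted sum of all input vectors (differentiable in the weights)."""
    dim = len(inputs[0])
    return [sum(w * x[d] for w, x in zip(weights, inputs)) for d in range(dim)]

def hard_attention(weights, inputs, rng):
    """Sample exactly one input vector from the attention distribution."""
    idx = rng.choices(range(len(inputs)), weights=weights, k=1)[0]
    return inputs[idx]

weights = [0.7, 0.2, 0.1]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

soft = soft_attention(weights, inputs)  # a blend of all three inputs
hard = hard_attention(weights, inputs, random.Random(0))  # one of the inputs
```

The sampling step in hard_attention has no useful gradient with respect to the weights, which is why hard attention is usually trained with sampling‑based estimators rather than plain backpropagation.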

Global vs. Local Attention

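Global attention considers the entire input sequence, whereas local attention restricts attention to a window around a chosen position, which reduces noise and computation. The variants differ in how the window center is chosen: local‑m (monotonic alignment) centers it on the current target position, while local‑p (predictive alignment) predicts the center from the current state.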
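A windowed softmax is enough to show the contrast. In this sketch the window center is passed in directly, and the half_width parameter and function name are assumptions for illustration; predicting the center and any extra weighting inside the window are left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def local_attention_weights(scores, center, half_width):
    """Softmax over positions within half_width of center; zero elsewhere."""
    lo = max(0, center - half_width)
    hi = min(len(scores), center + half_width + 1)
    window = softmax(scores[lo:hi])
    return [0.0] * lo + window + [0.0] * (len(scores) - hi)

scores = [0.1, 2.0, 0.3, 1.5, 0.2, 3.0]

global_w = softmax(scores)  # every position gets some weight
local_w = local_attention_weights(scores, center=2, half_width=1)
```

With these numbers the global weights are dominated by the last position, while the local weights ignore it entirely and redistribute all mass over positions 1 to 3.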

Score Functions

Common score functions include:

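Dot product (or scaled dot product) – takes the inner product of query and key, which must live in the same space; the scaled form divides by the square root of the key dimension.

Additive – projects query and key with learned linear maps, sums them, applies a non‑linearity (tanh), and reduces the result to a scalar with a learned vector.

Multiplicative – inserts a learned matrix between query and key, with no non‑linearity.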
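Written as code, the score functions differ by only a line or two each. The following is a sketch under assumed names (W, W_q, W_k, v) and toy values; identity matrices stand in for parameters that would normally be learned.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [dot(row, x) for row in M]

def dot_score(q, k):
    return dot(q, k)

def scaled_dot_score(q, k):
    # Divide by sqrt(d_k) to keep scores in a moderate range
    return dot(q, k) / math.sqrt(len(k))

def multiplicative_score(q, k, W):
    # q^T W k: a learned matrix between query and key, no non-linearity
    return dot(q, matvec(W, k))

def additive_score(q, k, W_q, W_k, v):
    # v . tanh(W_q q + W_k k)
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W_q, q), matvec(W_k, k))]
    return dot(v, hidden)

q = [1.0, 2.0, 3.0, 4.0]
k = [1.0, 0.0, 1.0, 0.0]
# Identity matrix as a placeholder for learned parameters (assumption).
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
ones = [1.0] * 4

s_dot = dot_score(q, k)
s_scaled = scaled_dot_score(q, k)
s_mult = multiplicative_score(q, k, identity)
s_add = additive_score(q, k, identity, identity, ones)
```

With an identity matrix the multiplicative score reduces to the plain dot product, which makes the relationship between the two explicit. The additive score is bounded because of the tanh, whereas the dot‑product scores grow with the vector norms.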

Self‑Attention and Multi‑Head Attention

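Self‑attention lets every position attend to every other position in the same sequence. It removes the locality assumption of CNNs and the sequential bottleneck of RNNs: any two positions interact in a single step (a constant path length), and all positions can be computed in parallel. The trade‑off is a cost that grows quadratically with sequence length. Multi‑head attention runs several attention heads in parallel, each with its own linear projections, concatenates their outputs, and applies a final linear projection.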
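A compact sketch of multi‑head self‑attention follows, using scaled dot‑product scores. The dimensions, the random initialization, the helper names, and the output projection W_o are assumptions for illustration; masking, biases, and batching are omitted.

```python
import math
import random

def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v):
    """One head: queries, keys and values are all projections of X."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[d] for wi, v in zip(w, V)) for d in range(len(V[0]))])
    return out

def multi_head_attention(X, heads, W_o):
    """Run each head, concatenate per position, then project with W_o."""
    head_outputs = [self_attention(X, *h) for h in heads]
    concat = [sum((ho[i] for ho in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_o)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, n_heads, d_head = 3, 4, 2, 2
X = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
W_o = random_matrix(n_heads * d_head, d_model, rng)

Y = multi_head_attention(X, heads, W_o)  # shape: seq_len x d_model
```

Each row of Y depends on every row of X, and the rows can be computed independently of one another, which is the source of the parallelism described above.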

Variants and Extensions

Beyond the basic forms, many specialized attentions have been proposed:

Hierarchical Attention Networks – apply attention at word and sentence levels for document classification.

Attention‑over‑Attention – compute mutual attention between query and document in reading‑comprehension tasks.

Convolutional Sequence‑to‑Sequence – combine CNNs with attention to achieve parallelism.

Weighted Transformer – replace multi‑head attention with multiple self‑attention branches whose outputs are combined using learned weights.

Transformer‑XL – extend the transformer to handle longer contexts via segment‑level recurrence.

Why Attention Works

At its core, attention performs a weighted sum, enabling models to dynamically select relevant information based on context. This simple principle improves feature selection and representation across tasks, from language modeling (e.g., BERT, GPT) to vision and recommendation systems.

Conclusion

Attention evolved from early additive mechanisms to the transformer architecture, much as human language learning progresses from rote memorization to contextual understanding. Because it builds every representation from its context, on the principle that context is everything, it has become a general‑purpose tool across AI applications.

References

1. Bahdanau et al., 2014. Neural Machine Translation by Jointly Learning to Align and Translate.

2. Xu et al., 2015. Show, Attend and Tell.

3. Luong et al., 2015. Effective Approaches to Attention‑Based Neural Machine Translation.

4. Vaswani et al., 2017. Attention Is All You Need.

5. Dai et al., 2019. Transformer‑XL.

6. Yang et al., 2016. Hierarchical Attention Networks for Document Classification.

7. Cui et al., 2016. Attention‑over‑Attention for Reading Comprehension.

8. Gehring et al., 2017. Convolutional Sequence‑to‑Sequence Learning.

9. https://github.com/kimiyoung/transformer-xl

10. https://lilianweng.github.io/2018/06/24/attention-attention.html

11. https://jalammar.github.io/illustrated-transformer/
