Demystifying Attention: A Beginner’s Guide to History, Types, and Applications

This article gives a beginner‑friendly overview of attention mechanisms, from their origins in early neural machine translation papers to modern self‑attention, multi‑head attention, and transformer variants. It explains the core concepts and common variants, and why attention has become essential across NLP and vision tasks.

Alibaba Cloud Developer

Aug 9, 2019

Introduction

Attention was first introduced for neural machine translation by Bahdanau et al. (preprint in 2014, published at ICLR 2015) and quickly became a cornerstone of both NLP and computer vision, because it lets a model focus on the relevant parts of a complex input. The rise of self‑attention in 2017 ushered in the transformer era and substantially improved representation learning.

History

Key milestones include:

2015 ICLR: Neural Machine Translation by Jointly Learning to Align and Translate – introduced additive (Bahdanau) attention.

2015 EMNLP: Effective Approaches to Attention‑based Neural Machine Translation – explored multiplicative (Luong) attention.

2015 ICML: Show, Attend and Tell – presented hard/soft visual attention.

2017 NIPS: Attention Is All You Need – proposed the transformer with self‑attention and multi‑head attention.

What Is Attention?

Attention can be abstracted into three functional components:

Score function – measures similarity between a query and keys.

Alignment function – normalizes scores (typically with softmax) to produce attention weights.

Context vector generation – aggregates values as a sum weighted by the attention weights.
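
The three components above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any particular paper's formulation: it assumes a plain dot product as the score function, and the names attend, query, keys, and values are hypothetical.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query, keys, values):
    scores = keys @ query        # 1. score function (dot product, an assumption)
    weights = softmax(scores)    # 2. alignment function: normalize scores
    context = weights @ values   # 3. context vector: weighted sum of values
    return context, weights

query = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])

context, weights = attend(query, keys, values)
print(weights)  # the first and third keys score highest, so they get the largest weights
print(context)
```

Swapping the first line of attend for a different score function changes the attention variant while leaving the other two steps untouched.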

Alignment‑Based Attention

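In this view, the model receives a context c (for example, the current decoder state) and an input y (for example, a sequence of encoder states), and produces an output z that summarizes y with respect to c. The classic Bahdanau attention follows the three‑step process above.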
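
As a rough sketch of additive (Bahdanau‑style) attention, the score for each input element can be computed by a small feed‑forward layer over the decoder state and that element. The dimensions, the random parameters W_s, W_h, and v, and the function name additive_attention are assumptions made for illustration; in a real model the parameters are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
d_dec, d_enc, d_att, src_len = 4, 6, 5, 3  # illustrative sizes (assumed)

# Randomly initialized stand-ins for learned parameters.
W_s = rng.normal(size=(d_att, d_dec))
W_h = rng.normal(size=(d_att, d_enc))
v = rng.normal(size=d_att)

def additive_attention(s, H):
    """s: decoder state (d_dec,); H: encoder states (src_len, d_enc)."""
    # Score: v^T tanh(W_s s + W_h h_j), one score per encoder state.
    scores = np.tanh(H @ W_h.T + s @ W_s.T) @ v
    # Alignment: softmax over source positions.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context: weighted sum of encoder states.
    context = weights @ H
    return context, weights

s = rng.normal(size=d_dec)
H = rng.normal(size=(src_len, d_enc))
context, weights = additive_attention(s, H)
print(weights.shape, context.shape)  # one weight per source position, one context vector
```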

Memory‑Based Attention (QKV Model)

This perspective treats attention as a query‑key‑value lookup: the query q searches a memory of key‑value pairs (k, v) to retrieve relevant information.
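
The lookup analogy can be made concrete with a small sketch. It assumes scaled dot‑product scoring and a toy memory whose keys are well separated, so that a query close to one key retrieves almost exactly that key's value. The function name and the toy data are hypothetical.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_queries, d_k); K: (n_items, d_k); V: (n_items, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# A toy memory of three key-value pairs with well-separated keys.
K = 10.0 * np.eye(3)
V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# A query that closely matches the first key.
Q = np.array([[10.0, 0.0, 0.0]])

retrieved, weights = scaled_dot_product_attention(Q, K, V)
print(weights)    # almost all weight falls on the first memory slot
print(retrieved)  # close to the first value
```

Unlike a dictionary lookup, the retrieval is soft: a query that sits between several keys returns a blend of their values.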

Attention in Detail

All attention variants can be analyzed through the three‑step framework. Major families include:

Hard vs. Soft Attention – hard attention samples a single input element; soft attention computes a weighted sum.

Global vs. Local Attention – global attends to all inputs, while local restricts the attention window, often using Gaussian weighting.

Score Functions – dot product, scaled dot product, additive, multiplicative, etc.
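
The differences between these families are easiest to see side by side. The sketch below is illustrative only: the scores are made up, and the local variant loosely follows the idea of a window of half‑width D around a center position p with Gaussian weighting (standard deviation D / 2 is an assumption here).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 0.5, 1.0, -1.0, 0.0])  # made-up scores for 5 inputs
values = rng.normal(size=(5, 3))
weights = softmax(scores)

# Soft attention: a differentiable weighted sum over all inputs.
soft_context = weights @ values

# Hard attention: sample a single input according to the weights.
idx = rng.choice(len(weights), p=weights)
hard_context = values[idx]

def local_weights(scores, p, D):
    """Local attention sketch: softmax inside [p - D, p + D], Gaussian-weighted."""
    positions = np.arange(len(scores))
    in_window = np.abs(positions - p) <= D
    masked = np.where(in_window, scores, -np.inf)  # positions outside get weight 0
    gauss = np.exp(-((positions - p) ** 2) / (2 * (D / 2) ** 2))
    return softmax(masked) * gauss

local = local_weights(scores, p=2, D=1)
local_context = local @ values
print(local)  # nonzero only at positions 1, 2, and 3
```

Global attention corresponds to the plain weights above, which cover every input. Because hard attention involves sampling, it is not differentiable and is typically trained with sampling‑based estimators rather than ordinary backpropagation.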

Self‑Attention & Multi‑Head Attention

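Self‑attention removes recurrence: every position attends directly to every other position, so any two positions are connected by a constant number of sequential steps regardless of their distance, and all positions can be computed in parallel. Multi‑head attention runs several attention heads in parallel, each with its own learned projections of the queries, keys, and values, so that different heads can learn different relational patterns.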
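
A minimal sketch of multi‑head self‑attention follows. It assumes scaled dot‑product scoring, omits masking, biases, and positional encodings, and uses random matrices as stand‑ins for the learned projections W_q, W_k, W_v, and W_o. All names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (seq_len, d_model). Queries, keys, and values all come from X."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    # Every position scores every other position, one score matrix per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)          # (num_heads, seq_len, seq_len)
    heads = weights @ V                         # (num_heads, seq_len, d_head)
    # Concatenate the heads and mix them with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape, weights.shape)
```

Note that nothing in the function loops over positions: the whole sequence is processed with a handful of matrix products, which is where the parallelism comes from.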

Transformer Variants and Extensions

Beyond the original transformer, many extensions improve context handling:

Transformer‑XL – introduces segment‑level recurrence for longer contexts.

Weighted Transformer – replaces multi‑head attention with multiple self‑attention branches whose outputs are combined using learned weights.

Hierarchical Attention Networks – apply attention at word and sentence levels for document classification.
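
To give a feel for segment‑level recurrence, the sketch below processes a long sequence in fixed‑size segments and lets each segment attend over cached states from the previous one. It is a heavily simplified, single‑layer illustration: causal masking, relative positional encodings, and the stop‑gradient on the cache are all omitted, and every name is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_memory(segment, memory, W_q, W_k, W_v):
    """Queries come from the current segment only; keys and values
    also cover the cached states of the previous segment."""
    extended = np.concatenate([memory, segment], axis=0)
    Q = segment @ W_q
    K = extended @ W_k
    V = extended @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, seg_len = 4, 3
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# A sequence of 6 positions split into 2 segments of 3.
sequence = rng.normal(size=(6, d_model))
segments = [sequence[i:i + seg_len] for i in range(0, len(sequence), seg_len)]

memory = np.zeros((0, d_model))  # empty cache before the first segment
for segment in segments:
    out, weights = attend_with_memory(segment, memory, W_q, W_k, W_v)
    print(weights.shape)  # the second segment sees 6 positions instead of 3
    memory = segment      # cache this segment's states for the next one
```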

Why Attention Works

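At its core, attention computes a weighted sum of values in which the weights reflect each value's relevance to a query. This lets models pick out salient information, model long‑range dependencies directly, and, in the case of self‑attention, compute all positions in parallel, which helps explain its success across NLP, vision, and recommendation tasks.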

References

Bahdanau et al., 2014. Neural Machine Translation by Jointly Learning to Align and Translate.

Luong et al., 2015. Effective Approaches to Attention‑based Neural Machine Translation.

Vaswani et al., 2017. Attention Is All You Need.

Dai et al., 2019. Transformer‑XL: Attentive Language Models Beyond a Fixed‑Length Context.

Yang et al., 2016. Hierarchical Attention Networks for Document Classification.
