Why Transformers Revolutionized NLP: From Problems to Solutions
This article traces the historical challenges of natural language processing, from rule-based and statistical models to recurrent networks and their limitations. It then introduces the Transformer architecture, including self-attention, multi-head attention, and the supporting layers, and shows how it overcomes those earlier problems and enables efficient parallel training.
1. The Rise of Artificial Intelligence
In 1950, Alan Turing published "Computing Machinery and Intelligence," a groundbreaking paper that argued truly intelligent machines were possible and proposed the famous Turing test, which asks whether a computer can imitate human conversation well enough that human judges cannot tell it apart from a person.
2. Development of NLP
Understanding natural language is the first step for machines to perform human‑like reasoning, making NLP a crucial field.
Rule‑Based Models
Early research relied on manually crafted rules. Writing them required extensive expert effort, and they could not handle inputs the rules did not anticipate, although they performed well in narrow domains such as e-commerce customer service.
Statistical Models
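In the 1980s and 1990s, models based on the Markov assumption were introduced: the next word is assumed to depend only on the previous n-1 words. This led to bigram and, more generally, n-gram models. Capturing longer context requires a larger n, but the probability table grows exponentially with n, so long-distance dependencies remained out of reach in practice.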
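The counting idea behind a bigram model can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and a tiny made-up corpus; the names `train_bigram`, `table_size`, and `corpus` are hypothetical and not from the original article.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Estimate P(next | prev) by maximum likelihood from raw counts."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Sentence boundary markers let the model score first and last words.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    # Normalize each row of counts into a probability distribution.
    return {
        prev: {w: c / sum(nxt_counts.values()) for w, c in nxt_counts.items()}
        for prev, nxt_counts in counts.items()
    }

def table_size(vocab_size, n):
    """Number of distinct n-grams a full probability table would need to cover."""
    return vocab_size ** n

corpus = ["the cat sat", "the dog sat", "the cat ran"]  # toy data (assumption)
model = train_bigram(corpus)
```

With a vocabulary of 10,000 words, `table_size(10_000, 2)` is 10^8 entries while `table_size(10_000, 3)` is already 10^12, which is why simply raising n does not scale.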
Neural Network Models
Convolutional (CNN) and recurrent (RNN) neural networks emerged, inspired by brain mechanisms. An RNN processes a sequence one token at a time and carries a hidden state forward, which alleviates some long-distance issues but introduces vanishing and exploding gradients.
RNN Gradient Problems
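During back-propagation through time, the error signal is multiplied by a similar factor at every time step. When that factor is below 1, the signal decays toward zero, so early tokens receive near-zero gradients and long-range dependencies are hard to learn. When it is above 1, the signal grows without bound instead.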
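The multiplicative effect can be shown with a deliberately simplified sketch. It assumes every time step scales the gradient by the same scalar, which a real RNN does not do exactly (the factor there comes from the recurrent weights and activation derivatives); the function name is hypothetical.

```python
def gradient_through_time(factor, steps):
    """Gradient magnitude after back-propagating through `steps` time steps,
    assuming each step multiplies the gradient by the same scalar `factor`."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

vanishing = gradient_through_time(0.9, 100)  # factor < 1: shrinks toward zero
exploding = gradient_through_time(1.1, 100)  # factor > 1: grows very large
```

Even factors close to 1 compound quickly: after 100 steps, 0.9 leaves a gradient on the order of 1e-5 while 1.1 produces one on the order of 1e4.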
LSTM (Long Short‑Term Memory)
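LSTM adds a memory cell and three gates (input, forget, output) that control what is written to, kept in, and read from that cell. The cell state acts as a fast lane for important information: because it is updated additively rather than being repeatedly squashed, gradients survive over more time steps, which mitigates the vanishing-gradient problem and improves handling of long contexts.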
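A single LSTM time step can be sketched in NumPy as below. This follows the common textbook formulation; packing the four transforms into one stacked weight matrix `W`, the gate ordering inside it, and the random initial weights are assumptions made for brevity, not details from the article.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step.

    x: (input_dim,), h_prev and c_prev: (hidden_dim,)
    W: (4 * hidden_dim, input_dim + hidden_dim), b: (4 * hidden_dim,)
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0 * hidden:1 * hidden])  # input gate: how much new info to write
    f = sigmoid(z[1 * hidden:2 * hidden])  # forget gate: how much old memory to keep
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate: how much memory to expose
    g = np.tanh(z[3 * hidden:4 * hidden])  # candidate values for the cell
    c = f * c_prev + i * g                 # additive cell update
    h = o * np.tanh(c)                     # new hidden state
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden_dim, input_dim + hidden_dim))
b = np.zeros(4 * hidden_dim)
h, c = lstm_step(rng.normal(size=input_dim),
                 np.zeros(hidden_dim), np.zeros(hidden_dim), W, b)
```

The line `c = f * c_prev + i * g` is the part that matters for gradients: when the forget gate stays near 1, the previous cell state passes through almost unchanged.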
3. Transformer
What Is a Transformer?
The Transformer model, introduced by Google in the 2017 paper “Attention Is All You Need,” replaces recurrence with self‑attention, allowing parallel processing of sequences.
Word and Position Embedding
Words are mapped to high-dimensional vectors (embeddings) in which semantically similar words lie close together. Because self-attention on its own is insensitive to token order, positional embeddings are added to the word embeddings to encode each token's position.
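One way to build the position signal is the fixed sinusoidal encoding described in "Attention Is All You Need"; learned position embeddings are another common choice. The sketch below assumes an even `d_model` and uses a random matrix as a stand-in for a trained embedding table; `embedding_table` and `token_ids` are illustrative names.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal position encodings, shape (seq_len, d_model).
    Assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]   # dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

vocab_size, d_model = 10, 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))  # stand-in for learned weights
token_ids = np.array([3, 1, 3])  # the same token appears at positions 0 and 2
x = embedding_table[token_ids] + sinusoidal_positions(len(token_ids), d_model)
```

Token 3 appears twice, yet `x[0]` and `x[2]` differ after the position encoding is added, which is what lets later layers distinguish the two occurrences.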
Self‑Attention Mechanism
Self‑attention computes Query (Q), Key (K), and Value (V) vectors for each token, calculates scaled dot‑product scores between Q and K, applies softmax to obtain attention weights, and aggregates V accordingly.
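The steps above correspond to softmax(Q K^T / sqrt(d_k)) V from the original paper. Here is a minimal single-head sketch in NumPy, assuming random projection matrices in place of trained weights and no batching; the function and variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
output, weights = self_attention(X, W_q, W_k, W_v)
```

Row i of `weights` says how much token i draws from every token in the sequence. All rows are computed in one matrix product, which is the source of the parallelism that RNNs lack.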
Multi‑Head Attention
Multiple self‑attention heads run in parallel, each learning different relational aspects; their outputs are concatenated and linearly transformed.
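A compact way to sketch multi-head attention is to project once, split the feature dimension into heads, attend per head, then concatenate and apply an output projection. The code below assumes `d_model` is divisible by `num_heads` and uses random weights; `W_o` is the final linear transform, and all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Multi-head self-attention over X of shape (seq_len, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # (heads, seq, d_head)
    # Concatenate the heads back into (seq_len, d_model), then mix them.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
```

Each head works in a `d_head`-dimensional subspace with its own attention weights, so different heads are free to focus on different token relationships.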
Add & Norm Layers
Residual connections preserve original information, while layer normalization stabilizes training and improves generalization.
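The Add & Norm step can be sketched as LayerNorm(x + Sublayer(x)), the post-norm arrangement used in the original paper. The example assumes a random array in place of a real sublayer output, and sets the scale `gamma` to ones and the shift `beta` to zeros; in a trained model both are learned.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each position's feature vector to zero mean and unit variance,
    then apply a scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_out, gamma, beta):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer_out, gamma, beta)

rng = np.random.default_rng(0)
d_model = 8
x = rng.normal(size=(4, d_model))
sublayer_out = rng.normal(size=(4, d_model))  # stand-in for attention or FFN output
y = add_and_norm(x, sublayer_out, np.ones(d_model), np.zeros(d_model))
```

Because `x` is added back before normalization, the sublayer only has to learn a correction to its input rather than a full replacement for it.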
Feed‑Forward Layer
A position‑wise feed‑forward network adds non‑linear transformation, enhancing feature representation and model capacity.
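In the original paper the feed-forward layer is two linear maps with a ReLU in between, FFN(x) = max(0, x W1 + b1) W2 + b2. A minimal sketch, assuming random weights and small toy dimensions (`d_ff` is the hidden width; the names are illustrative):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: the same weights are applied
    to every position independently."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU non-linearity
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(4, d_model))
y = feed_forward(x, W1, b1, W2, b2)
```

"Position-wise" means no information moves between tokens here: running the network on a single row of `x` gives the same result as the matching row of `y`. Mixing across tokens is left entirely to attention.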
Encoder and Decoder
The encoder stacks self-attention and feed-forward blocks to produce contextual representations. The decoder adds masked self-attention, which prevents a token from attending to later tokens, plus cross-attention over the encoder outputs, and ends with a linear layer followed by softmax that predicts the next token.
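The masking in the decoder can be sketched by setting the scores for future positions to negative infinity before the softmax, so they receive zero weight. This assumes a single head and random Q and K matrices; `causal_mask` and `masked_attention_weights` are illustrative names.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(seq_len):
    """True where attention is allowed: position i may attend to j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_weights(Q, K):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Future positions get -inf, which softmax turns into exactly 0.
    scores = np.where(causal_mask(len(Q)), scores, -np.inf)
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
weights = masked_attention_weights(Q, K)
```

The resulting matrix is lower triangular: the first token can only attend to itself, and token i never sees tokens after position i. That lets training run in parallel over the whole sequence while still matching left-to-right generation.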
Transformer Summary
Enables parallel training unlike RNNs.
Requires positional embeddings to retain order information.
Self‑attention with Q, K, V matrices is the core component.
Multi‑head attention captures diverse relational patterns.
References
https://github.com/datawhalechina/learn-nlp-with-transformers
https://tech.dewu.com/article?id=109
https://zhuanlan.zhihu.com/p/338817680
https://arxiv.org/pdf/1706.03762
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
