Why Transformers Revolutionized NLP: From Problems to Solutions

This article explains the historical challenges of natural language processing, from rule‑based and statistical models to recurrent networks and their limitations, then introduces the Transformer architecture, its self‑attention mechanism, multi‑head attention, and supporting layers, illustrating how it overcomes previous issues and enables efficient parallel training.

Alibaba Cloud Developer

Mar 10, 2025

1. The Rise of Artificial Intelligence

In 1950, Alan Turing published the groundbreaking paper “Computing Machinery and Intelligence,” which argued that truly intelligent machines were possible and proposed what became known as the Turing test: a machine passes if human judges cannot reliably tell its conversation apart from a human's.

2. Development of NLP

Understanding natural language is the first step for machines to perform human‑like reasoning, making NLP a crucial field.

Rule‑Based Models

Early research relied on manually crafted rules. These systems required extensive expert effort and could not handle inputs their rules did not anticipate, although they performed well in narrow domains such as e‑commerce customer service.

Statistical Models

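In the 1980s and 1990s, statistical models based on the Markov assumption were introduced: the probability of a word is taken to depend only on the previous n‑1 words, which gives bigram (n = 2) and general n‑gram models. These models cannot capture long‑distance dependencies, because anything outside the n‑1 word window is ignored, and widening the window is impractical: the number of possible n‑grams, and with it the size of the probability table, grows exponentially with n.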
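As a rough illustration of the statistical approach, the sketch below estimates bigram probabilities by simple counting. The function name `train_bigram` and the toy corpus are made up for this example; real systems also apply smoothing, which is omitted here.

```python
from collections import Counter

def train_bigram(tokens):
    """Estimate P(next | prev) by maximum likelihood from a list of tokens."""
    pair_counts = Counter(zip(tokens, tokens[1:]))   # counts of (prev, next)
    prev_counts = Counter(tokens[:-1])               # counts of prev as a context
    return {
        (prev, nxt): count / prev_counts[prev]
        for (prev, nxt), count in pair_counts.items()
    }

# Toy corpus (an assumption for illustration only).
corpus = "the cat sat on the mat the cat ran".split()
probs = train_bigram(corpus)

# "the" appears 3 times as a context and is followed by "cat" twice.
print(probs[("the", "cat")])
```

With a vocabulary of V words, a full n‑gram table can need up to V ** n entries, which is why simply increasing n does not scale.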

Neural Network Models

Neural network models, loosely inspired by the brain, followed, including convolutional (CNN) and recurrent (RNN) networks. An RNN reads a sequence one token at a time and carries a hidden state forward, which alleviates part of the long‑distance problem but introduces vanishing and exploding gradients.

RNN Gradient Problems

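During back‑propagation through time, the error signal is multiplied by a similar factor at every time step. When these factors are smaller than 1, the signal decays exponentially, so early tokens receive near‑zero gradients and long‑range dependencies are hard to learn. When they are larger than 1, the gradient grows exponentially instead, which is the exploding‑gradient case.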
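The multiplicative effect can be seen in a minimal sketch that assumes a one‑dimensional RNN, h_t = tanh(w * h_{t-1} + x_t). The weight values and inputs below are arbitrary choices for illustration, not taken from the article.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule: one factor per time step
    return grad

# Recurrent weight below 1: the gradient shrinks with sequence length.
short = gradient_through_time(0.5, [0.1] * 5)
long = gradient_through_time(0.5, [0.1] * 50)

# Recurrent weight above 1 with zero inputs: tanh stays in its linear
# region, so every step multiplies the gradient by w.
exploding = gradient_through_time(2.0, [0.0] * 50)

print(short, long, exploding)
```

Each step contributes one factor to the product, so 50 steps with factors of at most 0.5 leave almost nothing of the original signal, while factors of 2 produce an enormous gradient.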

LSTM (Long Short‑Term Memory)

LSTM adds a memory cell and three gates (input, forget, and output) that create a fast lane, or “green channel,” along which important information can pass through many time steps largely unchanged. This mitigates the vanishing‑gradient problem and improves the handling of long contexts.
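A single LSTM step can be sketched with scalar states to show how the gates interact. This is a simplified illustration: the parameter layout (`p` mapping a gate name to a weight triple) and the chosen values are assumptions for the example, and real LSTMs use vectors and weight matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell. p maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much new information to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate value to write
    c = f * c_prev + i * g           # additive update of the memory cell
    h = o * math.tanh(c)
    return h, c

# Hypothetical parameters: forget gate nearly open, input gate nearly closed.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 1.0
for _ in range(100):
    h, c = lstm_step(0.5, h, c, params)
print(c)  # the cell state is still close to its initial value of 1.0
```

Because the cell state is updated additively and scaled by a forget gate that can stay near 1, information (and its gradient) can survive many steps.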

3. Transformer

What Is a Transformer?

The Transformer, introduced by Google researchers in the 2017 paper “Attention Is All You Need,” replaces recurrence with self‑attention, so all positions in a sequence can be processed in parallel.

Word and Position Embedding

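Words are mapped to high‑dimensional vectors (embeddings) in which semantically similar words lie close together. Because self‑attention has no built‑in notion of token order, a positional encoding is added to each embedding to mark the token's position. The original paper uses fixed sinusoidal encodings; learned positional embeddings are a common alternative.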
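One way to realize the positional signal is the sinusoidal scheme from “Attention Is All You Need,” where PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)). The sketch below is a plain‑Python version; the helper names are chosen for this example.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: even dimensions use sin, odd dimensions use cos."""
    table = []
    for pos in range(seq_len):
        row = []
        for dim in range(d_model):
            pair = dim // 2  # index i of the (sin, cos) pair
            angle = pos / (10000 ** (2 * pair / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add the positional encoding element-wise to a list of token embeddings."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, pe)]

pe = positional_encoding(4, 6)
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

The encoding has the same width as the embedding, so the two can simply be summed before the first attention layer.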

Self‑Attention Mechanism

For each token, self‑attention computes a Query (Q), a Key (K), and a Value (V) vector. It takes the dot product of each query with every key, scales the scores by the square root of the key dimension, applies softmax to obtain attention weights, and returns the correspondingly weighted sum of the value vectors.
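In formula form this is Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where d_k is the key dimension. A minimal sketch in plain Python follows, using lists of row vectors in place of matrices. In a real model Q, K, and V come from learned projections of the token embeddings; here they are passed in directly, and the input values are arbitrary.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, w = attention(Q, K, V)
print(w)    # each row sums to 1; each token weights its matching key most
print(out)
```

Because every query is compared with every key independently, all rows can be computed at once, which is what makes the operation easy to parallelize.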

Multi‑Head Attention

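Multiple self‑attention heads run in parallel, each with its own learned projections for Q, K, and V, so each head can learn a different kind of relationship between tokens. Their outputs are concatenated and passed through a final linear transformation.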
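The head‑wise computation can be sketched as below. This is an illustrative toy, not a trained model: the weights are random, the sizes (3 tokens, model width 4, 2 heads) are arbitrary, and names such as `multi_head_attention` and `random_matrix` are chosen for this example.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q @ K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) projection triples, one per head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the heads per token along the feature axis, then mix with W_o.
    concat = [sum((h[t] for h in head_outputs), []) for t in range(len(X))]
    return matmul(concat, W_o)

rng = random.Random(0)
d_model, n_heads = 4, 2
d_head = d_model // n_heads
X = random_matrix(3, d_model, rng)  # 3 token vectors
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
W_o = random_matrix(n_heads * d_head, d_model, rng)

out = multi_head_attention(X, heads, W_o)
print(len(out), len(out[0]))  # one d_model-sized vector per input token
```

Each head works in a smaller subspace (d_model divided by the number of heads), so the total cost stays comparable to a single full‑width attention.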

Add & Norm Layers

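Each sub‑layer is wrapped in a residual connection followed by layer normalization. The residual connection preserves the original input and eases gradient flow through deep stacks, while layer normalization stabilizes training and improves generalization.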
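The Add & Norm step can be written as LayerNorm(x + Sublayer(x)). The sketch below shows it for a single token vector and, as a simplifying assumption, leaves out the learnable gain and bias that layer normalization normally includes.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x = [1.0, 2.0, 3.0, 4.0]
sublayer_out = [0.5, 0.5, 0.5, 0.5]  # stand-in for an attention or FFN output
y = add_and_norm(x, sublayer_out)
print(y)
```

Normalization is done per token across its feature dimension, so it does not depend on batch size or sequence length.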

Feed‑Forward Layer

A position‑wise feed‑forward network, applied to each token independently with the same weights, adds a non‑linear transformation, enhancing feature representation and model capacity.
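In the original paper this layer has the form FFN(x) = max(0, x W1 + b1) W2 + b2. The sketch below applies it to one token vector; the tiny weight matrices are made‑up values for illustration.

```python
def feed_forward(x, W1, b1, W2, b2):
    """Two linear maps with a ReLU in between, applied to one token vector."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

# Hypothetical weights: model width 2, hidden width 3.
W1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 2.0],
      [3.0, 4.0],
      [5.0, 6.0]]
b2 = [0.5, 0.5]

tokens = [[1.0, -2.0], [0.0, 1.0]]
# "Position-wise": the same weights are applied to every token independently.
outputs = [feed_forward(tok, W1, b1, W2, b2) for tok in tokens]
print(outputs)
```

The hidden layer is usually wider than the model dimension, which is where much of the model's parameter count sits.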

Encoder and Decoder

The encoder stacks self‑attention and feed‑forward blocks to produce contextual representations. The decoder adds masked self‑attention, which prevents a token from attending to later tokens, and cross‑attention over the encoder outputs. It ends with a linear projection and a softmax layer that produce a probability distribution over the next token.
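Masking in the decoder is commonly implemented by setting the scores for future positions to negative infinity before the softmax, so their weights become exactly zero. A minimal sketch, with function names chosen for this example:

```python
import math

def causal_mask(scores):
    """Keep score[i][j] only when j <= i; hide future positions with -inf."""
    return [[s if j <= i else float("-inf") for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

def softmax(row):
    m = max(row)  # the diagonal is always visible, so m is finite
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Uniform raw scores for a 3-token sequence.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = [softmax(row) for row in causal_mask(scores)]
for row in weights:
    print(row)
```

Token 0 can attend only to itself, token 1 to tokens 0 and 1, and so on, which lets all positions be trained in parallel without leaking future tokens.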

Transformer Summary

Enables parallel training unlike RNNs.

Requires positional embeddings to retain order information.

Self‑attention with Q, K, V matrices is the core component.

Multi‑head attention captures diverse relational patterns.

References

https://github.com/datawhalechina/learn-nlp-with-transformers

https://tech.dewu.com/article?id=109

https://zhuanlan.zhihu.com/p/338817680

https://arxiv.org/pdf/1706.03762

