From RNN to LLM: How Transformers Power Modern Language Models

This article explains the evolution from RNNs through Encoder‑Decoder models to Transformers, detailing self‑attention, multi‑head attention, and masked attention, and then describes what Large Language Models are, their key components, capabilities, limitations, and common applications.

Data Party THU

Recurrent Neural Network (RNN)

Feature: Processes a sequence token‑by‑token, maintaining a hidden state that is updated at each step.

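Limitation: Gradients tend to vanish or explode as they propagate back through long sequences, which makes long‑range dependencies hard to capture. Because each step depends on the previous hidden state, computation cannot be parallelised across time steps, so training is slow.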
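
The token‑by‑token update described above can be sketched in plain Python. This is a minimal illustration of a vanilla (tanh) RNN cell, which is one common variant; the function names, weight values and inputs are assumptions chosen for the example, not taken from the article.

```python
import math

def rnn_step(x, h, W_xh, W_hh, b):
    """One vanilla RNN update: h_new = tanh(W_xh @ x + W_hh @ h + b)."""
    h_new = []
    for i in range(len(h)):
        total = b[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][j] * h[j] for j in range(len(h)))
        h_new.append(math.tanh(total))
    return h_new

def run_rnn(inputs, W_xh, W_hh, b):
    """Process the sequence strictly in order: step t needs the state from step t-1."""
    h = [0.0] * len(b)  # initial hidden state
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b)
        states.append(h)
    return states

# Toy weights (arbitrary illustrative values, not trained).
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states = run_rnn(inputs, W_xh, W_hh, b)
```

The loop in `run_rnn` is the reason recurrent models are hard to parallelise: no hidden state can be computed before the one preceding it.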

Encoder‑Decoder Architecture

Typical tasks: Machine translation, summarisation, and other sequence‑to‑sequence problems.

Design: An Encoder converts the input sequence into a continuous latent representation; a Decoder generates the output sequence from that representation, often using attention to focus on relevant encoder states.

Encoder‑Decoder Diagram

Transformer Architecture

Computation is parallelisable across all positions of a sequence during training, which dramatically speeds up training compared with recurrent models.

Excels at modelling long‑distance relationships through self‑attention.

Core components:

Self‑Attention

For each token, three vectors are computed from its embedding through learned linear projections: Query (Q), Key (K) and Value (V). The attention weight from token i to token j is the scaled dot product of token i's Q vector and token j's K vector, normalised with Softmax over all j; the output for token i is the weighted sum of the V vectors.

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
\]

Scaled Dot‑Product Attention

The division by \(\sqrt{d_k}\) (where \(d_k\) is the dimensionality of the key vectors) prevents the dot‑product values from growing too large, which would push the Softmax into regions with vanishing gradients.
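
A minimal sketch of scaled dot‑product attention in plain Python, assuming Q, K and V are given as lists of row vectors. The function names and the toy matrices are illustrative choices, not part of the original article.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output column at a time.
        out = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example: each query matches one key more strongly than the other.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

out, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, and each output row leans towards the value vector whose key best matches the query.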

Multi‑Head Attention

Instead of a single attention operation, the Transformer runs h parallel attention heads, each with its own learned linear projections of Q, K and V. The heads capture different semantic subspaces and their outputs are concatenated and linearly transformed.
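
The project, attend, concatenate and transform steps can be sketched as follows. This is a simplified illustration in plain Python that assumes self‑attention (Q, K and V are all projected from the same input X) and uses small random matrices in place of trained weights; every name and dimension here is hypothetical.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[c] for wi, v in zip(w, V)) for c in range(len(V[0]))])
    return out

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) projection triples, one per head."""
    head_outputs = []
    for W_q, W_k, W_v in heads:
        head_outputs.append(attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)))
    # Concatenate the heads along the feature axis, token by token.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(X))]
    # Final linear transformation back to the model dimension.
    return matmul(concat, W_o)

rng = random.Random(0)
d_model, num_heads = 4, 2
d_head = d_model // num_heads
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(num_heads)]
W_o = random_matrix(num_heads * d_head, d_model, rng)
X = random_matrix(3, d_model, rng)  # 3 tokens, each of dimension d_model

Y = multi_head_attention(X, heads, W_o)
```

Splitting `d_model` evenly across heads, as done here, keeps the total cost close to that of a single full‑width head; it is a common convention rather than a requirement.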

Positional Encoding

Because self‑attention by itself has no notion of token order (permuting the input tokens simply permutes the outputs in the same way), sinusoidal or learned positional encodings are added to the token embeddings to inject order information.
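
As an illustration of the sinusoidal variant, the sketch below uses the widely cited formulation in which even dimensions take a sine and odd dimensions a cosine, with wavelengths controlled by a base of 10000. The article does not give the formula, so treat the constants and function names as assumptions.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of position encodings."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def add_positions(embeddings):
    """Add the encoding for position t to the embedding at position t."""
    pe = sinusoidal_positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb_row, pe_row)]
            for emb_row, pe_row in zip(embeddings, pe)]

# The same token embedding repeated at three positions.
same_token = [0.1, 0.2, 0.3, 0.4]
encoded = add_positions([same_token, same_token, same_token])
```

After the addition, the three copies of `same_token` are no longer identical, which is what lets attention distinguish their positions.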

Masked (Causal) Attention

During decoding, a triangular mask blocks attention to future positions, ensuring the model generates tokens autoregressively (i.e., each token can only attend to itself and earlier tokens).
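
A minimal sketch of the triangular mask, assuming the usual approach of setting blocked scores to negative infinity before the Softmax so that they receive zero weight. The score values and function names are made up for illustration.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over scores, with disallowed positions forced to zero weight."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because each position may attend to itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Raw attention scores for a 3-token sequence (illustrative values).
scores = [
    [0.5, 1.0, 2.0],
    [0.1, 0.2, 0.3],
    [1.0, 1.0, 1.0],
]
mask = causal_mask(3)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
```

The first token can attend only to itself, so its weight row is `[1.0, 0.0, 0.0]`; the last token sees all three positions.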

Large Language Models (LLMs)

An LLM is a Transformer‑based model trained on massive text corpora to predict the next token. Minimising the prediction error on this objective pushes the model to internalise grammar, semantics, world knowledge and reasoning patterns.
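
The next‑token objective is usually implemented as a cross‑entropy loss. The sketch below assumes that formulation, with a three‑word toy vocabulary and hand‑written logits standing in for real model outputs; all names and values are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy, where position t is scored on predicting token t+1."""
    targets = token_ids[1:]
    total = 0.0
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        total += -math.log(probs[target])
    return total / len(targets)

vocab = ["the", "cat", "sat"]  # toy vocabulary
token_ids = [0, 1, 2]          # the sequence "the cat sat"

# Logits produced after reading "the" and after reading "the cat".
confident = [[0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]  # favours the correct next token
uniform = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # no preference at all

loss_confident = next_token_loss(confident, token_ids)
loss_uniform = next_token_loss(uniform, token_ids)
```

A model with no preference scores `log(3)` on a three‑word vocabulary, while one that concentrates probability on the correct next token scores lower; training adjusts the weights to reduce this number.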

Key Elements

Parameters : Modern LLMs contain from hundreds of millions to hundreds of billions of trainable weights (e.g., GPT‑5, Claude, Gemini). Linguistic knowledge is not stored in any single weight; it is distributed across many of them.

Training Data : Diverse sources such as Wikipedia, books, web pages, dialogues, source code, etc., are combined to teach the model language rules, semantic relations and factual knowledge.

Compute : Training typically runs for weeks or months on large GPU/TPU clusters whose aggregate throughput is measured in petaFLOP/s or more.

Capabilities Learned by LLMs

Grammar : Ability to generate syntactically correct sentences.

Semantic Relations : Understanding of word meanings and of how closely words relate to one another in context.

World Knowledge : Extraction of factual and common‑sense information from the training corpus.

Reasoning : Performing logical inference and multi‑step problem solving when prompted.

Strengths and Limitations

Understanding : Handles complex semantics and long contexts, but can misinterpret ambiguous instructions.

Hallucination : May produce plausible‑looking statements that are factually incorrect because generation is based on learned statistical patterns rather than verified truth.

Stale Knowledge : Training data is static; without external retrieval mechanisms the model cannot provide up‑to‑date information.

Typical Applications

Chain‑of‑Thought (CoT) : Prompting the model to reason step‑by‑step, improving accuracy on reasoning‑heavy tasks.

Retrieval‑Augmented Generation (RAG) : Combining a search engine or vector database with the LLM so that external documents are retrieved and incorporated into the response.

Local Deployment (e.g., Ollama) : Running open‑source LLMs on personal hardware to preserve privacy and gain fine‑grained control.
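
The retrieval step of Retrieval‑Augmented Generation described above can be sketched without any external services. This toy version assumes bag‑of‑words cosine similarity in place of a real search engine or vector database, and it stops at building the prompt rather than calling a model; `retrieve`, `build_prompt` and the sample documents are all hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lower-cased word counts; a crude stand-in for an embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=1):
    """Place the retrieved text ahead of the question, ready to send to an LLM."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

documents = [
    "transformers use self-attention to relate tokens in a sequence",
    "ollama runs open-source language models on local hardware",
    "recurrent networks update a hidden state one token at a time",
]

prompt = build_prompt("how do transformers relate tokens", documents)
```

The resulting `prompt` would then be passed to whichever model is in use. Because the answer is grounded in retrieved text, this pattern can help with the stale‑knowledge and hallucination problems noted earlier.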

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
