From RNN to LLM: How Transformers Power Modern Language Models
This article explains the evolution from RNNs through Encoder‑Decoder models to Transformers, detailing self‑attention, multi‑head attention, and masked attention, and then describes what Large Language Models are, their key components, capabilities, limitations, and common applications.
Recurrent Neural Network (RNN)
Feature: Processes a sequence token‑by‑token, maintaining a hidden state that is updated at each step.
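Limitation: Gradients tend to vanish (or explode) as they are propagated back through many time steps, making it difficult to capture long‑range dependencies. Because each step depends on the previous hidden state, computation cannot be parallelised across the sequence, which leads to long training times.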
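The token‑by‑token update described above can be illustrated with a minimal sketch of a simple (Elman‑style) recurrent layer in plain Python. The function name `rnn_forward`, the weight names and the choice of a tanh activation are assumptions made for illustration, not something the article specifies.

```python
import math

def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Run a simple tanh RNN over a sequence and return every hidden state.

    inputs: list of input vectors, one per token
    w_xh:   hidden_size x input_size weight matrix
    w_hh:   hidden_size x hidden_size weight matrix
    b_h:    bias vector of length hidden_size
    """
    hidden_size = len(b_h)
    h = [0.0] * hidden_size          # initial hidden state
    states = []
    for x in inputs:                 # strictly sequential: step t needs h from step t-1
        new_h = []
        for i in range(hidden_size):
            total = b_h[i]
            total += sum(w_xh[i][k] * x[k] for k in range(len(x)))
            total += sum(w_hh[i][j] * h[j] for j in range(hidden_size))
            new_h.append(math.tanh(total))
        h = new_h
        states.append(h)
    return states

# Two-token toy sequence with a single hidden unit.
states = rnn_forward([[1.0], [0.0]], w_xh=[[1.0]], w_hh=[[0.5]], b_h=[0.0])
```

The loop over `inputs` cannot be run in parallel, since each iteration reads the hidden state written by the previous one. This is the sequential bottleneck that the Transformer removes.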
Encoder‑Decoder Architecture
Typical tasks: Machine translation, summarisation, and other sequence‑to‑sequence problems.
Design: An Encoder converts the input sequence into a continuous latent representation; a Decoder generates the output sequence from that representation, often using attention to focus on relevant encoder states.
Transformer Architecture
Fully parallelisable computation, which dramatically speeds up training compared with recurrent models.
Excels at modelling long‑distance relationships through self‑attention.
Core components:
Self‑Attention
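For each token, the model computes three vectors, Query (Q), Key (K) and Value (V), through learned linear projections of the token's embedding. The attention score between token i and token j is the dot product of token i's Q vector and token j's K vector, scaled by \(\sqrt{d_k}\). The scores for token i are normalised with Softmax, and the output for token i is the resulting weighted sum of the V vectors.
\[\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V\]
Scaled Dot‑Product Attention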
The division by \(\sqrt{d_k}\) (where \(d_k\) is the dimensionality of the key vectors) prevents the dot‑product values from growing too large, which would push the Softmax into regions with vanishing gradients.
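As a rough illustration of the formula above, the following sketch computes scaled dot‑product attention with plain Python lists. The function names are hypothetical, and Q, K and V are passed in directly rather than produced by learned projections.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V.

    Q, K and V are lists of row vectors, one row per token.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        out = [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        outputs.append(out)
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the rows of `V`.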
Multi‑Head Attention
Instead of a single attention operation, the Transformer runs h parallel attention heads, each with its own learned linear projections of Q, K and V. Each head can attend to a different representation subspace; the head outputs are then concatenated and passed through a final linear projection.
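The project, attend, concatenate and project‑again flow can be sketched in plain Python as follows. All names here (`matmul`, `attention`, `multi_head_attention`, the selector matrices) are hypothetical, and the hand‑picked weights stand in for parameters that a real model would learn.

```python
import math

def matmul(A, B):
    """Multiply matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]            # transpose of K
    scores = matmul(Q, K_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one per head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the per-head outputs token by token.
    concat = [sum((ho[t] for ho in head_outputs), []) for t in range(len(X))]
    return matmul(concat, W_o)

# Toy setup: d_model = 4, h = 2, so each head works in a 2-dimensional subspace.
sel_lo = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]   # picks dims 0-1
sel_hi = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # picks dims 2-3
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
X = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
Y = multi_head_attention(X, [(sel_lo, sel_lo, sel_lo), (sel_hi, sel_hi, sel_hi)], identity)
```

Here each head only sees its own slice of the embedding, which is one simple way to picture heads working in different subspaces. Trained projections are dense and are not restricted to slices like this.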
Positional Encoding
Because the self‑attention mechanism has no built‑in notion of token order (permuting the input tokens merely permutes the outputs), sinusoidal or learned positional encodings are added to the token embeddings to inject order information.
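A minimal sketch of the sinusoidal variant is shown below, using the formulation from the original Transformer paper: even dimensions use a sine, odd dimensions use a cosine, and wavelengths grow geometrically with a base of 10000. The function name is an assumption for illustration.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):              # i is the even dimension index
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(seq_len=3, d_model=4)
# Each row would be added element-wise to the embedding of the token at that position.
```

Position 0 always encodes to alternating 0 and 1 values, because sin(0) = 0 and cos(0) = 1.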
Masked (Causal) Attention
During decoding, a triangular mask blocks attention to future positions, ensuring the model generates tokens autoregressively (i.e., each token can only attend to itself and earlier tokens).
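The triangular mask can be sketched as follows: blocked positions are set to negative infinity before the Softmax, so they receive exactly zero attention weight. The helper names are hypothetical.

```python
import math

def causal_mask(n):
    """Lower-triangular mask: position i may attend to position j only if j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores with blocked positions forced to zero weight."""
    masked = [s if keep else -math.inf for s, keep in zip(scores, mask_row)]
    m = max(masked)                       # finite, since the diagonal is never blocked
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

n = 3
scores = [[0.0] * n for _ in range(n)]    # equal raw scores, to make the mask visible
mask = causal_mask(n)
weights = [masked_softmax(scores[i], mask[i]) for i in range(n)]
```

With equal raw scores, the first token attends only to itself, the second splits its weight evenly over two positions, and the third over three.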
Large Language Models (LLMs)
An LLM is a Transformer‑based model trained on massive text corpora to predict the next token. The objective forces the model to internalise grammar, semantics, world knowledge and reasoning patterns.
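The next‑token objective is commonly implemented as a cross‑entropy loss between the model's predicted distribution and the token that actually follows. The sketch below assumes the model has already produced one vector of logits per input position; the function name and the toy token IDs are illustrative only.

```python
import math

def next_token_loss(logits, target_ids):
    """Mean cross-entropy, where logits[t] scores the vocabulary for target_ids[t]."""
    total = 0.0
    for row, target in zip(logits, target_ids):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))   # log-sum-exp
        total += log_z - row[target]                               # -log p(target)
    return total / len(target_ids)

tokens = [2, 0, 3]            # a toy tokenised sentence over a 4-token vocabulary
inputs = tokens[:-1]          # the model sees tokens 0..n-2
targets = tokens[1:]          # and is trained to predict tokens 1..n-1
uniform_logits = [[0.0] * 4 for _ in inputs]   # an untrained, maximally unsure model
loss = next_token_loss(uniform_logits, targets)
```

A model that spreads probability evenly over a vocabulary of size V incurs a loss of log V; training lowers the loss by shifting probability mass toward the tokens that actually occur next.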
Key Elements
Parameters : Modern LLMs (e.g., GPT‑5, Claude, Gemini) contain from hundreds of millions to hundreds of billions of trainable weights. Linguistic knowledge is distributed across these weights rather than stored in any single one.
Training Data : Diverse sources such as Wikipedia, books, web pages, dialogues, source code, etc., are combined to teach the model language rules, semantic relations and factual knowledge.
Compute : Training typically runs for weeks or months on large GPU/TPU clusters whose aggregate throughput is measured in petaFLOP/s or more.
Capabilities Learned by LLMs
Grammar : Ability to generate syntactically correct sentences.
Semantic Relations : Representing word meanings and how closely words are related in a given context.
World Knowledge : Extraction of factual and common‑sense information from the training corpus.
Reasoning : Performing logical inference and multi‑step problem solving when prompted.
Strengths and Limitations
Understanding : Handles complex semantics and long contexts, but can misinterpret ambiguous instructions.
Hallucination : May produce plausible‑looking statements that are factually incorrect because generation is based on learned statistical patterns rather than verified truth.
Stale Knowledge : Training data is static; without external retrieval mechanisms the model cannot provide up‑to‑date information.
Typical Applications
Chain‑of‑Thought (CoT) : Prompting the model to reason step‑by‑step, improving accuracy on reasoning‑heavy tasks.
Retrieval‑Augmented Generation (RAG) : Combining a search engine or vector database with the LLM so that external documents are retrieved and incorporated into the response.
Local Deployment (e.g., Ollama) : Running open‑source LLMs on personal hardware to preserve privacy and gain fine‑grained control.
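The retrieval‑augmented generation pattern listed above can be sketched with a toy retriever. Everything here is a simplified assumption: `embed` is a bag‑of‑words stand‑in for a real embedding model, the in‑memory list stands in for a vector database, and the resulting prompt would be passed to whichever LLM is in use.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=1):
    """Insert the retrieved passages into the prompt ahead of the question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Ollama runs open-source models locally",
    "Transformers use self-attention",
]
prompt = build_prompt("what do transformers use", docs)
```

Because the retrieved text is supplied at query time, the model can draw on documents that were not part of its training data, which is how RAG mitigates the stale‑knowledge limitation described earlier.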
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.