Why Transformers Power Modern LLMs: A Deep Dive into Architecture and Mechanics

This article provides a comprehensive, step‑by‑step explanation of the Transformer architecture that underpins large language models, covering tokenization, embeddings, positional encoding, attention mechanisms, feed‑forward networks, layer stacking, a detailed translation example, visualized attention weights, and a survey of recent open‑source LLM designs such as DeepSeek V3, OLMo 2, and Gemma 3.

Tencent Technical Engineering

1. Transformer fundamentals

The model first tokenizes a sentence into discrete tokens (e.g., Transformer, is, powerful, .) and maps each token to a unique integer ID via a vocabulary. Each token ID is then looked up in an embedding matrix to obtain a high‑dimensional vector; for example, GPT‑2 uses 768‑dimensional embeddings and GPT‑3 uses 12,288‑dimensional embeddings, while the article's toy example uses a 4‑dimensional embedding matrix. Positional encoding (absolute or relative) is added to the token embeddings to inject order information.
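
A minimal sketch of this pipeline in NumPy, using a made‑up four‑word vocabulary and random 4‑dimensional embeddings (the vocabulary, IDs, and weights are illustrative, not taken from any real model); the positional encoding shown is the absolute sinusoidal variant from the original Transformer paper:

```python
import numpy as np

# Toy vocabulary and 4-dim embeddings, mirroring the article's toy example.
# Vocabulary, IDs, and embedding values are illustrative placeholders.
vocab = {"Transformer": 0, "is": 1, "powerful": 2, ".": 3}
d_model = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Absolute sinusoidal positional encoding from 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])              # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])              # odd dimensions: cosine
    return pe

tokens = ["Transformer", "is", "powerful", "."]
token_ids = np.array([vocab[t] for t in tokens])       # tokenization -> integer IDs
x = embedding_table[token_ids]                         # (4, 4) token embeddings
x = x + sinusoidal_positional_encoding(len(tokens), d_model)  # inject order information
print(token_ids, x.shape)
```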

2. Attention mechanism

Self‑attention computes queries (Q), keys (K), and values (V) from the same token sequence, forms attention scores as Q·Kᵀ, scales them by the square root of the key dimension, applies a softmax to obtain attention weights, and produces a context vector as the weighted sum of V. Causal (masked) attention prevents each token from attending to future tokens, which is essential for autoregressive language modeling. Multi‑head attention runs several self‑attention heads in parallel, allowing each head to capture a different representation subspace (e.g., syntax vs. semantics). An example with two heads shows how the English word “powerful” aligns with the Chinese characters “强” and “大”.
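
The following sketch implements scaled dot‑product self‑attention with an optional causal mask; the sequence length, model width, and random projection matrices are placeholders chosen only for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """softmax(Q.K^T / sqrt(d_k)) . V, optionally with a causal (look-ahead) mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq, seq) raw attention scores
    if causal:
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)          # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ V, weights                        # context vectors + attention weights

# Toy self-attention: Q, K, V are linear projections of the same 4-token sequence.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                            # 4 tokens, d_model = 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
context, attn = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v, causal=True)
print(attn.round(2))   # lower-triangular: each token attends only to itself and the past
```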

3. Feed‑forward network (FFN/MLP)

Each Transformer layer contains a position‑wise feed‑forward network, typically a two‑layer MLP that expands the hidden dimension, applies a GELU activation, and projects back. Residual connections and layer‑normalization surround the attention and FFN sub‑layers.
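
A minimal sketch of the FFN sub‑layer, assuming a pre‑norm residual arrangement and the tanh approximation of GELU; the 4× expansion factor and the sizes below are illustrative choices, not any specific model's configuration:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn_sublayer(x, W1, b1, W2, b2):
    """Position-wise FFN: normalize, expand, GELU, project back, add the residual."""
    h = gelu(layer_norm(x) @ W1 + b1)   # (seq, d_ff) expansion
    return x + (h @ W2 + b2)            # project back to d_model and add residual

# Illustrative sizes only: d_model = 8, hidden = 4 * d_model = 32.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
x = rng.normal(size=(4, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)
print(ffn_sublayer(x, W1, b1, W2, b2).shape)   # (4, 8): shape preserved per position
```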

4. Stacking Transformer layers

Multiple identical layers are stacked to increase model depth. Lower layers tend to capture lexical patterns, while higher layers encode more abstract semantic relationships. Modern large language models (LLMs) are usually decoder‑only Transformers that generate the next token conditioned on previously generated tokens.
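
To make the stacking concrete, the sketch below composes a causal self‑attention sub‑layer and an FFN sub‑layer into one decoder block and applies several identical blocks in sequence. The weights are random placeholders, and ReLU stands in for the activation only to keep the code short:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, seq_len = 8, 4, 5

def norm(x):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + 1e-5)

def causal_self_attention(x, W):
    q, k, v = x @ W["q"], x @ W["k"], x @ W["v"]
    scores = q @ k.T / np.sqrt(d_model)
    scores = np.where(np.triu(np.ones_like(scores, dtype=bool), 1), -1e9, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ v

def decoder_block(x, W):
    x = x + causal_self_attention(norm(x), W)   # attention sub-layer + residual
    h = np.maximum(norm(x) @ W["up"], 0.0)      # FFN expansion (ReLU here for brevity)
    return x + h @ W["down"]                    # FFN projection + residual

# N identical blocks stacked; weights are random placeholders, not trained parameters.
layers = [{"q": rng.normal(size=(d_model, d_model)) * 0.1,
           "k": rng.normal(size=(d_model, d_model)) * 0.1,
           "v": rng.normal(size=(d_model, d_model)) * 0.1,
           "up": rng.normal(size=(d_model, 4 * d_model)) * 0.1,
           "down": rng.normal(size=(4 * d_model, d_model)) * 0.1}
          for _ in range(n_layers)]

x = rng.normal(size=(seq_len, d_model))         # embedded input sequence
for W in layers:                                # depth = n_layers stacked blocks
    x = decoder_block(x, W)
print(x.shape)                                  # (5, 8): final hidden states
```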

5. Step‑by‑step translation example

The article walks through decoding the English sentence “Transformer is powerful.” into the Chinese sentence “Transformer很强大。”:

Input the <START> token; the decoder attends to the encoder output and predicts “Transformer”.

With “Transformer” generated, the decoder predicts the adverb “很” by attending to the source token “is”.

Using the partial output “Transformer 很”, the decoder focuses on the encoder key for “powerful” and predicts “强”.

The same context yields the second character “大”.

Finally, the decoder emits the sentence‑final punctuation “。”.

Attention heatmaps illustrate strong connections: Transformer ↔ Transformer (0.95 weight), is ↔ 很 (0.80), and powerful ↔ 强 / 大 (0.50 and 0.35, respectively).
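
The decoding loop behind these steps can be sketched as greedy autoregressive generation. In the snippet below, toy_model is a hypothetical stand‑in that simply replays the reference translation; a real decoder would compute the next‑token distribution from masked self‑attention over the partial output plus cross‑attention over the encoder states:

```python
import numpy as np

# Hypothetical greedy decoding loop for the translation example above.
# `toy_model` is a placeholder that replays the reference translation.
target_vocab = ["<START>", "Transformer", "很", "强", "大", "。", "<END>"]
reference = ["Transformer", "很", "强", "大", "。", "<END>"]

def toy_model(src_tokens, partial_output):
    # Return a one-hot "probability" distribution over the target vocabulary.
    probs = np.zeros(len(target_vocab))
    probs[target_vocab.index(reference[len(partial_output) - 1])] = 1.0
    return probs

def greedy_decode(model, src_tokens, max_len=10):
    out = ["<START>"]
    for _ in range(max_len):
        probs = model(src_tokens, out)                  # next-token distribution
        next_token = target_vocab[int(probs.argmax())]  # greedy: take the argmax
        if next_token == "<END>":
            break
        out.append(next_token)
    return out[1:]                                      # drop the <START> token

print(greedy_decode(toy_model, ["Transformer", "is", "powerful", "."]))
# -> ['Transformer', '很', '强', '大', '。']
```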

6. Recent open‑source LLM architectures (2025)

DeepSeek V3 / R1: Replaces dense feed‑forward layers with Mixture‑of‑Experts (MoE) layers in which only a few experts are activated per token, and introduces Multi‑Head Latent Attention (MLA), which compresses keys and values into a low‑dimensional latent to shrink the KV cache for memory efficiency.

OLMo 2: Places RMSNorm after the attention and feed‑forward modules, adds Q‑K normalization, and later adopts Grouped‑Query Attention (GQA) in larger variants.

Gemma 3: Uses sliding‑window attention so that most layers attend only within a local window, interleaves these local layers with global‑attention layers, and applies RMSNorm both before and after the sub‑layers for stability.

MoE trend: Models such as Llama 4, Qwen 3, and many others adopt MoE layers to boost capacity while keeping inference cost low, often mixing dense and MoE blocks.
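
A minimal sketch of the top‑k expert routing that these MoE layers share, for a single token vector; the gate, expert count, expert sizes, and k below are illustrative placeholders, not any particular model's configuration:

```python
import numpy as np

def moe_layer(x, gate_W, experts, k=2):
    """Top-k Mixture-of-Experts routing for one token vector x.
    Gate scores select k experts; their outputs are mixed with softmax weights."""
    logits = x @ gate_W                                  # one score per expert
    top = np.argsort(logits)[-k:]                        # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights = weights / weights.sum()                    # renormalize over chosen experts
    # Only the selected experts run, which is how MoE adds capacity cheaply.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Illustrative setup: 8 tiny "experts", each a small two-layer MLP.
rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 16, 8

def make_expert():
    W1 = rng.normal(size=(d_model, d_ff)) * 0.1
    W2 = rng.normal(size=(d_ff, d_model)) * 0.1
    return lambda x: np.maximum(x @ W1, 0.0) @ W2        # tiny ReLU MLP expert

experts = [make_expert() for _ in range(n_experts)]
gate_W = rng.normal(size=(d_model, n_experts)) * 0.1
token = rng.normal(size=d_model)
print(moe_layer(token, gate_W, experts, k=2).shape)      # (8,): same width as the input
```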

7. References

Attention Is All You Need – https://arxiv.org/pdf/1706.03762

Modular RAG – https://arxiv.org/pdf/2407.21059

DeepSeek V3 architecture – https://github.com/antgroup/llm-oss-landscape/blob/main/reports/250913_llm_landscape/250913_Slides.pdf

LLM Visualization – https://bbycroft.net/llm
