Demystifying LLM Architecture: From Transformers to Modern MoE Designs
This comprehensive guide explains the fundamentals of large language model (LLM) architectures, covering the original Transformer, tokenization, embeddings, positional encoding, attention mechanisms, feed‑forward networks, layer stacking, a step‑by‑step translation example, and the latest open‑source and hybrid LLM designs shaping the field.
1. Introduction
The article provides a deep‑dive into the core concepts behind large language models (LLMs), starting from the seminal Attention Is All You Need paper and moving toward modern architectures that combine Transformers with Mixture‑of‑Experts (MoE) and other innovations.
2. LLM Architecture Overview
Current mainstream LLMs are built on a decoder‑only Transformer stack. The encoder‑decoder variant is also described for completeness, but most large models such as GPT‑3, GPT‑4, DeepSeek‑R1, and upcoming 2025 releases rely on a pure decoder architecture.
3. Transformer Basics
3.1 Tokenization
Input text is first split into tokens and mapped to integer indices via a vocabulary; special tokens such as <PAD>, <UNK>, <START>, and <END> handle padding, out-of-vocabulary words, and sequence boundaries.
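A minimal sketch of this step, assuming a hypothetical whitespace tokenizer (real LLMs use subword schemes such as BPE or SentencePiece; the function names here are illustrative only):

```python
# Hypothetical whitespace tokenizer with the special tokens from the article.
SPECIAL = ["<PAD>", "<UNK>", "<START>", "<END>"]

def build_vocab(corpus):
    """Assign ids to the special tokens first, then to each unique word."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL)}
    for word in corpus.split():
        vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab):
    """Wrap token ids with <START>/<END>; unknown words map to <UNK>."""
    ids = [vocab["<START>"]]
    ids += [vocab.get(w, vocab["<UNK>"]) for w in text.split()]
    ids.append(vocab["<END>"])
    return ids

vocab = build_vocab("Transformer is powerful .")
ids = encode("Transformer is powerful .", vocab)   # [2, 4, 5, 6, 7, 3]
```

A word the vocabulary has never seen (say, "Mamba") would map to the <UNK> id, which is exactly the failure mode subword tokenizers were designed to avoid.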
3.2 Embedding
Each token is projected into a continuous vector space (the embedding). Example embedding vectors are shown for the sentence "Transformer is powerful.".
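The lookup itself is just row indexing into a learned matrix. A sketch with random values (the numbers are not the article's example vectors, and the dimensions are chosen small for readability):

```python
import numpy as np

# Embedding table: one learned row of length d_model per vocabulary entry.
rng = np.random.default_rng(0)
vocab_size, d_model = 8, 4
embedding = rng.normal(size=(vocab_size, d_model))

token_ids = [2, 4, 5, 6, 3]        # e.g. <START> Transformer is powerful <END>
vectors = embedding[token_ids]     # shape (5, 4): one row per token
```

During training, gradients flow back into exactly the rows that were looked up, so frequent tokens get updated most often.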
3.3 Positional Encoding
Because the Transformer processes tokens in parallel, a positional encoding vector of the same dimension as the embedding is added to each token to convey order information. Both absolute and relative encodings are discussed.
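The absolute sinusoidal variant from the original paper can be sketched in a few lines (dimensions here are illustrative):

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(seq_len=5, d_model=8)
# x = token_embeddings + pe   # added element-wise before the first layer
```

Because each position is a fixed pattern of sines and cosines at different frequencies, nearby positions get similar vectors, which is what lets the model infer relative order.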
4. Attention Mechanism
4.1 Self‑Attention
For each token, a query (Q), key (K), and value (V) vector are computed. The attention weights are the dot products of Q with each K, scaled by 1/√d_k and passed through a softmax; these weights are then used to average the V vectors, producing a context vector.
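The whole computation is two matrix multiplications and a softmax. A self-contained sketch with random Q, K, V (the projections that produce them from the embeddings are omitted here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
ctx, w = attention(Q, K, V)   # ctx: one context vector per token
```

Each row of `w` sums to 1, so every context vector is a convex combination of the value vectors.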
4.2 Multi‑Head Attention
Multiple attention heads run in parallel, allowing the model to capture different types of relationships (e.g., syntactic vs. semantic). The heads are concatenated and linearly projected.
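The split-attend-concatenate-project pipeline can be sketched as follows (a single-sequence version with randomly initialized projection matrices, omitting biases for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, n_heads, Wq, Wk, Wv, Wo):
    """Split d_model across n_heads, attend per head, concat, project with Wo."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv

    def split(t):  # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    ctx = softmax(scores) @ Vh                        # (n_heads, seq_len, d_head)
    concat = ctx.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, n_heads, Wq, Wk, Wv, Wo)
```

Note that the per-head dimension is d_model / n_heads, so adding heads does not add parameters; it partitions the same representation into independent attention patterns.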
4.3 Causal (Masked) Attention
During generation, the model masks future tokens so that each position can only attend to previous positions, ensuring autoregressive behavior.
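The mask is implemented by setting the scores for future positions to negative infinity before the softmax, which drives their weights to zero:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Scaled dot-product attention with future positions masked out."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above diagonal
    scores = np.where(mask, -np.inf, scores)               # -inf -> weight 0
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
_, w = causal_attention(Q, K, V)
# w[i, j] == 0 for every j > i: position i never attends to the future
```

The first row of `w` is therefore [1, 0, 0, 0]: the first token can only attend to itself.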
5. Feed‑Forward Network (FFN) / MLP
Each Transformer layer contains an FFN that expands the hidden dimension (typically by 4x), applies a non-linear activation such as GELU, and projects back to the original size. Residual connections and layer normalization are applied around both the attention and FFN sub-layers.
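A sketch of one pre-norm FFN sub-layer, assuming the tanh approximation of GELU and omitting the learnable layer-norm scale and shift for brevity:

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def ffn_sublayer(x, W1, b1, W2, b2):
    """Pre-norm FFN sub-layer: x + FFN(LayerNorm(x)). Expand, activate, project."""
    h = gelu(layer_norm(x) @ W1 + b1)   # (seq, d_ff)
    return x + h @ W2 + b2              # residual connection

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                   # the typical 4x expansion
x = rng.normal(size=(5, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = ffn_sublayer(x, W1, b1, W2, b2)
```

The residual path means the sub-layer only has to learn a correction to its input, which is what makes deep stacks of these layers trainable.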
6. Stacking Transformer Layers
Multiple identical layers are stacked. Lower layers tend to learn lexical and syntactic patterns, while higher layers capture semantic and contextual information.
7. Example: Step‑by‑Step Translation
The article walks through translating "Transformer is powerful." into Chinese "Transformer 很强大。". Each generation step ( <START>, "Transformer", "很", "强", "大", "。") is illustrated with the corresponding attention focus and Q‑K‑V calculations.
8. Current Open‑Source Flagship LLMs
Figures show the rapid evolution of open‑source models up to 2025, including DeepSeek‑V3/R1, Llama‑4, Qwen‑3, and others. Many adopt MoE or hybrid designs to improve efficiency.
9. Emerging Architectures
Beyond pure Transformers, the article mentions:
MoE‑based models that activate only a subset of expert layers per token.
Hybrid Transformer‑Mamba architectures (e.g., Tencent’s mixed‑model) that combine recurrent‑style modules with attention.
These trends aim to reduce inference cost while preserving or enhancing model capability.
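The MoE idea above can be sketched as a gated layer that runs only the top-k expert FFNs per token; this is an illustrative top-k router, not any specific model's implementation (real systems add load-balancing losses and batched expert dispatch):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, gate_W, experts, top_k=2):
    """Route each token to its top_k experts; mix their outputs by gate weight."""
    probs = softmax(x @ gate_W)               # (seq, n_experts)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-top_k:]   # indices of the top_k experts
        w = probs[t, top] / probs[t, top].sum()  # renormalize selected gates
        for weight, e in zip(w, top):
            W1, W2 = experts[e]               # each expert is a small ReLU FFN
            out[t] += weight * (np.maximum(x[t] @ W1, 0) @ W2)
    return out

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 16, 4
x = rng.normal(size=(5, d_model))
gate_W = rng.normal(size=(d_model, n_experts))
experts = [(rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
           for _ in range(n_experts)]
y = moe_layer(x, gate_W, experts)
```

With top_k=2 of 4 experts, only half the expert parameters are active per token, which is the efficiency gain these architectures trade against routing complexity.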
10. Conclusion
Understanding the fundamentals—tokenization, embeddings, positional encoding, attention, FFN, and layer stacking—helps practitioners make informed decisions when building or fine‑tuning LLM applications. As model capabilities grow, many current application‑level workarounds (e.g., RAG, prompt engineering) may become unnecessary.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.