What’s Driving the Latest LLM Architecture Trends? DeepSeek, OLMo, Gemma, and More Explained
This article examines the evolution of large language model architectures over the past seven years, comparing key design choices such as Multi‑Head Latent Attention, Grouped‑Query Attention, Mixture‑of‑Experts, sliding‑window attention, normalization placement, and optimizer variants across models like DeepSeek V3, OLMo 2, Gemma 3, Llama 4, Qwen 3, SmolLM 3, and Kimi 2.
The past seven years have seen the core GPT architecture evolve from the original GPT to GPT‑2, GPT‑3, and now to models such as DeepSeek‑V3, OLMo 2, Gemma 3, Llama 4, Qwen 3, SmolLM 3, and Kimi 2. While the transformer backbone remains, many subtle design changes affect efficiency, scalability, and performance.
DeepSeek V3 / R1
DeepSeek V3 introduced two major architectural innovations that improve inference efficiency: Multi‑Head Latent Attention (MLA) and Mixture‑of‑Experts (MoE).
Multi‑Head Latent Attention (MLA)
MLA is best understood against Grouped‑Query Attention (GQA), which reduces memory usage by sharing key/value projections across groups of query heads. MLA takes a different route: it compresses keys and values into a lower‑dimensional latent space before caching them and projects them back up to full size during inference. This shrinks the KV cache at the cost of a small extra matrix multiplication.
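The compression idea can be sketched in a few lines of PyTorch. The module below is illustrative only: the names (LatentKVCompression, kv_down, k_up, v_up) and dimensions are placeholders rather than DeepSeek's actual implementation, and it omits details such as MLA's decoupled RoPE handling.

import torch
import torch.nn as nn

class LatentKVCompression(nn.Module):
    # Illustrative MLA-style compression: cache a small latent instead of full keys/values.
    def __init__(self, d_model, d_latent, num_heads, head_dim):
        super().__init__()
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)            # compress to latent
        self.k_up = nn.Linear(d_latent, num_heads * head_dim, bias=False)  # expand back to keys
        self.v_up = nn.Linear(d_latent, num_heads * head_dim, bias=False)  # expand back to values

    def forward(self, x):
        latent = self.kv_down(x)    # (batch, seq, d_latent) -- this is what gets cached
        keys = self.k_up(latent)    # (batch, seq, num_heads * head_dim)
        values = self.v_up(latent)
        return latent, keys, values

Because only the latent tensor is cached, KV‑cache memory scales with d_latent rather than with num_heads * head_dim; the up‑projections are the small matrix‑multiply overhead mentioned above.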
Mixture‑of‑Experts (MoE)
MoE replaces each feed‑forward block with many expert feed‑forward layers, of which only a small subset is activated per token. DeepSeek‑V3 has 256 experts per MoE layer and 671 B total parameters, but a router activates only nine experts per token (eight routed experts plus one always‑on shared expert), so roughly 37 B parameters are active per token. This keeps inference cost modest while providing very large model capacity.
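To make the routing concrete, here is a toy top‑k MoE layer with a single shared expert. It is a sketch only: the expert count, hidden size, and the naive per‑token loop are chosen for readability, and none of the names correspond to DeepSeek‑V3's actual code.

import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    # Toy top-k MoE feed-forward layer with one always-active shared expert.
    def __init__(self, d_model, d_hidden, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )
        self.top_k = top_k

    def forward(self, x):                                   # x: (num_tokens, d_model)
        scores = self.router(x)                             # (num_tokens, num_experts)
        topk_scores, topk_idx = torch.topk(scores, self.top_k, dim=-1)
        topk_weights = torch.softmax(topk_scores, dim=-1)
        routed = []
        for t in range(x.shape[0]):                         # naive per-token loop, for clarity only
            y = sum(w * self.experts[int(i)](x[t]) for w, i in zip(topk_weights[t], topk_idx[t]))
            routed.append(y)
        # The shared expert processes every token; routed experts are weighted per token.
        return self.shared_expert(x) + torch.stack(routed)

All experts contribute to the parameter count, but each token only pays the compute cost of the shared expert plus its top‑k routed experts.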
Overall, MLA reduces KV‑cache usage and MoE supplies a large capacity with modest inference cost.
OLMo 2
OLMo 2, released by the Allen Institute for AI, focuses on transparent training data and code. Its notable architectural choices are the placement of RMSNorm after attention and feed‑forward modules (a Post‑Norm variant) and the introduction of QK‑Norm, an RMSNorm applied to queries and keys before RoPE.
Normalization placement
OLMo 2 uses RMSNorm after each sub‑layer (Post‑Norm) instead of the Pre‑Norm used by many GPT‑style models. This placement improves training stability when combined with QK‑Norm.
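As a schematic, the block below applies RMSNorm to the output of each sub‑layer while keeping it inside the residual branch. It is a sketch, not OLMo 2's actual code: attn and ffn are stand‑ins for the real attention and feed‑forward modules, and nn.RMSNorm assumes a recent PyTorch version.

import torch.nn as nn

class PostNormStyleBlock(nn.Module):
    # Schematic only: normalize each sub-layer's output before the residual addition.
    def __init__(self, d_model, attn, ffn):
        super().__init__()
        self.attn = attn                      # placeholder attention module
        self.ffn = ffn                        # placeholder feed-forward module
        self.attn_norm = nn.RMSNorm(d_model)  # requires PyTorch >= 2.4
        self.ffn_norm = nn.RMSNorm(d_model)

    def forward(self, x):
        x = x + self.attn_norm(self.attn(x))  # norm after attention, inside the residual
        x = x + self.ffn_norm(self.ffn(x))    # norm after the feed-forward block
        return x

A Pre‑Norm block would instead compute x = x + self.attn(self.attn_norm(x)), normalizing the sub‑layer input rather than its output.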
QK‑Norm
QK‑Norm normalizes the query and key vectors inside the attention block before RoPE is applied. The following abridged PyTorch sketch shows Grouped‑Query Attention with optional QK‑Norm; RoPE and the attention computation itself are elided.
import torch
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    def __init__(self, d_in, num_heads, num_kv_groups, head_dim=None, qk_norm=False, dtype=None):
        super().__init__()
        assert num_heads % num_kv_groups == 0
        self.num_heads = num_heads
        self.num_kv_groups = num_kv_groups
        self.head_dim = head_dim if head_dim is not None else d_in // num_heads
        self.d_out = num_heads * self.head_dim
        self.W_query = nn.Linear(d_in, self.d_out, bias=False, dtype=dtype)
        self.W_key = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.W_value = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.out_proj = nn.Linear(self.d_out, d_in, bias=False, dtype=dtype)
        if qk_norm:
            # QK-Norm: RMSNorm over the head dimension, applied before RoPE (PyTorch >= 2.4)
            self.q_norm = nn.RMSNorm(self.head_dim, eps=1e-6)
            self.k_norm = nn.RMSNorm(self.head_dim, eps=1e-6)
        else:
            self.q_norm = self.k_norm = None

    def forward(self, x, mask, cos, sin):
        b, num_tokens, _ = x.shape
        # Queries get num_heads heads; keys/values get only num_kv_groups heads (shared per group)
        queries = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        keys = self.W_key(x).view(b, num_tokens, self.num_kv_groups, self.head_dim).transpose(1, 2)
        values = self.W_value(x).view(b, num_tokens, self.num_kv_groups, self.head_dim).transpose(1, 2)
        if self.q_norm is not None:
            queries = self.q_norm(queries)
        if self.k_norm is not None:
            keys = self.k_norm(keys)
        # Apply RoPE and the attention computation ...

QK‑Norm together with Post‑Norm improves training stability.
Gemma 3
Google’s Gemma 3 focuses on efficiency through sliding‑window attention. This local attention limits each token’s context to a moving window, dramatically reducing KV‑cache memory while preserving most modeling performance.
Gemma 3 uses a 5:1 ratio of local to global attention layers and a window size of 1024 tokens (down from 4096 in Gemma 2). Ablation studies show negligible impact on perplexity.
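One way to picture the difference is as a change to the attention mask. The helper below is a sketch, not Gemma's code: it builds a causal mask in which each query can attend only to itself and the previous window_size - 1 tokens.

import torch

def sliding_window_causal_mask(seq_len, window_size):
    # True entries are masked out (not attended to).
    pos = torch.arange(seq_len)
    dist = pos.unsqueeze(1) - pos.unsqueeze(0)  # query index minus key index
    future = dist < 0                           # standard causal masking of future tokens
    too_far = dist >= window_size               # tokens outside the local window
    return future | too_far

# With seq_len=6 and window_size=3, the query at position 5 attends only to positions 3, 4, 5.
mask = sliding_window_causal_mask(6, 3)

Because keys and values outside the window are never needed again, the KV cache for local layers can be truncated to the window size, which is where the memory savings come from.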
Normalization layout
Gemma 3 places RMSNorm both before and after the Grouped‑Query Attention block, effectively combining the benefits of Pre‑Norm and Post‑Norm without significant overhead.
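Schematically, the attention sub‑layer then looks like the snippet below; attn again stands in for the grouped‑query attention module, and this is a sketch rather than Gemma 3's actual code.

import torch.nn as nn

class PrePostNormSublayer(nn.Module):
    # Sketch of Gemma-3-style normalization: RMSNorm on the sub-layer input *and* output,
    # both inside the residual branch.
    def __init__(self, d_model, attn):
        super().__init__()
        self.attn = attn
        self.pre_norm = nn.RMSNorm(d_model)   # requires PyTorch >= 2.4
        self.post_norm = nn.RMSNorm(d_model)

    def forward(self, x):
        return x + self.post_norm(self.attn(self.pre_norm(x)))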
Gemma 3n (small‑device variant)
Gemma 3n introduces per‑layer embedding (PLE) to keep the core transformer on GPU while loading embeddings from CPU/SSD on demand, saving memory.
Mistral Small 3.1
Mistral Small 3.1 (24 B) outperforms Gemma 3 27 B on most benchmarks while being faster, thanks to a custom tokenizer, fewer layers, and a reduced KV‑cache. It uses standard Grouped‑Query Attention without sliding‑window attention.
Llama 4
Llama 4 Maverick adopts an MoE design similar to DeepSeek V3, but it activates fewer parameters per token (roughly 17 B out of about 400 B total) and uses Grouped‑Query Attention instead of MLA. It also interleaves dense and MoE blocks rather than using MoE in nearly every layer, offering a different trade‑off between capacity and inference cost.
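The interleaving itself is simple to express. The sketch below alternates dense and MoE feed‑forward blocks every other layer; the exact pattern and the block constructors (make_dense_ffn, make_moe_ffn) are placeholders rather than Llama 4's actual configuration.

import torch.nn as nn

def build_ffn_layers(num_layers, make_dense_ffn, make_moe_ffn):
    # Illustrative only: every other layer gets a MoE feed-forward block, the rest stay dense.
    return nn.ModuleList(
        make_moe_ffn() if layer_idx % 2 == 1 else make_dense_ffn()
        for layer_idx in range(num_layers)
    )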
Qwen 3
Qwen 3 offers both dense and MoE variants.
Dense models
The 0.6 B dense model is extremely lightweight yet competitive with Llama 3.2 1 B, using a deeper but narrower architecture.
MoE models
The 235 B MoE model (Qwen3-235B-A22B) activates about 22 B parameters per token during inference. Unlike earlier Qwen MoE models, it drops the shared expert, likely because routing to eight active experts per token no longer requires one for stability.
SmolLM 3
SmolLM 3 (3 B) achieves strong performance for its roughly 3 B parameter size class and is notable for using NoPE (no positional encoding) in every fourth layer. In those layers, the causal mask alone imposes token order (each token can attend only to earlier tokens), and the model learns positional cues implicitly. The NoPE research reported better length generalization, although those results come from smaller GPT‑style models.
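A sketch of how such a layout might be wired up is shown below; make_block and the exact "every fourth layer" indexing are illustrative, not SmolLM 3's actual code.

import torch.nn as nn

class DecoderStack(nn.Module):
    # Illustrative only: skip RoPE in every fourth layer (NoPE); in those layers the causal
    # mask alone conveys token order.
    def __init__(self, num_layers, make_block):
        super().__init__()
        self.blocks = nn.ModuleList(
            make_block(use_rope=(layer_idx + 1) % 4 != 0)  # layers 4, 8, 12, ... use NoPE
            for layer_idx in range(num_layers)
        )

    def forward(self, x, mask):
        for block in self.blocks:
            x = block(x, mask)
        return x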
Kimi 2
Kimi 2 (1 T parameters) matches top‑tier proprietary models and uses the Muon optimizer instead of AdamW, which yields a notably smooth training loss curve. Architecturally it largely mirrors DeepSeek V3, but with more MoE experts per layer and fewer MLA heads, and it retains the shared expert.
Overall trends (2025)
The landscape in 2025 shows a convergence toward efficient attention variants (MLA, GQA, sliding‑window), strategic placement of normalization (RMSNorm, QK‑Norm), and widespread adoption of Mixture‑of‑Experts to balance capacity and inference cost.