What Sets the Latest LLMs Apart? A Deep Dive into DeepSeek V3, OLMo, Gemma, Mistral, Llama 4 and More
This article systematically compares the architectures of recent large language models, including DeepSeek V3/R1, OLMo 2, Gemma 3, Mistral Small 3.1, Llama 4, Qwen 3, SmolLM 3, and Kimi K2, highlighting innovations such as MLA, MoE, post-norm placement, sliding-window attention, NoPE, and optimizer choices, with diagrams and code examples illustrating their impact on efficiency and performance.
Seven years after the original GPT architecture, models from GPT-2 up through DeepSeek-V3 and Llama 4 still share essentially the same core structure; what separates them are incremental yet impactful refinements.
1. DeepSeek V3/R1
1.1 Multi‑Head Latent Attention (MLA)
MLA compresses the key and value tensors into a lower-dimensional latent space before storing them in the KV cache, reducing memory usage compared with standard multi-head attention (MHA). At inference time the compressed tensors are projected back up to their original size, trading an extra matrix multiplication for substantial memory savings.
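As a rough illustration of this trade-off, here is a minimal sketch of the compress-then-re-project idea; the class name and the d_model/d_latent sizes are assumptions for illustration, not DeepSeek's actual implementation (which also interacts with RoPE in a more involved way):

import torch
import torch.nn as nn

class LatentKVCompression(nn.Module):
    # Compress hidden states into a small latent before caching; re-project to
    # full-size keys/values when attention is computed.
    def __init__(self, d_model=4096, d_latent=512):
        super().__init__()
        self.down_proj = nn.Linear(d_model, d_latent, bias=False)
        self.up_proj_k = nn.Linear(d_latent, d_model, bias=False)
        self.up_proj_v = nn.Linear(d_latent, d_model, bias=False)

    def compress(self, x):
        # This (batch, tokens, d_latent) tensor is what the KV cache stores,
        # instead of full-size keys and values.
        return self.down_proj(x)

    def expand(self, latent):
        # The extra matrix multiplications mentioned above, paid at inference
        # time in exchange for a much smaller cache.
        return self.up_proj_k(latent), self.up_proj_v(latent)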
1.2 Mixture‑of‑Experts (MoE)
MoE replaces the single feed-forward module in each Transformer block with many parallel expert feed-forward modules; a router activates only a small subset of them for each token. DeepSeek V3, for example, has 256 routed experts plus one shared expert per MoE layer, yet only nine experts are active per token (the shared expert plus eight chosen by the router), so its inference cost is far lower than its total parameter count would suggest.
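The sketch below shows the routing mechanism in a deliberately naive form: one shared expert that always runs, plus a router that picks the top-k experts per token. Expert counts, hidden sizes, and the exact routing/weighting scheme are illustrative assumptions, not DeepSeek V3's configuration:

import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        make_ff = lambda: nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )
        self.experts = nn.ModuleList([make_ff() for _ in range(num_experts)])
        self.shared_expert = make_ff()

    def forward(self, x):  # x: (batch, tokens, d_model)
        probs = torch.softmax(self.router(x), dim=-1)           # routing probabilities
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)   # k experts per token
        out = self.shared_expert(x)                              # shared expert sees every token
        # Naive dispatch: every expert processes all tokens, then results are masked.
        # Real implementations only run each expert on its routed tokens.
        for e, expert in enumerate(self.experts):
            weight = (topk_probs * (topk_idx == e)).sum(dim=-1, keepdim=True)
            out = out + weight * expert(x)
        return out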
2. OLMo 2
2.1 Placement of Normalization Layers
OLMo 2 adopts a post-normalization (Post-Norm) placement, in contrast to the pre-normalization (Pre-Norm) used by most current LLMs: RMSNorm is applied after the attention and feed-forward modules rather than before them, while still sitting inside the residual connections. The OLMo 2 team found that this placement (together with QK-Norm, below) improves training stability.
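The difference is easiest to see as a sketch of a single Transformer block; attn, ff, and the norm layers below are placeholders, and attention-mask and RoPE arguments are omitted for brevity:

def pre_norm_block(x, attn, ff, norm1, norm2):
    # Normalize the *input* of each module (GPT-2, Llama 3, and most others)
    x = x + attn(norm1(x))
    x = x + ff(norm2(x))
    return x

def post_norm_block(x, attn, ff, norm1, norm2):
    # Normalize the *output* of each module, still inside the residual (OLMo 2 style)
    x = x + norm1(attn(x))
    x = x + norm2(ff(x))
    return x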
2.2 QK‑Norm
QK‑Norm adds an extra RMSNorm layer to the query and key tensors before applying RoPE, reducing numerical instability during training.
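For reference, a minimal RMSNorm implementation of the kind applied per attention head to the query and key tensors might look like the sketch below; the grouped-query attention example at the end of this article assumes such a module:

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Root-mean-square normalization over the last dimension (here, head_dim),
    # applied to queries and keys before RoPE when QK-Norm is enabled.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.scale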
3. Gemma 3
3.1 Sliding‑Window Attention
Sliding-window attention restricts each query to a local window of nearby tokens instead of the full context, drastically cutting KV-cache memory while largely preserving modeling performance. Gemma 3 shrinks the window from 4096 tokens (Gemma 2) to 1024 and shifts the layer mix from alternating global/local attention (1:1 in Gemma 2) to one global layer for every five local layers.
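A sliding-window causal mask can be built as in the sketch below; the window size is illustrative, and True marks positions a query is not allowed to attend to, matching the masked_fill convention used in the attention code at the end of this article:

import torch

def sliding_window_causal_mask(num_tokens, window_size=1024):
    pos = torch.arange(num_tokens)
    dist = pos.unsqueeze(1) - pos.unsqueeze(0)  # query index minus key index
    future = dist < 0                           # block attention to future tokens
    too_far = dist >= window_size               # block tokens outside the local window
    return future | too_far

# Example: with window_size=4, query position 10 attends only to positions 7..10.
mask = sliding_window_causal_mask(16, window_size=4)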
3.2 Normalization Placement
Gemma 3 inserts RMSNorm layers both before and after its attention and feed-forward modules. Because RMSNorm is computationally cheap, this Pre-Norm-plus-Post-Norm "sandwich" adds little overhead while capturing the stability benefits of both placements.
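A sketch of this placement for a single block follows; all names are placeholders, and mask/RoPE arguments are again omitted:

def gemma3_style_block(x, attn, ff, pre_attn_norm, post_attn_norm, pre_ff_norm, post_ff_norm):
    # Normalize both the input and the output of each module, still inside the
    # residual branch.
    x = x + post_attn_norm(attn(pre_attn_norm(x)))
    x = x + post_ff_norm(ff(pre_ff_norm(x)))
    return x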
4. Mistral Small 3.1
Mistral Small 3.1 improves inference latency by using a custom tokenizer, shrinking the KV cache, and reducing the number of layers. Notably, it drops the sliding-window attention used in earlier Mistral models; with regular attention it can rely on highly optimized inference kernels such as FlashAttention, while still maintaining strong performance.
5. Llama 4
Llama 4 follows a DeepSeek-V3-like MoE backbone but differs in the details: it uses grouped-query attention (GQA) instead of MLA, its MoE modules use fewer but larger experts, and it places MoE layers only in every other Transformer block, keeping a dense feed-forward module in the rest.
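One way to picture the alternating placement is the hypothetical helper below (illustrative only, not Llama 4 code); DenseFF and SparseMoE stand in for whatever dense and MoE feed-forward modules the model uses:

import torch.nn as nn

def build_ff_layers(num_blocks, make_dense_ff, make_moe_ff):
    # MoE feed-forward in every other block, dense feed-forward in the rest.
    return nn.ModuleList(
        [make_moe_ff() if i % 2 == 1 else make_dense_ff() for i in range(num_blocks)]
    )

# Example (placeholders): build_ff_layers(48, lambda: DenseFF(), lambda: SparseMoE())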
6. Qwen 3
6.1 Dense Model
The small dense Qwen 3 models are deeper than comparable Llama 3 models (more Transformer blocks) but narrower (fewer attention heads and a smaller hidden dimension). As a result, Qwen 3 has a smaller memory footprint but generates tokens more slowly, since the deeper stack serializes more computation.
6.2 MoE Model
Qwen 3's MoE variants follow a design similar to DeepSeek V3 but omit the shared expert. As with other MoE models, the large total parameter count lets the model absorb more knowledge during training, while only a handful of experts are active per token, keeping inference efficient.
7. SmolLM 3
7.1 No Position Embedding (NoPE)
NoPE drops explicit positional information entirely: no absolute or rotary position embeddings are added, and the model relies only on the causal attention mask to preserve autoregressive order. This implicit handling of position tends to generalize better to longer sequences; SmolLM 3 applies NoPE in only a subset of its layers (keeping RoPE in the rest), which helps it handle longer inputs with minimal performance loss.
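The sketch below shows a NoPE attention step: no RoPE or absolute position embeddings are applied to the queries and keys, and ordering information enters only through the causal mask (tensor shapes and the helper name are illustrative):

import torch

def nope_attention(queries, keys, values):
    # queries/keys/values: (batch, heads, tokens, head_dim); note that no RoPE
    # is applied before computing the attention scores.
    num_tokens = queries.shape[2]
    scores = queries @ keys.transpose(2, 3) / queries.shape[-1] ** 0.5
    causal = torch.triu(
        torch.ones(num_tokens, num_tokens, dtype=torch.bool, device=queries.device),
        diagonal=1,
    )
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ values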
8. Kimi K2
Kimi K2 extends the DeepSeek V3 architecture: it swaps the AdamW optimizer for the Muon optimizer, increases the number of experts in its MoE module, and reduces the number of heads in its MLA module. The Muon optimizer in particular yielded notably smooth, fast-decreasing training loss curves.
Code Example: Grouped Query Attention with Optional QK‑Norm
import torch
import torch.nn as nn

# Note: `RMSNorm` (see the sketch in the QK-Norm section above) and a standard
# `apply_rope` helper are assumed to be defined elsewhere.
class GroupedQueryAttention(nn.Module):
    def __init__(self, d_in, num_heads, num_kv_groups,
                 head_dim=None, qk_norm=False, dtype=None):
        super().__init__()
        assert num_heads % num_kv_groups == 0, "num_heads must be divisible by num_kv_groups"
        self.num_heads = num_heads
        self.num_kv_groups = num_kv_groups
        self.group_size = num_heads // num_kv_groups
        self.head_dim = head_dim if head_dim is not None else d_in // num_heads
        d_out = num_heads * self.head_dim
        # One query projection per head; keys/values are shared within each KV group
        self.W_query = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        self.W_key = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.W_value = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.out_proj = nn.Linear(d_out, d_out, bias=False, dtype=dtype)
        if qk_norm:
            self.q_norm = RMSNorm(self.head_dim, eps=1e-6)
            self.k_norm = RMSNorm(self.head_dim, eps=1e-6)
        else:
            self.q_norm = self.k_norm = None

    def forward(self, x, mask, cos, sin):
        b, num_tokens, _ = x.shape
        # Apply projections and split into heads: (b, heads, tokens, head_dim)
        queries = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        keys = self.W_key(x).view(b, num_tokens, self.num_kv_groups, self.head_dim).transpose(1, 2)
        values = self.W_value(x).view(b, num_tokens, self.num_kv_groups, self.head_dim).transpose(1, 2)
        # Optional QK-Norm: normalize each head over head_dim before RoPE
        if self.q_norm:
            queries = self.q_norm(queries)
        if self.k_norm:
            keys = self.k_norm(keys)
        # Apply RoPE
        queries = apply_rope(queries, cos, sin)
        keys = apply_rope(keys, cos, sin)
        # Expand K and V so each query head has a matching key/value head
        keys = keys.repeat_interleave(self.group_size, dim=1)
        values = values.repeat_interleave(self.group_size, dim=1)
        # Scaled dot-product attention with a (causal) mask, then output projection
        attn_scores = queries @ keys.transpose(2, 3)
        attn_scores = attn_scores.masked_fill(mask, -torch.inf)
        attn_weights = torch.softmax(attn_scores / self.head_dim ** 0.5, dim=-1)
        context = (attn_weights @ values).transpose(1, 2).reshape(b, num_tokens, -1)
        return self.out_proj(context)
