What Sets the Latest LLMs Apart? A Deep Dive into DeepSeek V3, OLMo, Gemma, Mistral, Llama 4 and More
This article systematically compares the architectures of recent large language models, including DeepSeek V3/R1, OLMo 2, Gemma 3, Mistral Small 3.1, Llama 4, Qwen 3, SmolLM 3, and Kimi K2, highlighting innovations such as MLA, MoE, post-norm placement, sliding-window attention, NoPE, and optimizer choices, with diagrams and code examples illustrating their impact on efficiency and performance.
Seven years after the original GPT architecture, models from GPT-2 up through DeepSeek-V3 and Llama 4 still share essentially the same core structure; what separates them are incremental yet impactful refinements.
1. DeepSeek V3/R1
1.1 Multi‑Head Latent Attention (MLA)
MLA compresses the key and value tensors into a lower-dimensional latent space before storing them in the KV cache, reducing memory usage compared with standard multi-head attention (MHA). At inference time the compressed tensors are projected back up to their original size, trading an extra matrix multiplication for substantial memory savings.
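As a rough illustration of this trade-off, here is a minimal sketch of the compress-then-re-project idea; the class name and the d_model/d_latent sizes are assumptions for illustration, not DeepSeek's actual implementation (which also interacts with RoPE in a more involved way):

import torch
import torch.nn as nn

class LatentKVCompression(nn.Module):
    # Compress hidden states into a small latent before caching; re-project to
    # full-size keys/values when attention is computed.
    def __init__(self, d_model=4096, d_latent=512):
        super().__init__()
        self.down_proj = nn.Linear(d_model, d_latent, bias=False)
        self.up_proj_k = nn.Linear(d_latent, d_model, bias=False)
        self.up_proj_v = nn.Linear(d_latent, d_model, bias=False)

    def compress(self, x):
        # This (batch, tokens, d_latent) tensor is what the KV cache stores,
        # instead of full-size keys and values.
        return self.down_proj(x)

    def expand(self, latent):
        # The extra matrix multiplications mentioned above, paid at inference
        # time in exchange for a much smaller cache.
        return self.up_proj_k(latent), self.up_proj_v(latent)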
1.2 Mixture‑of‑Experts (MoE)
MoE replaces the single feed-forward module in each Transformer block with many parallel expert feed-forward modules; a router activates only a small subset of them for each token. DeepSeek V3, for example, has 256 routed experts plus one shared expert per MoE layer, yet only nine experts are active per token (the shared expert plus eight chosen by the router), so its inference cost is far lower than its total parameter count would suggest.
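The sketch below shows the routing mechanism in a deliberately naive form: one shared expert that always runs, plus a router that picks the top-k experts per token. Expert counts, hidden sizes, and the exact routing/weighting scheme are illustrative assumptions, not DeepSeek V3's configuration:

import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        make_ff = lambda: nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )
        self.experts = nn.ModuleList([make_ff() for _ in range(num_experts)])
        self.shared_expert = make_ff()

    def forward(self, x):  # x: (batch, tokens, d_model)
        probs = torch.softmax(self.router(x), dim=-1)           # routing probabilities
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)   # k experts per token
        out = self.shared_expert(x)                              # shared expert sees every token
        # Naive dispatch: every expert processes all tokens, then results are masked.
        # Real implementations only run each expert on its routed tokens.
        for e, expert in enumerate(self.experts):
            weight = (topk_probs * (topk_idx == e)).sum(dim=-1, keepdim=True)
            out = out + weight * expert(x)
        return out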
2. OLMo 2
2.1 Placement of Normalization Layers
OLMo 2 adopts a post-normalization (Post-Norm) placement, in contrast to the pre-normalization (Pre-Norm) used by most current LLMs: RMSNorm is applied after the attention and feed-forward modules rather than before them, while still sitting inside the residual connections. The OLMo 2 team found that this placement (together with QK-Norm, below) improves training stability.
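The difference is easiest to see as a sketch of a single Transformer block; attn, ff, and the norm layers below are placeholders, and attention-mask and RoPE arguments are omitted for brevity:

def pre_norm_block(x, attn, ff, norm1, norm2):
    # Normalize the *input* of each module (GPT-2, Llama 3, and most others)
    x = x + attn(norm1(x))
    x = x + ff(norm2(x))
    return x

def post_norm_block(x, attn, ff, norm1, norm2):
    # Normalize the *output* of each module, still inside the residual (OLMo 2 style)
    x = x + norm1(attn(x))
    x = x + norm2(ff(x))
    return x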
2.2 QK‑Norm
QK‑Norm adds an extra RMSNorm layer to the query and key tensors before applying RoPE, reducing numerical instability during training.
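For reference, a minimal RMSNorm implementation of the kind applied per attention head to the query and key tensors might look like the sketch below; the grouped-query attention example at the end of this article assumes such a module:

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Root-mean-square normalization over the last dimension (here, head_dim),
    # applied to queries and keys before RoPE when QK-Norm is enabled.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.scale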
3. Gemma 3
3.1 Sliding‑Window Attention
Sliding-window attention restricts each query to a local window of nearby tokens instead of the full context, drastically cutting KV-cache memory while largely preserving modeling performance. Gemma 3 shrinks the window from 4096 tokens (Gemma 2) to 1024 and shifts the layer mix from alternating global/local attention (1:1 in Gemma 2) to one global layer for every five local layers.
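A sliding-window causal mask can be built as in the sketch below; the window size is illustrative, and True marks positions a query is not allowed to attend to, matching the masked_fill convention used in the attention code at the end of this article:

import torch

def sliding_window_causal_mask(num_tokens, window_size=1024):
    pos = torch.arange(num_tokens)
    dist = pos.unsqueeze(1) - pos.unsqueeze(0)  # query index minus key index
    future = dist < 0                           # block attention to future tokens
    too_far = dist >= window_size               # block tokens outside the local window
    return future | too_far

# Example: with window_size=4, query position 10 attends only to positions 7..10.
mask = sliding_window_causal_mask(16, window_size=4)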
3.2 Normalization Placement
Gemma 3 inserts RMSNorm layers both before and after its attention and feed-forward modules. Because RMSNorm is computationally cheap, this Pre-Norm-plus-Post-Norm "sandwich" adds little overhead while capturing the stability benefits of both placements.
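A sketch of this placement for a single block follows; all names are placeholders, and mask/RoPE arguments are again omitted:

def gemma3_style_block(x, attn, ff, pre_attn_norm, post_attn_norm, pre_ff_norm, post_ff_norm):
    # Normalize both the input and the output of each module, still inside the
    # residual branch.
    x = x + post_attn_norm(attn(pre_attn_norm(x)))
    x = x + post_ff_norm(ff(pre_ff_norm(x)))
    return x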
4. Mistral Small 3.1
Mistral Small 3.1 improves inference latency by using a custom tokenizer, shrinking the KV cache, and reducing the number of layers. Notably, it drops the sliding-window attention used in earlier Mistral models; with regular attention it can rely on highly optimized inference kernels such as FlashAttention, while still maintaining strong performance.
5. Llama 4
Llama 4 follows a DeepSeek-V3-like MoE backbone but differs in the details: it uses grouped-query attention (GQA) instead of MLA, its MoE modules use fewer but larger experts, and it places MoE layers only in every other Transformer block, keeping a dense feed-forward module in the rest.
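One way to picture the alternating placement is the hypothetical helper below (illustrative only, not Llama 4 code); DenseFF and SparseMoE stand in for whatever dense and MoE feed-forward modules the model uses:

import torch.nn as nn

def build_ff_layers(num_blocks, make_dense_ff, make_moe_ff):
    # MoE feed-forward in every other block, dense feed-forward in the rest.
    return nn.ModuleList(
        [make_moe_ff() if i % 2 == 1 else make_dense_ff() for i in range(num_blocks)]
    )

# Example (placeholders): build_ff_layers(48, lambda: DenseFF(), lambda: SparseMoE())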
6. Qwen 3
6.1 Dense Model
The small dense Qwen 3 models are deeper than comparable Llama 3 models (more Transformer blocks) but narrower (fewer attention heads and a smaller hidden dimension). As a result, Qwen 3 has a smaller memory footprint but generates tokens more slowly, since the deeper stack serializes more computation.
6.2 MoE Model
Qwen 3's MoE variants follow a design similar to DeepSeek V3 but omit the shared expert. As with other MoE models, the large total parameter count lets the model absorb more knowledge during training, while only a handful of experts are active per token, keeping inference efficient.
7. SmolLM 3
7.1 No Position Embedding (NoPE)
NoPE drops explicit positional information entirely: no absolute or rotary position embeddings are added, and the model relies only on the causal attention mask to preserve autoregressive order. This implicit handling of position tends to generalize better to longer sequences; SmolLM 3 applies NoPE in only a subset of its layers (keeping RoPE in the rest), which helps it handle longer inputs with minimal performance loss.
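The sketch below shows a NoPE attention step: no RoPE or absolute position embeddings are applied to the queries and keys, and ordering information enters only through the causal mask (tensor shapes and the helper name are illustrative):

import torch

def nope_attention(queries, keys, values):
    # queries/keys/values: (batch, heads, tokens, head_dim); note that no RoPE
    # is applied before computing the attention scores.
    num_tokens = queries.shape[2]
    scores = queries @ keys.transpose(2, 3) / queries.shape[-1] ** 0.5
    causal = torch.triu(
        torch.ones(num_tokens, num_tokens, dtype=torch.bool, device=queries.device),
        diagonal=1,
    )
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ values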
8. Kimi K2
Kimi K2 extends the DeepSeek V3 architecture: it swaps the AdamW optimizer for the Muon optimizer, increases the number of experts in its MoE module, and reduces the number of heads in its MLA module. The Muon optimizer in particular yielded notably smooth, fast-decreasing training loss curves.
Code Example: Grouped Query Attention with Optional QK‑Norm
import torch
import torch.nn as nn

# Note: `RMSNorm` (see the sketch in the QK-Norm section above) and a standard
# `apply_rope` helper are assumed to be defined elsewhere.
class GroupedQueryAttention(nn.Module):
    def __init__(self, d_in, num_heads, num_kv_groups,
                 head_dim=None, qk_norm=False, dtype=None):
        super().__init__()
        assert num_heads % num_kv_groups == 0, "num_heads must be divisible by num_kv_groups"
        self.num_heads = num_heads
        self.num_kv_groups = num_kv_groups
        self.group_size = num_heads // num_kv_groups
        self.head_dim = head_dim if head_dim is not None else d_in // num_heads
        d_out = num_heads * self.head_dim
        # One query projection per head; keys/values are shared within each KV group
        self.W_query = nn.Linear(d_in, d_out, bias=False, dtype=dtype)
        self.W_key = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.W_value = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.out_proj = nn.Linear(d_out, d_out, bias=False, dtype=dtype)
        if qk_norm:
            self.q_norm = RMSNorm(self.head_dim, eps=1e-6)
            self.k_norm = RMSNorm(self.head_dim, eps=1e-6)
        else:
            self.q_norm = self.k_norm = None

    def forward(self, x, mask, cos, sin):
        b, num_tokens, _ = x.shape
        # Apply projections and split into heads: (b, heads, tokens, head_dim)
        queries = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        keys = self.W_key(x).view(b, num_tokens, self.num_kv_groups, self.head_dim).transpose(1, 2)
        values = self.W_value(x).view(b, num_tokens, self.num_kv_groups, self.head_dim).transpose(1, 2)
        # Optional QK-Norm: normalize each head over head_dim before RoPE
        if self.q_norm:
            queries = self.q_norm(queries)
        if self.k_norm:
            keys = self.k_norm(keys)
        # Apply RoPE
        queries = apply_rope(queries, cos, sin)
        keys = apply_rope(keys, cos, sin)
        # Expand K and V so each query head has a matching key/value head
        keys = keys.repeat_interleave(self.group_size, dim=1)
        values = values.repeat_interleave(self.group_size, dim=1)
        # Scaled dot-product attention with a (causal) mask, then output projection
        attn_scores = queries @ keys.transpose(2, 3)
        attn_scores = attn_scores.masked_fill(mask, -torch.inf)
        attn_weights = torch.softmax(attn_scores / self.head_dim ** 0.5, dim=-1)
        context = (attn_weights @ values).transpose(1, 2).reshape(b, num_tokens, -1)
        return self.out_proj(context)
