What Sets the Latest LLMs Apart? A Deep Dive into V3, OLMo, Gemma, Mistral, Llama 4 and More

This article systematically compares the architectures of recent large language models—including DeepSeek V3/R1, OLMo 2, Gemma 3, Mistral Small 3.1, Llama 4, Qwen 3, SmolLM 3 and Kimi 2—highlighting innovations such as MLA, MoE, post‑norm, sliding‑window attention, NoPE and optimizer choices, with diagrams and code examples to illustrate their impact on efficiency and performance.

Data Party THU

Seven years after the original GPT architecture, models from GPT‑2 (2019) through DeepSeek‑V3 and Llama 4 (2024–2025) still share a remarkably similar core structure, differing mainly in incremental yet impactful refinements.

1. DeepSeek V3/R1

1.1 Multi‑Head Latent Attention (MLA)

MLA compresses the key and value tensors into a low‑dimensional latent before storing them in the KV cache, reducing memory usage compared with standard multi‑head attention (MHA). At attention time the cached latent is projected back up to full size, adding a matrix multiplication but achieving significant memory savings.
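
The sketch below shows the core mechanism under simplified assumptions: one shared down‑projection produces the latent that is cached, and up‑projections recover per‑head keys and values at attention time. The class and layer names are hypothetical, and DeepSeek's actual MLA additionally routes RoPE through a separate decoupled path, which is omitted here along with the causal mask.

import torch
import torch.nn as nn

class MultiHeadLatentAttention(nn.Module):
    # Minimal MLA sketch: cache one small latent tensor instead of
    # full-size keys and values. Hypothetical simplification, not
    # DeepSeek's actual implementation (RoPE path and mask omitted).
    def __init__(self, d_model, num_heads, d_latent):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        # Down-project hidden states into the latent -- this is what gets cached
        self.W_down_kv = nn.Linear(d_model, d_latent, bias=False)
        # Up-project the cached latent back to full-size keys and values
        self.W_up_k = nn.Linear(d_latent, d_model, bias=False)
        self.W_up_v = nn.Linear(d_latent, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        latent = self.W_down_kv(x)                    # (b, t, d_latent)
        if latent_cache is not None:                  # grow the compressed cache
            latent = torch.cat([latent_cache, latent], dim=1)
        s = latent.shape[1]                           # total cached positions
        q = self.W_q(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_up_k(latent).view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_up_v(latent).view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(2, 3) / self.head_dim ** 0.5
        ctx = (torch.softmax(scores, dim=-1) @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(ctx), latent             # latent is the new cache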

MLA vs MHA compression diagram

1.2 Mixture‑of‑Experts (MoE)

MoE replaces the single feed‑forward module with many expert feed‑forward layers; a router activates only a small subset of experts per token, so the model gains capacity without a proportional increase in per‑token inference compute. For example, DeepSeek V3 contains 256 experts but activates just nine per token (one shared and eight router‑selected) during inference.
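
To make the routing concrete, here is a toy sketch under simplified assumptions: softmax gating, top‑k selection, one always‑active shared expert, and no load balancing or capacity limits. The SparseMoE name and structure are illustrative, not DeepSeek's implementation.

import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    # Toy MoE: a router picks top-k experts per token, plus one shared
    # expert that is always active (as in DeepSeek V3). Illustrative only.
    def __init__(self, d_model, d_hidden, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        make_expert = lambda: nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model))
        self.experts = nn.ModuleList(make_expert() for _ in range(num_experts))
        self.shared_expert = make_expert()

    def forward(self, x):                              # x: (batch, tokens, d_model)
        gate = torch.softmax(self.router(x), dim=-1)
        weights, idx = gate.topk(self.top_k, dim=-1)   # route each token to k experts
        out = self.shared_expert(x)                    # shared expert always runs
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)              # tokens whose k-th pick is e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out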

MoE vs standard feed‑forward comparison

2. OLMo 2

2.1 Placement of Normalization Layers

OLMo 2 adopts a post‑normalization (Post‑Norm) strategy, contrasting with the pre‑normalization (Pre‑Norm) used by most LLMs. It places RMSNorm after the attention and feed‑forward modules, while keeping the norms inside the residual connections (unlike the original Transformer's Post‑Norm), improving training stability, especially without an elaborate learning‑rate warm‑up.
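
A minimal sketch of the two placements, assuming PyTorch >= 2.4 for nn.RMSNorm; the class names and the generic `sublayer` argument are illustrative.

import torch.nn as nn

class PreNormBlock(nn.Module):
    # Pre-Norm: normalize *before* the sublayer (GPT-2 style, most LLMs).
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm, self.sublayer = nn.RMSNorm(d_model), sublayer
    def forward(self, x):
        return x + self.sublayer(self.norm(x))

class PostNormBlock(nn.Module):
    # OLMo 2 style Post-Norm: normalize the sublayer *output*, but keep
    # the norm inside the residual connection.
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm, self.sublayer = nn.RMSNorm(d_model), sublayer
    def forward(self, x):
        return x + self.norm(self.sublayer(x))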

Post‑Norm vs Pre‑Norm comparison

2.2 QK‑Norm

QK‑Norm adds an extra RMSNorm to the query and key tensors before RoPE is applied, taming the magnitude of attention logits and reducing numerical instability during training (see the grouped query attention code example at the end of this article).

3. Gemma 3

3.1 Sliding‑Window Attention

Sliding‑window attention limits each query’s context to a local window around its position, drastically cutting KV‑cache memory while barely affecting modeling performance. Gemma 3 shrinks the window from 4096 tokens (Gemma 2) to 1024 and shifts the layer mix from a 1:1 alternation of global and local attention to five local layers per global layer.
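
A minimal sketch of the attention mask this implies, where True marks a blocked query–key pair; the function name is illustrative.

import torch

def sliding_window_causal_mask(num_tokens, window_size):
    # Causal attention restricted to a local window, as in Gemma 3's
    # local layers. Each query sees at most `window_size` of the most
    # recent tokens (itself included).
    i = torch.arange(num_tokens).unsqueeze(1)   # query positions
    j = torch.arange(num_tokens).unsqueeze(0)   # key positions
    causal = j > i                              # no attending to the future
    too_far = (i - j) >= window_size            # no attending beyond the window
    return causal | too_far

print(sliding_window_causal_mask(6, 3).int())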

KV‑cache memory savings with sliding‑window attention

3.2 Normalization Placement

Gemma 3 inserts RMSNorm layers both before and after the attention and feed‑forward modules, combining the stability benefits of post‑norm with the efficiency of pre‑norm.
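
A sketch of that placement under the same assumption as the earlier normalization example (PyTorch >= 2.4 for nn.RMSNorm); the class name is illustrative.

import torch.nn as nn

class GemmaStyleBlock(nn.Module):
    # Gemma 3 style placement: RMSNorm both before and after each
    # sublayer, with both norms inside the residual connection.
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.pre_norm = nn.RMSNorm(d_model)
        self.post_norm = nn.RMSNorm(d_model)
        self.sublayer = sublayer
    def forward(self, x):
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))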

OLMo 2 vs Gemma 3 architecture comparison

4. Mistral Small 3.1

Mistral Small 3.1 improves inference speed by using a custom tokenizer, shrinking the KV cache, and reducing the number of layers; notably, it drops the sliding‑window attention of earlier Mistral models in favor of regular attention, which allows it to use optimized inference kernels such as FlashAttention, resulting in lower latency while maintaining high performance.

Mistral Small 3.1 performance comparison

5. Llama 4

Llama 4 follows a DeepSeek‑V3‑like backbone but differs in the details: it employs grouped query attention (GQA) instead of MLA, uses fewer but larger experts in its MoE modules, and alternates MoE and dense feed‑forward modules across Transformer layers rather than using MoE everywhere.
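
As a hypothetical sketch of that alternation (the function and its parameters are illustrative, not Llama 4's actual code), one could interleave dense and MoE feed‑forward modules like this, passing for instance the toy SparseMoE from the earlier MoE example as `make_moe`:

import torch.nn as nn

def build_ffn_stack(d_model, d_hidden, num_layers, make_moe):
    # Every other Transformer block gets a sparse MoE feed-forward module
    # (built by the caller-supplied `make_moe`), the rest get a dense one.
    make_dense = lambda: nn.Sequential(
        nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model))
    return nn.ModuleList(
        make_moe() if i % 2 == 1 else make_dense() for i in range(num_layers))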

DeepSeek V3 vs Llama 4 architecture comparison

6. Qwen 3

6.1 Dense Model

The dense variant of Qwen 3 makes the Transformer deeper (more blocks), where Llama 3 makes it wider (more attention heads). The deeper, narrower Qwen 3 therefore uses less memory but generates tokens somewhat more slowly.

Qwen 3 dense vs Llama 3 architecture

6.2 MoE Model

Qwen 3’s MoE architecture mirrors DeepSeek V3’s design but omits the shared expert. As in other MoE models, the large expert pool increases capacity during training while only a few experts are active per token, keeping inference efficient.

DeepSeek‑V3 vs Qwen 3 MoE comparison

7. SmolLM 3

7.1 No Position Embedding (NoPE)

NoPE removes explicit positional information entirely (no absolute and no rotary position embeddings), relying solely on the causal attention mask to preserve autoregressive order. This implicit positional learning improves length generalization; SmolLM 3 applies NoPE in every fourth layer rather than throughout, letting it handle longer sequences with minimal performance loss.
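
A single‑head, unbatched sketch makes the point: with no positional signal added anywhere, the causal mask is the only thing that distinguishes token order. The function name and weight arguments are illustrative.

import torch

def causal_attention_no_pos(x, W_q, W_k, W_v):
    # NoPE: no absolute or rotary position information is added;
    # token order is implied only by the causal mask.
    t = x.shape[0]
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / k.shape[-1] ** 0.5
    mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # block future positions
    return torch.softmax(scores, dim=-1) @ v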

Absolute position embedding example

8. Kimi 2

Kimi 2 extends the DeepSeek V3 architecture, swaps the AdamW optimizer for the Muon optimizer, increases the number of experts in its MoE module, and reduces the number of heads in its MLA module, resulting in smoother and faster‑decreasing training loss curves.
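
Muon's distinguishing step is orthogonalizing each two‑dimensional weight update with a Newton–Schulz iteration. The sketch below uses the classic cubic variant for clarity; real Muon implementations use a tuned quintic polynomial, and Kimi 2's production setup may differ further.

import torch

def orthogonalize(update, steps=5):
    # Classic cubic Newton-Schulz iteration: drives a (possibly rectangular)
    # matrix toward the nearest orthogonal matrix / partial isometry.
    # Frobenius-norm scaling keeps singular values in the convergent range.
    X = update / (update.norm() + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X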

DeepSeek V3 vs Kimi 2 architecture comparison

Code Example: Grouped Query Attention with Optional QK‑Norm

import torch
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    # Assumes external helpers as in the original snippet: RMSNorm
    # (normalizing over the last axis) and apply_rope (rotary embeddings).
    def __init__(self, d_in, num_heads, num_kv_groups,
                 head_dim=None, qk_norm=False, dtype=None):
        super().__init__()
        assert num_heads % num_kv_groups == 0, "heads must divide evenly into KV groups"
        self.num_heads = num_heads
        self.num_kv_groups = num_kv_groups
        self.group_size = num_heads // num_kv_groups
        self.head_dim = head_dim if head_dim is not None else d_in // num_heads
        self.W_query = nn.Linear(d_in, num_heads * self.head_dim, bias=False, dtype=dtype)
        self.W_key = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.W_value = nn.Linear(d_in, num_kv_groups * self.head_dim, bias=False, dtype=dtype)
        self.out_proj = nn.Linear(num_heads * self.head_dim, d_in, bias=False, dtype=dtype)
        if qk_norm:
            self.q_norm = RMSNorm(self.head_dim, eps=1e-6)
            self.k_norm = RMSNorm(self.head_dim, eps=1e-6)
        else:
            self.q_norm = self.k_norm = None

    def forward(self, x, mask, cos, sin):
        b, num_tokens, _ = x.shape
        # Project, then split into heads: (b, heads, tokens, head_dim)
        queries = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        keys = self.W_key(x).view(b, num_tokens, self.num_kv_groups, self.head_dim).transpose(1, 2)
        values = self.W_value(x).view(b, num_tokens, self.num_kv_groups, self.head_dim).transpose(1, 2)
        # Optional QK-Norm: normalize each head's queries/keys before RoPE
        if self.q_norm:
            queries = self.q_norm(queries)
        if self.k_norm:
            keys = self.k_norm(keys)
        # Apply RoPE
        queries = apply_rope(queries, cos, sin)
        keys = apply_rope(keys, cos, sin)
        # Expand K and V so every query head has a matching KV head
        keys = keys.repeat_interleave(self.group_size, dim=1)
        values = values.repeat_interleave(self.group_size, dim=1)
        # Scaled dot-product attention with causal mask (True = blocked)
        attn_scores = queries @ keys.transpose(2, 3)
        attn_scores = attn_scores.masked_fill(mask, -torch.inf)
        attn_weights = torch.softmax(attn_scores / self.head_dim ** 0.5, dim=-1)
        context = (attn_weights @ values).transpose(1, 2).reshape(b, num_tokens, -1)
        return self.out_proj(context)
Written by Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
