What’s Driving the Latest LLM Architecture Trends? DeepSeek, OLMo, Gemma, and More Explained

This article examines the evolution of large language model architectures over the past seven years, comparing key design choices such as Multi‑Head Latent Attention, Grouped‑Query Attention, Mixture‑of‑Experts, sliding‑window attention, normalization placement, and optimizer variants across models like DeepSeek V3, OLMo 2, Gemma 3, Llama 4, Qwen 3, SmolLM 3, and Kimi 2.

AI Frontier Lectures

The past seven years have seen the core GPT architecture evolve from the original GPT to GPT‑2, GPT‑3, and now to models such as DeepSeek‑V3, OLMo 2, Gemma 3, Llama 4, Qwen 3, SmolLM 3, and Kimi 2. While the transformer backbone remains, many subtle design changes affect efficiency, scalability, and performance.

DeepSeek V3 / R1

DeepSeek V3 introduced two major architectural innovations that improve inference efficiency: Multi‑Head Latent Attention (MLA) and Mixture‑of‑Experts (MoE).

Multi‑Head Latent Attention (MLA)

MLA builds on Grouped‑Query Attention (GQA), which shares key/value pairs across multiple heads to reduce memory usage. Unlike GQA, MLA first compresses keys and values to a lower dimension before caching them and projects them back during inference. This reduces the KV‑cache size with only a small matrix‑multiply overhead.
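As a rough illustration, the core idea can be sketched as a down‑projection whose output is cached, with up‑projections restoring keys and values on demand. This is a simplified sketch with hypothetical dimensions; DeepSeek's decoupled RoPE path and query compression are omitted.

```python
import torch
import torch.nn as nn

class LatentKVCompression(nn.Module):
    """Sketch of MLA-style KV compression; all dimensions here are made up,
    and the decoupled RoPE handling used by DeepSeek is omitted."""
    def __init__(self, d_model=1024, d_latent=128, num_heads=8, head_dim=64):
        super().__init__()
        self.W_down = nn.Linear(d_model, d_latent, bias=False)   # output is cached
        self.W_up_k = nn.Linear(d_latent, num_heads * head_dim, bias=False)
        self.W_up_v = nn.Linear(d_latent, num_heads * head_dim, bias=False)

    def forward(self, x):
        latent = self.W_down(x)      # (batch, seq, d_latent): what the KV cache stores
        keys = self.W_up_k(latent)   # restored to full per-head width at attention time
        values = self.W_up_v(latent)
        return latent, keys, values

x = torch.randn(1, 16, 1024)
latent, keys, values = LatentKVCompression()(x)
```

With these toy sizes, the cache holds 128 values per token instead of the 2 × 8 × 64 = 1024 that uncompressed per‑head keys and values would require, at the cost of the extra up‑projection matrix multiplies.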

MLA vs MHA diagram

Mixture‑of‑Experts (MoE)

MoE replaces each feed‑forward block with many expert sub‑layers, activating only a small subset per token. DeepSeek‑V3 contains 256 routed experts per layer plus one shared expert (671 B parameters in total) but activates only nine experts per token (eight routed plus the shared expert, roughly 37 B parameters), keeping inference cost low while providing large model capacity.
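A toy version of this routing, with made‑up sizes and a naive per‑token loop for clarity (real implementations batch tokens by expert), might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy MoE layer: each token is routed to top_k of num_experts expert MLPs,
    plus one always-active shared expert (DeepSeek-style). Sizes are made up."""
    def __init__(self, d_model=64, d_ff=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        make_expert = lambda: nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(make_expert() for _ in range(num_experts))
        self.shared_expert = make_expert()

    def forward(self, x):                        # x: (num_tokens, d_model)
        weights, idx = torch.topk(self.router(x), self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # mixing weights for chosen experts
        out = self.shared_expert(x)              # shared expert sees every token
        for t in range(x.shape[0]):              # naive loop, for clarity only
            for slot in range(self.top_k):
                expert = self.experts[idx[t, slot].item()]
                out[t] = out[t] + weights[t, slot] * expert(x[t])
        return out

x = torch.randn(4, 64)     # 4 tokens
y = SparseMoE()(x)         # only top_k + 1 expert MLPs run per token
```

DeepSeek‑V3 applies the same principle at far larger scale: the router selects 8 of 256 experts, and the shared expert always runs.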

MoE experiment results

Overall, MLA reduces KV‑cache usage and MoE supplies a large capacity with modest inference cost.

OLMo 2

OLMo 2, released by the Allen Institute for AI, focuses on transparent training data and code. Its notable architectural choices are the placement of RMSNorm after attention and feed‑forward modules (a Post‑Norm variant) and the introduction of QK‑Norm, an RMSNorm applied to queries and keys before RoPE.

Normalization placement

OLMo 2 uses RMSNorm after each sub‑layer (Post‑Norm) instead of the Pre‑Norm used by many GPT‑style models. This placement improves training stability when combined with QK‑Norm.

Normalization comparison

QK‑Norm

QK‑Norm normalizes the query and key vectors inside the attention block before RoPE is applied. The following code sketches a PyTorch implementation of Grouped‑Query Attention with optional QK‑Norm (using torch.nn.RMSNorm, available in PyTorch 2.4+; RoPE and the attention computation itself are omitted).

import torch
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    def __init__(self, d_in, num_heads, num_kv_groups, head_dim=None, qk_norm=False, dtype=None):
        super().__init__()
        assert num_heads % num_kv_groups == 0
        if head_dim is None:
            head_dim = d_in // num_heads
        self.num_heads = num_heads
        self.num_kv_groups = num_kv_groups
        self.head_dim = head_dim
        # Queries get one projection per head; keys/values only one per KV group
        self.W_query = nn.Linear(d_in, num_heads * head_dim, bias=False, dtype=dtype)
        self.W_key = nn.Linear(d_in, num_kv_groups * head_dim, bias=False, dtype=dtype)
        self.W_value = nn.Linear(d_in, num_kv_groups * head_dim, bias=False, dtype=dtype)
        if qk_norm:
            self.q_norm = nn.RMSNorm(head_dim, eps=1e-6)
            self.k_norm = nn.RMSNorm(head_dim, eps=1e-6)
        else:
            self.q_norm = self.k_norm = None

    def forward(self, x, mask, cos, sin):
        b, seq_len, _ = x.shape
        # Split projections into per-head vectors: (b, heads, seq, head_dim)
        queries = self.W_query(x).view(b, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        keys = self.W_key(x).view(b, seq_len, self.num_kv_groups, self.head_dim).transpose(1, 2)
        values = self.W_value(x).view(b, seq_len, self.num_kv_groups, self.head_dim).transpose(1, 2)
        # QK-Norm: RMSNorm over each head's query/key vector, applied before RoPE
        if self.q_norm is not None:
            queries = self.q_norm(queries)
        if self.k_norm is not None:
            keys = self.k_norm(keys)
        # Apply RoPE and attention computation ...

QK‑Norm together with Post‑Norm improves training stability.

Gemma 3

Google’s Gemma 3 focuses on efficiency through sliding‑window attention. This local attention limits each token’s context to a moving window, dramatically reducing KV‑cache memory while preserving most modeling performance.

Sliding‑window attention reduces KV‑cache usage

Gemma 3 uses a 5:1 ratio of local to global attention layers and a window size of 1024 tokens (down from 4096 in Gemma 2). Ablation studies show negligible impact on perplexity.
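The effect of the window is easiest to see in the attention mask itself. A small sketch of a causal sliding‑window mask:

```python
import torch

def sliding_window_causal_mask(seq_len, window):
    """Boolean mask where True marks key positions a query may attend to:
    the token itself plus at most window - 1 preceding tokens."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (rows)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (columns)
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(6, window=3)
# Row 5 attends only to keys 3, 4, 5, so older KV-cache entries can be dropped
```

Because every token attends to at most `window` keys, the KV cache for local layers only needs to retain the most recent `window` entries, regardless of sequence length.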

Normalization layout

Gemma 3 places RMSNorm both before and after the Grouped‑Query Attention block, effectively combining the benefits of Pre‑Norm and Post‑Norm without significant overhead.

Architecture comparison of OLMo 2 and Gemma 3

Gemma 3n (small‑device variant)

Gemma 3n introduces per‑layer embedding (PLE) to keep the core transformer on GPU while loading embeddings from CPU/SSD on demand, saving memory.

PLE memory savings

Mistral Small 3.1

Mistral Small 3.1 (24 B) outperforms Gemma 3 27 B on most benchmarks while being faster, thanks to a custom tokenizer, fewer layers, and a reduced KV‑cache. It uses standard Grouped‑Query Attention without sliding‑window attention.

Architecture comparison of Gemma 3 and Mistral 3.1 Small

Llama 4

Llama 4 Maverick adopts a MoE design similar to DeepSeek V3, but with fewer total parameters (about 400 B) and only about 17 B activated per token, and it uses Grouped‑Query Attention instead of MLA. The model alternates dense and MoE blocks, offering a different trade‑off between capacity and inference cost.

DeepSeek V3 vs Llama 4 architecture

Qwen 3

Qwen 3 offers both dense and MoE variants.

Dense models

The 0.6 B dense model is extremely lightweight and competitive with Llama 3 1 B, using a deeper but narrower architecture.

Qwen 3 0.6 B vs Llama 3 1 B architecture

MoE models

The 235 B MoE model (A22B) activates only about 22 B parameters per token during inference, routing each token to 8 of its 128 experts. Unlike earlier Qwen MoE models, it drops the shared expert, likely because training remained stable without it at this scale.

DeepSeek V3 vs Qwen 3 235 B‑A22B architecture

SmolLM 3

SmolLM 3 (3 B) achieves strong performance among models in the 3 B parameter range and is notable for using NoPE (no positional encoding) in every fourth layer. The causal mask alone conveys left‑to‑right order, allowing the model to learn positional cues implicitly. Research on NoPE reports better length generalization, although those results come from smaller GPT‑style models.
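A layer schedule like this is straightforward to express; the following is a hypothetical sketch of the "every fourth layer" pattern, not SmolLM 3's actual configuration code:

```python
# Hypothetical schedule: apply RoPE in three of every four layers,
# and no positional encoding (NoPE) in the fourth
num_layers = 12
use_rope = [(i + 1) % 4 != 0 for i in range(num_layers)]
# Layers 3, 7, and 11 (0-indexed) rely only on the causal mask for ordering
```

At inference time, each transformer block would then consult its entry in `use_rope` to decide whether to rotate queries and keys before attention.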

Qwen 3 4 B vs SmolLM 3 3 B architecture
NoPE improves length generalization

Kimi 2

Kimi 2 (1 T parameters) matches top‑tier proprietary models and uses the Muon optimizer instead of AdamW, yielding a very smooth loss curve. Architecturally it mirrors DeepSeek V3 with a larger MoE (more experts, fewer MLA heads) and retains the shared expert.

DeepSeek V3 vs Kimi 2 architecture

Overall trends (2025)

The landscape in 2025 shows a convergence toward efficient attention variants (MLA, GQA, sliding‑window), strategic placement of normalization (RMSNorm, QK‑Norm), and widespread adoption of Mixture‑of‑Experts to balance capacity and inference cost.

Tags: large language models, Mixture of Experts, AI research, attention mechanisms, model architecture, LLM comparison
Written by AI Frontier Lectures, a leading AI knowledge platform.