Dissecting Gemma‑4’s Architecture and Training Choices: A Technical Comparison with Qwen‑3 and GLM‑5
This article breaks down every architectural and training decision behind Gemma‑4—KV sharing, p‑RoPE, per‑layer embeddings, and a dual‑path MoE + dense MLP—while contrasting its efficiency and performance with Qwen‑3 and GLM‑5 across benchmarks, quantization strategies, and RL pipelines.
Core Insight
Gemma‑4 demonstrates that careful architectural efficiency and high‑quality training can dramatically reduce the parameter budget needed for math‑reasoning and programming tasks: a 31B dense model matches or exceeds the performance of 200B‑plus models, and the 26B‑A4B variant achieves 97% of the 31B dense performance with only 3.8B active parameters.
Architecture
KV Sharing for Edge Models
In the edge variants (E2B, E4B), the last N layers reuse KV tensors from earlier layers, eliminating redundant K/V projections. For example, E2B (35 layers) shares KV in 20 layers (num_kv_shared_layers=20), while the full‑size 31B/26B‑A4B models keep KV sharing disabled (num_kv_shared_layers=0) because the information gain outweighs the memory cost at larger scales.
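A minimal sketch of how this kind of KV reuse could be wired into a decoder stack is shown below; the class and parameter names (DecoderStack, num_kv_shared_layers) are illustrative, not Gemma‑4's actual implementation, and the MLP/normalization details are omitted.

```python
import torch
import torch.nn as nn

class DecoderStack(nn.Module):
    """Illustrative decoder stack: the last `num_kv_shared_layers` layers reuse the
    K/V tensors produced by the last non-shared layer instead of computing their own."""

    def __init__(self, num_layers=35, num_kv_shared_layers=20, d_model=2048, n_heads=8):
        super().__init__()
        self.num_layers = num_layers
        self.kv_boundary = num_layers - num_kv_shared_layers   # first layer that shares KV
        self.q_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_layers)])
        # K/V projections only exist for the non-shared layers.
        self.k_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(self.kv_boundary)])
        self.v_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(self.kv_boundary)])
        # A single attention module is reused here purely for brevity.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        shared_k = shared_v = None
        for i in range(self.num_layers):
            q = self.q_proj[i](x)
            if i < self.kv_boundary:
                k, v = self.k_proj[i](x), self.v_proj[i](x)
                shared_k, shared_v = k, v          # remember the latest K/V for later layers
            else:
                k, v = shared_k, shared_v          # reuse earlier K/V, no new projections
            attn_out, _ = self.attn(q, k, v)
            x = x + attn_out                        # residual (MLP omitted for brevity)
        return x
```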
Five‑Stage Global‑Attention Compression
The global‑attention layers undergo a chain of optimizations:
GQA compression reduces the KV heads from 16 to 4 (31B) and from 8 to 2 (26B‑A4B), a 4:1 reduction.
The key dimension is doubled (global_head_dim=512) to compensate for the reduced number of heads.
Keys and values are tied (attention_k_eq_v=True), halving the KV cache size and acting as a regularizer.
p‑RoPE applies rotary encoding to only the top 25% of dimensions, preserving low‑frequency semantics (see the sketch after this list).
The final layer is forced to be global attention, guaranteeing full‑context visibility.
The design philosophy is to minimize global‑attention cost while retaining sufficient information.
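Two of these ideas, p‑RoPE over a fraction of the head dimensions and K=V tying, are sketched below for a single attention head; the function names, shapes, the 25% fraction's placement, and the choice to apply rotary encoding only on the query/key path are assumptions for illustration, not Gemma‑4's actual code.

```python
import torch

def apply_partial_rope(x, positions, rope_fraction=0.25, base=10000.0):
    """Rotate only the first `rope_fraction` of the head dimensions (p-RoPE);
    the remaining dimensions pass through unrotated. x: (seq, head_dim)."""
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * rope_fraction)             # dims that receive rotary encoding
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]

    half = rot_dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = positions[:, None].float() * freqs[None, :]          # (seq, half)
    cos, sin = angles.cos(), angles.sin()

    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)

def global_attention_k_eq_v(x, w_q, w_k, positions):
    """Global attention with tied K and V: one projection serves as both keys
    and values, halving the KV cache. x: (seq, d_model)."""
    q = apply_partial_rope(x @ w_q, positions)
    k = apply_partial_rope(x @ w_k, positions)
    v = x @ w_k                                    # K = V: reuse the key projection as values
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return scores.softmax(dim=-1) @ v
```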
Per‑Layer Embeddings (PLE)
Each decoder layer maintains its own small embedding table, turning a 5.1B‑parameter model into an effective 2.3B‑parameter compute graph; the extra 2.8B parameters reside in storage‑only embeddings, trading memory for compute.
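A minimal sketch of the PLE idea follows, assuming each layer looks up a small layer‑specific embedding of the token ids and adds a projection of it to the hidden state; the table size, projection, and class name are illustrative.

```python
import torch
import torch.nn as nn

class PerLayerEmbeddingBlock(nn.Module):
    """One decoder layer with its own small embedding table (PLE). The table adds
    storage-only parameters: its cost at runtime is a lookup plus a small projection."""

    def __init__(self, vocab_size=262144, d_model=2048, ple_dim=256):
        super().__init__()
        self.ple_table = nn.Embedding(vocab_size, ple_dim)        # small, layer-specific table
        self.ple_proj = nn.Linear(ple_dim, d_model, bias=False)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, hidden, token_ids):
        # Storage-heavy, compute-light: the per-layer embedding is mixed into the hidden state.
        hidden = hidden + self.ple_proj(self.ple_table(token_ids))
        return hidden + self.mlp(hidden)
```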
MoE + Dense MLP Dual Path
The 26B‑A4B model combines a standard dense MLP (intermediate_size=2112) with a routed MoE block of 128 experts (each moe_intermediate_size=704, selecting 8 per token). This hybrid provides a stable, non‑routed signal alongside the flexible MoE routing, unlike Qwen‑3's pure MoE or GLM‑5's mixed expert design.
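A minimal sketch of such a dual‑path feed‑forward block follows; the sizes (2112/704, 128 experts, top‑8) come from the text above, while the routing, gate normalization, and residual wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualPathFFN(nn.Module):
    """Feed-forward block combining an always-on dense MLP with a routed MoE path."""

    def __init__(self, d_model=2048, dense_inter=2112, moe_inter=704,
                 num_experts=128, top_k=8):
        super().__init__()
        self.dense = nn.Sequential(nn.Linear(d_model, dense_inter), nn.GELU(),
                                   nn.Linear(dense_inter, d_model))
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, moe_inter), nn.GELU(),
                          nn.Linear(moe_inter, d_model))
            for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, d_model)
        dense_out = self.dense(x)                  # stable, non-routed signal
        gates = self.router(x).softmax(dim=-1)
        top_w, top_idx = gates.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)
        moe_out = torch.zeros_like(x)
        for t in range(x.shape[0]):                # naive per-token loop, for clarity only
            for w, idx in zip(top_w[t], top_idx[t]):
                moe_out[t] += w * self.experts[int(idx)](x[t])
        return x + dense_out + moe_out             # residual plus both paths
```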
Training
Distillation Strategies
All three teams employ distillation, but the teacher’s capacity differs:
Qwen‑3 uses a 235B‑A22B teacher (strong‑to‑weak distillation, both off‑policy and on‑policy); a minimal sketch of the on‑policy variant follows this list.
GLM‑5 applies on‑policy cross‑stage distillation across reasoning, agentic, and general RL phases to avoid catastrophic forgetting.
Gemma‑4’s pipeline is based on Gemini 3 research; although the exact data and scale are undisclosed, it follows a chain‑of‑thought distillation workflow.
Training Pipelines
Each model follows a multi‑stage training schedule, with different emphases:
Qwen‑3 (36T tokens): a General stage (30T tokens, 4K context), a Reasoning stage (5T tokens, heavier on STEM/code), and a Long‑Context stage (the remaining hundreds of billions of tokens, 4K→32K context) using ABF, YaRN, and Dual Chunk Attention (see the RoPE‑base sketch after this list).
GLM‑5 (28.5T tokens): Base training (27T tokens, code & reasoning focus), Mid‑training (4K→200K context expansion for agentic data), and Post‑training with sequential RL (Reasoning → Agentic → General).
Gemma‑4 : Teacher‑driven chain‑of‑thought distillation; exact token count not released.
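The sketch below illustrates the ABF (adjusted base frequency) step mentioned for Qwen‑3's long‑context stage: raising the RoPE base so the slowest‑rotating dimensions cover a much longer span before repeating. The concrete base values (10K→1M) and head dimension are illustrative, and YaRN and Dual Chunk Attention are not shown.

```python
import math
import torch

def rope_inv_freq(head_dim, base):
    """Inverse frequencies for standard RoPE with a given base."""
    return base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)

# ABF: before long-context training, raise the RoPE base so the lowest-frequency
# dimension has a far longer period, keeping distant positions distinguishable.
short_ctx = rope_inv_freq(head_dim=128, base=10_000.0)
long_ctx  = rope_inv_freq(head_dim=128, base=1_000_000.0)

longest_wavelength = lambda inv_freq: 2 * math.pi / inv_freq.min()
print(longest_wavelength(short_ctx))   # roughly 5.5e4 positions before the slowest dim repeats
print(longest_wavelength(long_ctx))    # roughly 5.0e6 positions with the raised base
```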
Quantization
Gemma‑4 provides a Quantization‑Aware Training (QAT) checkpoint (NVFP4 4‑bit float) that introduces quantization noise during training, yielding minimal accuracy loss. Qwen‑3 and GLM‑5 rely on post‑training quantization (GPTQ/AWQ) and, for GLM‑5, an official FP8 weight set.
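A minimal sketch of the general QAT mechanism follows: weights are fake‑quantized in the forward pass while gradients flow through a straight‑through estimator. The symmetric int4 grid here is a stand‑in; NVFP4's actual 4‑bit float format is not reproduced.

```python
import torch

def fake_quant_4bit(w):
    """Quantize-dequantize a weight tensor to a symmetric 4-bit grid for the forward
    pass, with a straight-through estimator (STE) for the backward pass."""
    scale = w.abs().max() / 7.0                        # symmetric 4-bit integer range [-7, 7]
    w_q = torch.clamp(torch.round(w / scale), -7, 7) * scale
    # STE: forward uses the quantized weights, gradients flow as if through identity.
    return w + (w_q - w).detach()

class QATLinear(torch.nn.Linear):
    """Linear layer that sees quantization noise during training."""
    def forward(self, x):
        return torch.nn.functional.linear(x, fake_quant_4bit(self.weight), self.bias)
```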
Multimodal Support
Gemma‑4 integrates a ViT‑based visual encoder with 2D RoPE and a conformer‑based audio encoder, training them jointly with text. GLM‑5 accesses vision/audio via tool‑calling to dedicated models, while Qwen‑3 delegates multimodal tasks to a separate VL series.
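A minimal sketch of 2D RoPE over image patches is shown below, assuming half of each head's dimensions are rotated by the patch's row index and the other half by its column index; the grid size, head dimension, and split are illustrative, not Gemma‑4's actual encoder layout.

```python
import torch

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary encoding along the last dimension. x: (tokens, dim)."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = pos[:, None].float() * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_2d(x, rows, cols):
    """2D RoPE: the first half of the head dim encodes the row position,
    the second half encodes the column position."""
    d = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :d], rows), rope_1d(x[..., d:], cols)], dim=-1)

# Example: a 14x14 patch grid flattened to 196 tokens with head_dim=64.
grid = torch.arange(196)
rows, cols = grid // 14, grid % 14
q = torch.randn(196, 64)
q_rot = rope_2d(q, rows, cols)
```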
Benchmark Comparison
Key results (publicly verifiable numbers) show:
Gemma‑4 31B: AIME 2026 89.2%, MMLU Pro 85.2%.
Gemma‑4 26B‑A4B: AIME 2026 88.3% (0.9 pp lower than the 31B), MMLU Pro 82.6% with only 3.8B active parameters.
Qwen‑3 235B‑A22B: AIME 2024 85.7% / AIME 2025 81.5%.
GLM‑5 744B: AIME 2025 93.3%, MMLU Pro 80.6%, SWE‑bench 77.8% (open‑source SOTA on agentic tasks).
Observations:
GLM‑5 leads on complex planning benchmarks, confirming that total parameter count still correlates with agentic capability.
Gemma‑4 26B‑A4B sacrifices <1% performance for an 8× reduction in active parameters, highlighting the efficiency of its architectural choices.
Takeaways
“Small model + large teacher” distillation yields strong performance, but its ceiling is set by the capability of the (often closed‑source) teacher.
RL engineering (e.g., GLM‑5’s slime asynchronous framework) can create noticeable gaps in agentic benchmarks.
Choosing a model depends on deployment constraints: Gemma‑4 prioritizes inference efficiency, Qwen‑3 offers a balanced, well‑supported stack, and GLM‑5 excels in long‑context agentic scenarios.
Conclusion
Gemma‑4’s suite of efficiency‑focused techniques proves that parameter‑efficient LLMs can rival much larger dense models on reasoning and coding tasks, yet the ceiling for complex planning still favors massive models like GLM‑5. Understanding the trade‑offs of each architectural and training choice is more valuable than merely chasing raw parameter counts.