How DeepSeek V4 Advances Structured Optimization in the Large‑Model Era
This article analyses DeepSeek V4's architectural innovations (Compressed Sparse Attention, Heavily Compressed Attention, a cross-layer MoE design, and an Agent-RL framework built on Generative Reward Models and multi-teacher distillation) and compares its long-context capability and efficiency with rival LLMs such as GLM, Kimi, Claude, GPT, and Gemini.
Performance Boost – Long‑Context
DeepSeek V4 extends the DeepSeek Sparse Attention (DSA) introduced in V3.2 with two mechanisms: Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). Both target the one-million-token context bottleneck.
Compressed Sparse Attention (CSA)
CSA operates in two stages:
Compression stage: consecutive tokens are grouped and compressed into fewer “entries”.
Sparse stage: sparse attention is performed on the compressed entries.
Compression shrinks the memory footprint and makes sparse selection cheap, while attention over the selected entries preserves fine-grained detail.
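A minimal single-query sketch of the two stages, assuming mean-pooling compression and top-k entry selection; the actual kernel, group size, and selection rule have not been published:

```python
import torch
import torch.nn.functional as F

def csa(q, k, v, group: int = 16, top_k: int = 8):
    """One query step. q: (1, d); k, v: (T, d)."""
    T, d = k.shape
    # Compression stage: group consecutive tokens and pool each group into one entry.
    pad = (-T) % group
    k_c = F.pad(k, (0, 0, 0, pad)).view(-1, group, d).mean(dim=1)  # (T/group, d)
    v_c = F.pad(v, (0, 0, 0, pad)).view(-1, group, d).mean(dim=1)
    # Sparse stage: score all compressed entries, attend only over the top-k.
    scores = (q @ k_c.T) / d ** 0.5                                # (1, T/group)
    idx = scores.topk(min(top_k, k_c.shape[0]), dim=-1).indices.squeeze(0)
    attn = F.softmax(scores[:, idx], dim=-1)
    return attn @ v_c[idx]                                         # (1, d)

out = csa(torch.randn(1, 64), torch.randn(1000, 64), torch.randn(1000, 64))
print(out.shape)  # torch.Size([1, 64])
```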
Heavily Compressed Attention (HCA)
HCA applies extreme compression (e.g., 128:1) to produce a short sequence on which global dense attention is computed. HCA provides a “telescope” view of the whole context.
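The same idea in miniature, again assuming simple average pooling as the compressor (the real compressor is presumably learned and is not public):

```python
import torch
import torch.nn.functional as F

def hca(q, k, v, ratio: int = 128):
    """q: (1, d); k, v: (T, d). Dense attention over a ratio-times-shorter sequence."""
    T, d = k.shape
    pad = (-T) % ratio
    k_c = F.pad(k, (0, 0, 0, pad)).view(-1, ratio, d).mean(dim=1)  # (T/128, d)
    v_c = F.pad(v, (0, 0, 0, pad)).view(-1, ratio, d).mean(dim=1)
    # Global dense attention becomes affordable: a 1M-token context shrinks to ~8K entries.
    attn = F.softmax((q @ k_c.T) / d ** 0.5, dim=-1)
    return attn @ v_c                                              # (1, d)

print(hca(torch.randn(1, 64), torch.randn(1000, 64), torch.randn(1000, 64)).shape)
```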
CSA and HCA are not alternatives; they are alternately stacked, giving both microscopic precision (CSA) and macroscopic structure (HCA).
Mixture‑of‑Experts (MoE) Architecture with Cross‑Layer Information Exchange
The MoE backbone is augmented with Heavy Compression (HC) and meta‑Heavy Compression (mHC) layers. The combination MoE + HC + mHC creates a super‑large MoE that learns simplified topologies across layers, enabling cross‑layer collaboration.
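HC and mHC have not been described in any detail, so the following is only a speculative sketch of cross-layer information exchange: a compression module distills one layer's hidden states into a small summary that a later layer's router can read. Every class name here is hypothetical.

```python
import torch
import torch.nn as nn

class HCSummary(nn.Module):
    """Hypothetical HC module: compress a layer's hidden states into one summary vector."""
    def __init__(self, d: int, d_sum: int = 64):
        super().__init__()
        self.down = nn.Linear(d, d_sum)

    def forward(self, h):                 # h: (T, d)
        return self.down(h.mean(dim=0))   # (d_sum,) message passed to later layers

class CrossLayerMoE(nn.Module):
    """Hypothetical MoE layer whose router also sees an earlier layer's summary."""
    def __init__(self, d: int, d_sum: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d + d_sum, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))

    def forward(self, h, summary):        # h: (T, d); summary: (d_sum,)
        ctx = summary.expand(h.shape[0], -1)
        choice = self.router(torch.cat([h, ctx], dim=-1)).argmax(dim=-1)  # top-1 for brevity
        return torch.stack([self.experts[int(e)](tok) for tok, e in zip(h, choice)])

summary = HCSummary(32)(torch.randn(10, 32))
print(CrossLayerMoE(32, 64, 4)(torch.randn(10, 32), summary).shape)  # (10, 32)
```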
Hash routing is employed to improve expert utilization.
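A toy version of hash routing, in the spirit of hash-layer approaches that assign tokens to experts by a fixed hash of the token ID; V4's exact scheme is not public:

```python
def hash_route(token_id: int, n_experts: int) -> int:
    # A fixed hash of the token ID spreads tokens uniformly across experts,
    # balancing utilization by construction, with no learned router to collapse
    # and no auxiliary load-balancing loss.
    return (token_id * 2654435761) % n_experts  # Knuth multiplicative hash

print([hash_route(t, 8) for t in (17, 42, 99, 1024)])
```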
Agent Reinforcement Learning Pipeline
DeepSeek V4 adopts the Generative Reward Model (GRM) introduced by DeepMind in 2024, which merges the RLAIF and RLHF paradigms. During mid-training, agent data are incorporated directly, with the emphasis on high-quality data.
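In practice, a generative reward model writes a critique and then emits a score that is parsed out as the reward. The template, the `generate` callable, and the parsing below are illustrative placeholders, not DeepSeek's GRM:

```python
import re

JUDGE_TEMPLATE = """Evaluate the assistant's answer step by step.
Question: {question}
Answer: {answer}
Write a short critique, then end with 'Score: <1-10>'."""

def grm_score(question: str, answer: str, generate) -> float:
    # The judge model first generates a critique, then a numeric verdict.
    critique = generate(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"Score:\s*(\d+)", critique)
    return float(match.group(1)) if match else 0.0

# Usage with a stub judge standing in for a real LLM call:
print(grm_score("2+2?", "4", lambda p: "Correct reasoning. Score: 9"))  # 9.0
```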
In a specialist training stage, the model is fine-tuned for tool-use scenarios, and the output format switches from Markdown to DeepSeek XML (DSML) to improve tool-call accuracy.
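To see why moving away from Markdown can help, compare a free-form call with an XML-delimited one. The `dsml:` tags below are invented for illustration; the real DSML schema has not been published:

```python
import xml.etree.ElementTree as ET

markdown_call = 'Call the search tool: {"tool": "search", "query": "DeepSeek V4"}'
# Free-form JSON embedded in prose must be located and parsed heuristically.

dsml_call = """<dsml:invoke name="search">
  <dsml:param name="query">DeepSeek V4</dsml:param>
</dsml:invoke>"""
# XML delimits the call unambiguously and can be validated mechanically.

root = ET.fromstring(dsml_call.replace("dsml:", ""))  # drop the prefix for this sketch
print(root.get("name"), root.find("param").text)      # search DeepSeek V4
```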
Multi-teacher on-policy distillation (OPD) fuses multiple teacher models with on-policy learning, raising the student model's performance ceiling while retaining the efficiency of distillation.
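A minimal sketch of such a loss, assuming the teachers are fused by simply averaging their logits (how V4 actually combines them is not public). The key property is that teachers score the student's own sampled tokens, which is what makes the distillation on-policy:

```python
import torch
import torch.nn.functional as F

def mt_opd_loss(student_logits, teacher_logits_list):
    """student_logits: (T, V); each teacher tensor: (T, V), computed on the
    student's own sampled trajectory rather than on teacher-generated text."""
    teacher = torch.stack(teacher_logits_list).mean(dim=0)  # fuse teachers by averaging
    s = F.log_softmax(student_logits, dim=-1)
    t = F.log_softmax(teacher, dim=-1)
    # Reverse KL(student || teacher): mode-seeking, the usual choice for on-policy distillation.
    return (s.exp() * (s - t)).sum(dim=-1).mean()

loss = mt_opd_loss(torch.randn(5, 100), [torch.randn(5, 100) for _ in range(3)])
print(loss.item())
```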
Structured Optimization Pipeline
The pipeline combines GRM, multi-teacher OPD, and the DSec sandbox (currently Python-only) to enable AlphaZero-style agent evaluation for LLMs and high-quality process assessment. The core formula is:
GRM + MT‑OPD + DSec Sandbox
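A stub wiring of that formula, with every component mocked out; the class and method names are placeholders, not DeepSeek's actual APIs:

```python
class Sandbox:                        # DSec-like: executes agent code in isolation
    def run(self, code): return f"ran: {code}"

class GRM:                            # generative reward model, stubbed as a containment check
    def score(self, task, traj): return 1.0 if task in traj else 0.0

class Student:
    def rollout(self, task, sandbox): return sandbox.run(task)
    def update(self, traj, reward, teacher_fb): pass  # RL + MT-OPD step would go here

def train_step(task, student, teachers, sandbox, grm):
    traj = student.rollout(task, sandbox)   # on-policy rollout inside the sandbox
    reward = grm.score(task, traj)          # process-level generative reward
    fb = [t(traj) for t in teachers]        # multi-teacher feedback on the same rollout
    student.update(traj, reward, fb)
    return reward

print(train_step("print(1+1)", Student(), [len], Sandbox(), GRM()))  # 1.0
```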
Comparison with Other Models (Long‑Context / MoE / Agent RL)
GLM 5.1 – uses DSA for long context.
Kimi 2.5 – extends context to 262K tokens and employs a parallel Agent Swarm.
Claude Opus 4.7 – rumored to combine parallel agents with CSA‑like global focus.
GPT 5.5 – rumored to integrate parallel agents and HCA‑like compression.
Gemini 3.1 – reportedly enhances external memory to reach 2 M tokens.
References
Technical DeepSeek article: https://magazine.sebastianraschka.com/p/technical-deepseek
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
