How DeepSeek V4 Advances Structured Optimization in the Large‑Model Era
This article analyses DeepSeek V4's architectural innovations (Compressed Sparse Attention, Heavily Compressed Attention, a cross-layer MoE design, and an Agent-RL framework built on Generative Reward Models and multi-teacher distillation) and compares its long-context capability and efficiency with rival LLMs such as GLM, Kimi, Claude, GPT, and Gemini.
Performance Boost – Long‑Context
DeepSeek V4 extends the DeepSeek Sparse Attention (DSA) introduced in V3.2 with two mechanisms: Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). Both target the one-million-token context bottleneck.
Compressed Sparse Attention (CSA)
CSA operates in two stages:
Compression stage: consecutive tokens are grouped and compressed into fewer “entries”.
Sparse stage: sparse attention is performed on the compressed entries.
Compression shrinks the memory footprint and makes sparse selection cheap, while attention over the selected entries preserves fine-grained detail.
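A minimal single-query sketch of the two stages, assuming mean-pooling compression and top-k entry selection; the actual kernel, group size, and selection rule have not been published:

```python
import torch
import torch.nn.functional as F

def csa(q, k, v, group: int = 16, top_k: int = 8):
    """One query step. q: (1, d); k, v: (T, d)."""
    T, d = k.shape
    # Compression stage: group consecutive tokens and pool each group into one entry.
    pad = (-T) % group
    k_c = F.pad(k, (0, 0, 0, pad)).view(-1, group, d).mean(dim=1)  # (T/group, d)
    v_c = F.pad(v, (0, 0, 0, pad)).view(-1, group, d).mean(dim=1)
    # Sparse stage: score all compressed entries, attend only over the top-k.
    scores = (q @ k_c.T) / d ** 0.5                                # (1, T/group)
    idx = scores.topk(min(top_k, k_c.shape[0]), dim=-1).indices.squeeze(0)
    attn = F.softmax(scores[:, idx], dim=-1)
    return attn @ v_c[idx]                                         # (1, d)

out = csa(torch.randn(1, 64), torch.randn(1000, 64), torch.randn(1000, 64))
print(out.shape)  # torch.Size([1, 64])
```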
Heavily Compressed Attention (HCA)
HCA applies extreme compression (e.g., 128:1) to produce a short sequence on which global dense attention is computed. HCA provides a “telescope” view of the whole context.
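The same idea in miniature, again assuming simple average pooling as the compressor (the real compressor is presumably learned and is not public):

```python
import torch
import torch.nn.functional as F

def hca(q, k, v, ratio: int = 128):
    """q: (1, d); k, v: (T, d). Dense attention over a ratio-times-shorter sequence."""
    T, d = k.shape
    pad = (-T) % ratio
    k_c = F.pad(k, (0, 0, 0, pad)).view(-1, ratio, d).mean(dim=1)  # (T/128, d)
    v_c = F.pad(v, (0, 0, 0, pad)).view(-1, ratio, d).mean(dim=1)
    # Global dense attention becomes affordable: a 1M-token context shrinks to ~8K entries.
    attn = F.softmax((q @ k_c.T) / d ** 0.5, dim=-1)
    return attn @ v_c                                              # (1, d)

print(hca(torch.randn(1, 64), torch.randn(1000, 64), torch.randn(1000, 64)).shape)
```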
CSA and HCA are not alternatives; they are alternately stacked, giving both microscopic precision (CSA) and macroscopic structure (HCA).
Mixture‑of‑Experts (MoE) Architecture with Cross‑Layer Information Exchange
The MoE backbone is augmented with Heavy Compression (HC) and meta‑Heavy Compression (mHC) layers. The combination MoE + HC + mHC creates a super‑large MoE that learns simplified topologies across layers, enabling cross‑layer collaboration.
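HC and mHC have not been described in any detail, so the following is only a speculative sketch of cross-layer information exchange: a compression module distills one layer's hidden states into a small summary that a later layer's router can read. Every class name here is hypothetical.

```python
import torch
import torch.nn as nn

class HCSummary(nn.Module):
    """Hypothetical HC module: compress a layer's hidden states into one summary vector."""
    def __init__(self, d: int, d_sum: int = 64):
        super().__init__()
        self.down = nn.Linear(d, d_sum)

    def forward(self, h):                 # h: (T, d)
        return self.down(h.mean(dim=0))   # (d_sum,) message passed to later layers

class CrossLayerMoE(nn.Module):
    """Hypothetical MoE layer whose router also sees an earlier layer's summary."""
    def __init__(self, d: int, d_sum: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d + d_sum, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))

    def forward(self, h, summary):        # h: (T, d); summary: (d_sum,)
        ctx = summary.expand(h.shape[0], -1)
        choice = self.router(torch.cat([h, ctx], dim=-1)).argmax(dim=-1)  # top-1 for brevity
        return torch.stack([self.experts[int(e)](tok) for tok, e in zip(h, choice)])

summary = HCSummary(32)(torch.randn(10, 32))
print(CrossLayerMoE(32, 64, 4)(torch.randn(10, 32), summary).shape)  # (10, 32)
```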
Hash routing is employed to improve expert utilization.
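A toy version of hash routing, in the spirit of hash-layer approaches that assign tokens to experts by a fixed hash of the token ID; V4's exact scheme is not public:

```python
def hash_route(token_id: int, n_experts: int) -> int:
    # A fixed hash of the token ID spreads tokens uniformly across experts,
    # balancing utilization by construction, with no learned router to collapse
    # and no auxiliary load-balancing loss.
    return (token_id * 2654435761) % n_experts  # Knuth multiplicative hash

print([hash_route(t, 8) for t in (17, 42, 99, 1024)])
```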
Agent Reinforcement Learning Pipeline
DeepSeek V4 adopts the Generative Reward Model (GRM) introduced by DeepMind in 2024, which merges the RLAIF and RLHF paradigms. During mid-training, agent data are incorporated directly, with the emphasis on high-quality data.
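In practice, a generative reward model writes a critique and then emits a score that is parsed out as the reward. The template, the `generate` callable, and the parsing below are illustrative placeholders, not DeepSeek's GRM:

```python
import re

JUDGE_TEMPLATE = """Evaluate the assistant's answer step by step.
Question: {question}
Answer: {answer}
Write a short critique, then end with 'Score: <1-10>'."""

def grm_score(question: str, answer: str, generate) -> float:
    # The judge model first generates a critique, then a numeric verdict.
    critique = generate(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"Score:\s*(\d+)", critique)
    return float(match.group(1)) if match else 0.0

# Usage with a stub judge standing in for a real LLM call:
print(grm_score("2+2?", "4", lambda p: "Correct reasoning. Score: 9"))  # 9.0
```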
In a specialist training stage, the model is fine-tuned for tool-use scenarios, and the output format switches from Markdown to DeepSeek XML (DSML) to improve tool-call accuracy.
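To see why moving away from Markdown can help, compare a free-form call with an XML-delimited one. The `dsml:` tags below are invented for illustration; the real DSML schema has not been published:

```python
import xml.etree.ElementTree as ET

markdown_call = 'Call the search tool: {"tool": "search", "query": "DeepSeek V4"}'
# Free-form JSON embedded in prose must be located and parsed heuristically.

dsml_call = """<dsml:invoke name="search">
  <dsml:param name="query">DeepSeek V4</dsml:param>
</dsml:invoke>"""
# XML delimits the call unambiguously and can be validated mechanically.

root = ET.fromstring(dsml_call.replace("dsml:", ""))  # drop the prefix for this sketch
print(root.get("name"), root.find("param").text)      # search DeepSeek V4
```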
Multi-teacher on-policy distillation (OPD) fuses multiple teacher models with on-policy learning, raising the student model's performance ceiling while retaining the efficiency of distillation.
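A minimal sketch of such a loss, assuming the teachers are fused by simply averaging their logits (how V4 actually combines them is not public). The key property is that teachers score the student's own sampled tokens, which is what makes the distillation on-policy:

```python
import torch
import torch.nn.functional as F

def mt_opd_loss(student_logits, teacher_logits_list):
    """student_logits: (T, V); each teacher tensor: (T, V), computed on the
    student's own sampled trajectory rather than on teacher-generated text."""
    teacher = torch.stack(teacher_logits_list).mean(dim=0)  # fuse teachers by averaging
    s = F.log_softmax(student_logits, dim=-1)
    t = F.log_softmax(teacher, dim=-1)
    # Reverse KL(student || teacher): mode-seeking, the usual choice for on-policy distillation.
    return (s.exp() * (s - t)).sum(dim=-1).mean()

loss = mt_opd_loss(torch.randn(5, 100), [torch.randn(5, 100) for _ in range(3)])
print(loss.item())
```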
Structured Optimization Pipeline
The pipeline combines GRM, multi-teacher OPD, and the DSec sandbox (currently Python-only) to enable AlphaZero-style agent evaluation for LLMs and high-quality process assessment. The core formula is:
GRM + MT‑OPD + DSec Sandbox
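A stub wiring of that formula, with every component mocked out; the class and method names are placeholders, not DeepSeek's actual APIs:

```python
class Sandbox:                        # DSec-like: executes agent code in isolation
    def run(self, code): return f"ran: {code}"

class GRM:                            # generative reward model, stubbed as a containment check
    def score(self, task, traj): return 1.0 if task in traj else 0.0

class Student:
    def rollout(self, task, sandbox): return sandbox.run(task)
    def update(self, traj, reward, teacher_fb): pass  # RL + MT-OPD step would go here

def train_step(task, student, teachers, sandbox, grm):
    traj = student.rollout(task, sandbox)   # on-policy rollout inside the sandbox
    reward = grm.score(task, traj)          # process-level generative reward
    fb = [t(traj) for t in teachers]        # multi-teacher feedback on the same rollout
    student.update(traj, reward, fb)
    return reward

print(train_step("print(1+1)", Student(), [len], Sandbox(), GRM()))  # 1.0
```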
Comparison with Other Models (Long‑Context / MoE / Agent RL)
GLM 5.1 – uses DSA for long context.
Kimi 2.5 – extends context to 262K tokens and employs a parallel Agent Swarm.
Claude Opus 4.7 – rumored to combine parallel agents with CSA‑like global focus.
GPT 5.5 – rumored to integrate parallel agents and HCA‑like compression.
Gemini 3.1 – reportedly enhances external memory to reach 2 M tokens.
References
Technical DeepSeek article: https://magazine.sebastianraschka.com/p/technical-deepseek
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
