DeepSeek V4 Unveiled: 1M‑Token Context and New Architecture Challenge Closed‑Source LLMs
DeepSeek V4 introduces two flagship models—V4‑Pro with 1.6 T parameters and V4‑Flash with 284 B parameters—offering million‑token context, mixed attention (CSA + HCA), manifold‑constrained residuals, and the Muon optimizer, delivering open‑source performance that rivals top closed‑source LLMs while cutting inference cost dramatically.
Model variants
The DeepSeek V4 series comprises two models: V4‑Pro, with 1.6 T total parameters and 49 B activated parameters, and V4‑Flash, with 284 B total parameters and 13 B activated parameters.
Performance benchmarks
Agentic coding ability surpasses Sonnet 4.5 and approaches Opus 4.6.
World‑knowledge scores narrow the gap to Gemini‑Pro‑3.1.
Logical reasoning on mathematics, STEM, and competitive coding outperforms other open‑source models.
Core architectural innovations
Mixed attention (CSA + HCA): Compressed Sparse Attention (CSA) compresses the KV cache by a factor of 4, merging every four tokens into one entry and applying DSA sparse attention; a Lightning Indexer scores the compressed entries in FP4 precision and selects the top 1024 per query token. Hierarchical Compressed Attention (HCA) compresses further, by 128×, without sparse selection, preserving a global view. This “long‑short” strategy reduces compute and memory for 1 M‑token contexts.
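To make the CSA mechanics concrete, here is a minimal PyTorch sketch of the two steps: block compression of the KV cache and indexer‑based top‑k selection. The function names, mean‑pooling merge, and full‑precision scoring are illustrative assumptions; the production indexer runs in FP4.

```python
import torch

def compress_kv(k, v, block=4):
    """Merge every `block` consecutive tokens into one compressed entry.
    Mean pooling is an assumption; the report may use a learned merge."""
    t = (k.shape[0] // block) * block           # drop any ragged tail block
    k_c = k[:t].reshape(-1, block, k.shape[-1]).mean(dim=1)
    v_c = v[:t].reshape(-1, block, v.shape[-1]).mean(dim=1)
    return k_c, v_c

def lightning_index(q, k_c, top_k=1024):
    """Score every compressed entry against each query token and keep
    the top_k indices, mirroring the Lightning Indexer's role."""
    scores = q @ k_c.T                          # (T, C) similarity scores
    k_eff = min(top_k, k_c.shape[0])            # short contexts keep everything
    return scores.topk(k_eff, dim=-1).indices   # (T, k_eff)
```

Attention then runs only over each query's selected entries, which is where the compute and memory savings at 1 M tokens come from.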
Manifold‑constrained residual connections (mHC): Residual matrices are projected onto the Birkhoff polytope using 20 Sinkhorn‑Knopp iterations, keeping the spectral norm ≤ 1. The projection adds ~6.7 % overhead.
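A minimal sketch of the projection, assuming the matrix is made positive with an elementwise exponential before the alternating row/column normalizations (the report's exact parameterization is not specified here):

```python
import torch

def birkhoff_project(w, n_iters=20, eps=1e-8):
    """Project a residual-mixing matrix toward the Birkhoff polytope
    (doubly stochastic matrices) via Sinkhorn-Knopp."""
    m = torch.exp(w)                                # ensure strict positivity
    for _ in range(n_iters):                        # 20 iterations per the report
        m = m / (m.sum(dim=1, keepdim=True) + eps)  # normalize rows
        m = m / (m.sum(dim=0, keepdim=True) + eps)  # normalize columns
    return m
```

The spectral‑norm bound follows from Birkhoff's theorem: a doubly stochastic matrix is a convex combination of permutation matrices, each of which has spectral norm 1.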
Muon optimizer: Orthogonalizes gradient momentum via Newton‑Schulz iterations (10 mixed iterations: 8 fast‑converging, 2 for fine stability).
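A sketch of the orthogonalization step, modeled on the public Muon implementation. Mapping the quintic polynomial to the 8 “fast” iterations and the plain cubic Newton‑Schulz step to the 2 “stability” iterations is an assumption:

```python
import torch

def muon_orthogonalize(m, fast_iters=8, stable_iters=2, eps=1e-7):
    """Drive the singular values of a momentum matrix toward 1,
    approximating its orthogonal polar factor."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients from public Muon
    x = m / (m.norm() + eps)            # scale singular values into range
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                          # iterate on the short side
    for _ in range(fast_iters):          # fast-converging phase
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    for _ in range(stable_iters):        # gentler cubic phase for stability
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x.T if transposed else x
```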
Training methodology
On‑Policy Distillation (OPD) replaces the previous RL mix: domain‑specific experts (math, code, agent) are trained separately, then a student model distills from dozens of experts at once. Only the experts' final hidden states are stored; their logits are recomputed on the fly with a TileLang kernel.
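The distillation objective itself is not spelled out above; a common choice for on‑policy distillation is the reverse KL between student and teacher on sequences the student sampled. The sketch below assumes that choice, with a single teacher for brevity:

```python
import torch.nn.functional as F

def opd_loss(student_logits, teacher_logits, temperature=1.0):
    """Reverse KL(student || teacher) per token, averaged over a batch
    of student-sampled sequences. The temperature is an assumption."""
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    # teacher_logits would be recomputed on the fly from stored final
    # hidden states (the TileLang kernel's job, per the report)
    return (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1).mean()
```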
Stability techniques:
Anticipatory Routing decouples routing‑index computation from the backbone and triggers on loss spikes.
SwiGLU Clamping clips the linear branch to [-10, 10] and caps gate values at 10 (see the sketch after this list).
Generative Reward Model lets the actor network serve as its own reward evaluator.
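A minimal sketch of the clamped SwiGLU block described above; the exact placement of the clamps relative to the activation is an assumption:

```python
import torch
import torch.nn.functional as F

def clamped_swiglu(x, w_gate, w_up, w_down, limit=10.0):
    """SwiGLU with the reported stability clamps: the gate branch is
    capped at `limit`, the linear branch clipped to [-limit, limit]."""
    gate = torch.clamp(x @ w_gate, max=limit)           # cap gate values at 10
    up = torch.clamp(x @ w_up, min=-limit, max=limit)   # clip linear branch
    return (F.silu(gate) * up) @ w_down
```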
FP4‑aware quantization is applied to the MoE expert weights and the CSA indexer; FP4→FP8 de‑quantization is lossless because every FP4 value is exactly representable in FP8.
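The losslessness claim is easy to verify: FP4 (E2M1) has only 16 code points, all of which are exact FP8 values. A sketch of codebook de‑quantization, with the block scaling factors used in practice omitted for brevity:

```python
import torch

# The eight non-negative E2M1 (FP4) values; with signs, 16 code points.
FP4_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_VALUES = torch.cat([FP4_VALUES, -FP4_VALUES])

def dequantize_fp4_to_fp8(codes):
    """Look up 4-bit codes and cast to FP8 E4M3; every FP4 value is
    exactly representable in FP8, so no rounding occurs."""
    return FP4_VALUES[codes.long()].to(torch.float8_e4m3fn)
```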
Inference efficiency
Per‑token compute for V4‑Pro is 27 % of its predecessor's. KV‑cache usage drops to ~2 % of a BF16 GQA8 baseline. Mixed‑precision storage keeps RoPE components in BF16 and other tensors in FP8, halving memory volume. The serving system separates compressed KV from sliding‑window KV, supports disk‑level caching, and avoids redundant prefills.
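For a sense of scale, a back‑of‑envelope sizing of the baseline cache; the head count, head dimension, and layer depth below are illustrative assumptions, not reported figures:

```python
def kv_bytes(tokens, n_kv_heads, head_dim, n_layers, bytes_per_elem=2):
    """Bytes for a KV cache; the leading 2 counts both K and V."""
    return 2 * tokens * n_kv_heads * head_dim * n_layers * bytes_per_elem

baseline = kv_bytes(1_000_000, n_kv_heads=8, head_dim=128, n_layers=61)
print(f"BF16 GQA8 baseline: {baseline / 2**30:.0f} GiB")         # ~233 GiB
print(f"at ~2% of baseline: {0.02 * baseline / 2**30:.1f} GiB")  # ~4.7 GiB
```

At ~2 %, a cache that would not fit on a single accelerator shrinks to a few gigabytes, which is what makes 1 M‑token serving practical.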
V4‑Flash trade‑offs
V4‑Flash retains comparable agentic performance at a lower parameter budget and API cost, but its world‑knowledge depth trails V4‑Pro slightly.
API changes
New model identifiers: deepseek‑v4‑pro (performance‑oriented) and deepseek‑v4‑flash (efficiency‑oriented). Legacy names deepseek‑chat and deepseek‑reasoner will be deprecated on 2026‑07‑24.
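DeepSeek's existing API is OpenAI‑compatible; assuming that carries over, a call against the new identifiers would look like the following sketch (the endpoint and client usage mirror the current API; only the model names come from this announcement):

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-v4-pro",  # or "deepseek-v4-flash" for lower cost
    messages=[{"role": "user", "content": "Explain CSA attention briefly."}],
)
print(resp.choices[0].message.content)
```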
Resources
Technical report: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
Model hubs: https://huggingface.co/collections/deepseek-ai/deepseek-v4 and https://modelscope.cn/collections/deepseek-ai/DeepSeek-V4
API documentation for “thinking mode”: https://api-docs.deepseek.com/zh-cn/guides/thinking_mode
