DeepSeek‑V4 Deep Dive: Engineering Million‑Token Context Efficiency
The article provides a thorough technical analysis of DeepSeek‑V4, detailing how mixed sparse attention (CSA + HCA), manifold‑constrained hyper‑connections, the Muon optimizer, FP4 quantization, and a suite of infrastructure tricks enable stable training and inference with up to one‑million token contexts while achieving state‑of‑the‑art benchmark results.
Architecture: three upgrades over V3
DeepSeek‑V4 retains the Transformer + DeepSeekMoE + MTP backbone but introduces three key changes:
Attention: Replaces V3’s MLA/DSA with a hybrid of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA).
Residual connection: Swaps the standard residual for manifold‑constrained hyper‑connections (mHC).
Optimizer: Replaces AdamW with the Muon optimizer (embedding and head still use AdamW).
MoE routing: Moves from sigmoid + top‑k gating to sqrt(softplus) gating with no node‑routing limit (see the sketch after this list).
Early FFN layers: Change from dense to MoE + hash routing.
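The gating change is easy to picture in code. Below is a minimal sketch, assuming a plain linear router and a hypothetical top‑k of 8; only the sqrt(softplus) scoring and the absence of a node‑routing limit come from the article itself.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, top_k=8):
    """Hypothetical V4-style MoE routing: sqrt(softplus) scores, plain top-k.

    hidden:        [num_tokens, d_model] token representations
    router_weight: [num_experts, d_model] router projection (name assumed)
    """
    logits = hidden @ router_weight.t()          # [num_tokens, num_experts]
    # sqrt(softplus) keeps scores positive and unbounded above,
    # unlike V3's saturating sigmoid gate.
    scores = torch.sqrt(F.softplus(logits))
    top_scores, top_idx = scores.topk(top_k, dim=-1)
    # Normalise the selected gates into mixing weights.
    gates = top_scores / top_scores.sum(dim=-1, keepdim=True)
    # No node-routing limit: experts are taken exactly as ranked.
    return top_idx, gates
```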
CSA + HCA – compressing the KV cache
CSA performs mild compression: every m adjacent tokens (m=4 in V4‑Pro) are summed into one compressed entry using learned softmax weights and positional bias. A lightweight Lightning Indexer then selects the top‑k most relevant compressed entries (k=1024 in V4‑Pro) for full attention.
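To make the two‑stage mechanism concrete, here is a rough sketch. It simplifies aggressively: a single learned mixing path instead of the paper's two‑path scheme (described below), no positional bias, and the Lightning Indexer is modelled as a plain dot‑product scorer; all of these simplifications are assumptions.

```python
import torch

def csa_compress_and_select(K, V, q, mix_logits, m=4, top_k=1024):
    """Sketch of CSA: merge every m adjacent KV pairs into one compressed
    entry with learned softmax weights, then keep only the top_k entries
    most relevant to the query (a stand-in for the Lightning Indexer).

    K, V:       [seq_len, d] keys / values (seq_len divisible by m here)
    q:          [d] query used for relevance scoring
    mix_logits: [m] learned intra-block mixing logits (shape assumed)
    """
    seq_len, d = K.shape
    blocks = seq_len // m
    w = torch.softmax(mix_logits, dim=0)                     # [m]
    K_c = (K.view(blocks, m, d) * w[None, :, None]).sum(1)   # [blocks, d]
    V_c = (V.view(blocks, m, d) * w[None, :, None]).sum(1)
    scores = K_c @ q                                         # cheap relevance score
    idx = scores.topk(min(top_k, blocks)).indices
    return K_c[idx], V_c[idx]  # full attention then runs over these entries
```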
HCA applies aggressive compression (m′=128) without overlap, keeping dense attention but discarding the top‑k selection, thus providing a global summary channel.
The two attention types are interleaved (first two layers HCA, then alternating CSA/HCA) to achieve both local fine‑reading and global browsing.
For each group of m tokens' KV, two compression paths C^a and C^b are first computed along with their corresponding weights Z^a and Z^b; after softmax normalisation, a weighted sum yields a single compressed entry.
mHC – manifold‑constrained hyper‑connections
Instead of a standard residual, V4 uses mHC, where the residual matrix B is projected onto the doubly stochastic Birkhoff polytope via Sinkhorn‑Knopp iterations (20 steps). This guarantees each row and column sums to 1 and all elements are non‑negative, limiting the spectral norm to ≤ 1 and preventing gradient explosion in deep stacks.
The forward pass becomes
X_{l+1} = B_l X_l + C_l \mathcal{F}_l(A_l X_l)
where the implementation exponentiates the raw \tilde{B} and then normalises it (the Sinkhorn‑Knopp projection above), while A and C are bounded via sigmoid.
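A minimal sketch of that projection, assuming the exponentiate‑then‑normalise scheme just described (the eps guard is an added numerical‑safety assumption):

```python
import torch

def sinkhorn_project(B_raw, n_iters=20, eps=1e-8):
    """Project a raw residual-mixing matrix onto the Birkhoff polytope:
    non-negative entries, every row and every column summing to 1."""
    B = torch.exp(B_raw)                  # exponentiation => strictly positive
    for _ in range(n_iters):              # 20 Sinkhorn-Knopp steps per the paper
        B = B / (B.sum(dim=1, keepdim=True) + eps)   # normalise rows
        B = B / (B.sum(dim=0, keepdim=True) + eps)   # normalise columns
    return B  # approximately doubly stochastic, so spectral norm <= 1
```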
Muon optimizer
Muon replaces element‑wise second‑moment estimation with a Newton‑Schulz iteration that orthogonalises the momentum matrix before applying it to the gradient.
Algorithm: Muon for DeepSeek‑V4
G_t = ∇_W L # gradient
M_t = μ M_{t-1} + G_t # momentum
O'_t = HybridNewtonSchulz(μ M_t + G_t) # orthogonalisation
O_t = O'_t · √max(n,m) · γ # rescale RMS
W_t = W_{t-1}(1 - ηλ) − η O_t # decay + update
The hybrid Newton‑Schulz schedule uses aggressive coefficients for the first eight steps and conservative ones for the final two, stabilising singular values near 1. Muon requires full gradient matrices, so ZeRO‑style parameter sharding is replaced by a knapsack‑based bucket allocation with ≤ 10 % padding overhead.
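A compact sketch of one Muon step under these equations. The quintic Newton‑Schulz coefficients below are the commonly published ones, not V4's hybrid schedule (whose per‑step values are not given here), and all hyper‑parameter defaults are assumptions:

```python
import torch

def newton_schulz_orthogonalize(M, steps=10):
    """Drive the singular values of M toward 1 via a quintic Newton-Schulz
    iteration. V4's 'aggressive first eight / conservative last two' schedule
    would swap in different (a, b, c) per step; exact values are not given."""
    a, b, c = 3.4445, -4.7750, 2.0315   # widely used Muon coefficients (assumed)
    X = M / (M.norm() + 1e-7)           # normalise so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(W, G, M_buf, lr=0.02, mu=0.95, weight_decay=0.01, gamma=0.2):
    """One Muon update, following the algorithm listing above."""
    M_buf.mul_(mu).add_(G)                              # M_t = mu*M_{t-1} + G_t
    O = newton_schulz_orthogonalize(mu * M_buf + G)     # Nesterov-style lookahead
    O = O * (max(W.shape) ** 0.5) * gamma               # rescale: sqrt(max(n, m)) * gamma
    W.mul_(1 - lr * weight_decay).sub_(lr * O)          # decoupled decay + update
    return W, M_buf
```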
Infrastructure: making the design runnable
V4’s infrastructure tackles the dominant bottleneck of expert parallelism (EP), all‑to‑all communication, by batching experts into waves and fusing communication with computation in a single CUDA mega‑kernel (open‑sourced as DeepGEMM). This yields 1.5–1.73× speed‑ups in generic inference and up to 1.96× in RL rollout scenarios.
TileLang, a domain‑specific language, generates fused kernels and host code at compile time, cutting kernel‑launch overhead from tens of microseconds to sub‑microsecond levels. The Z3 SMT solver assists with formal verification of vectorisation, memory hazards, and boundary conditions.
Batch‑invariant and deterministic kernels ensure that token outputs are independent of batch position and that backward passes are repeatable, which is crucial for RL training.
FP4 quantisation‑aware training (QAT) halves the memory footprint of MoE expert weights and of the CSA indexer’s QK path. De‑quantising from FP4 to FP8 is lossless because FP8’s larger exponent range fully absorbs FP4’s fine‑grained scales.
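The losslessness claim is easy to sanity‑check numerically, assuming the common E2M1 layout for FP4 and PyTorch's float8_e4m3fn dtype (the paper's exact formats are not specified above):

```python
import torch

# Positive magnitudes representable in FP4 E2M1 form a small fixed grid:
fp4_e2m1_values = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

x = torch.tensor(fp4_e2m1_values)
roundtrip = x.to(torch.float8_e4m3fn).to(torch.float32)
# Every FP4 value survives the cast unchanged: FP8 E4M3 has strictly more
# exponent and mantissa bits, so the FP4 grid is a subset of the FP8 grid.
assert torch.equal(x, roundtrip)
```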
Training pipeline
Pre‑training scales to 32 T tokens with progressive sequence‑length expansion (4K → 16K → 64K → 1M) and a shift from document‑level to sample‑level attention masking. V4‑Flash (13 B activated parameters) and V4‑Pro (49 B activated, 1.6 T total) differ in depth, hidden size, and expert count.
Stability tricks include:
Anticipatory Routing: Uses routing indices computed Δt steps earlier with the parameters cached at that point, breaking the outlier‑routing feedback loop; activated only on loss spikes, adding ~20 % wall‑time (a sketch follows this list).
SwiGLU Clamping: Clips linear component to [‑10, 10] and gate component to ≤ 10, eliminating activation outliers.
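A minimal sketch of how Anticipatory Routing could work; the class, parameter names, and Δt value are invented for illustration, and only the idea of reusing routing computed with parameters cached Δt steps earlier comes from the text:

```python
import collections
import torch

class AnticipatoryRouter:
    """Illustrative sketch (names hypothetical): when enabled, route with the
    router weights snapshotted delta_t steps ago, so current-step outliers
    cannot immediately reinforce their own routing."""

    def __init__(self, router, delta_t=4, top_k=8):
        self.router = router                       # live routing module
        self.delta_t = delta_t                     # the paper's delta-t (value assumed)
        self.top_k = top_k
        self.cache = collections.deque(maxlen=delta_t)
        self.enabled = False                       # switched on only on loss spikes

    def __call__(self, hidden):
        if self.enabled and len(self.cache) == self.delta_t:
            # Evaluate the router with parameters from delta_t steps earlier.
            stale_params = self.cache[0]
            with torch.no_grad():
                logits = torch.func.functional_call(self.router, stale_params, (hidden,))
        else:
            logits = self.router(hidden)
        # Snapshot current parameters for reuse delta_t steps from now.
        self.cache.append({k: v.detach().clone()
                           for k, v in self.router.state_dict().items()})
        return logits.topk(self.top_k, dim=-1).indices
```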
Post‑training replaces mixed SFT + RL with an Offline‑Policy‑Distillation (OPD) pipeline: specialist models for each domain (math, code, agentic, instruction‑following) are first fine‑tuned with SFT + GRPO RL, then a student model distils all specialist logits via a full‑vocabulary KL loss:
\mathcal{L}_{\text{OPD}}(\theta) = \sum_i w_i \cdot D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{E_i}\right)
Teachers are off‑loaded to distributed storage; only the final hidden state is cached, and logits are reconstructed on the fly to avoid materialising a 100K‑plus vocabulary.
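A sketch of this loss, with teacher logits rebuilt from cached final hidden states as described; the tensor layouts and mixing weights are assumptions:

```python
import torch
import torch.nn.functional as F

def opd_loss(student_logits, teacher_hiddens, teacher_heads, weights):
    """Weighted full-vocabulary KL between the student and each specialist.

    student_logits:  [batch, vocab] student outputs
    teacher_hiddens: list of [batch, d_model] cached final hidden states
    teacher_heads:   list of [vocab, d_model] teacher output projections
    weights:         list of scalars w_i (per-domain mixing weights, assumed)
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    loss = student_logits.new_zeros(())
    for h, head, w in zip(teacher_hiddens, teacher_heads, weights):
        teacher_logits = h @ head.t()            # reconstruct logits on the fly
        log_p_teacher = F.log_softmax(teacher_logits, dim=-1)
        # KL(pi_theta || pi_E_i), matching the formula's argument order.
        kl = (log_p_student.exp() * (log_p_student - log_p_teacher)).sum(-1)
        loss = loss + w * kl.mean()
    return loss
```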
Performance and benchmarks
SimpleQA‑Verified 57.9 % (≈ 20 pp above all open‑source models).
Codeforces 3206 (human rank 23), matching GPT‑5.4.
HMMT 2026 Feb 95.2, IMOAnswerBench 89.8, Apex Shortlist 90.2.
PutnamBench full 120/120, on par with Axiom.
1M MRCR: retains a score of 0.59 at the full 1024K context, remaining fully stable across the window.
Agent benchmarks: Terminal Bench 2.0 67.9, SWE‑bench Verified 80.6, BrowseComp 83.4.
Chinese writing tasks: a 62.7 % win‑rate in head‑to‑head comparison against Gemini‑3.1‑Pro’s 34.1 %; V4‑Pro‑Max shows a clear advantage in functional and creative writing.
Limitations and open questions
The paper lists three main limitations:
Architectural complexity – many intertwined tricks may hinder reproducibility.
Unexplained mechanisms – Anticipatory Routing and SwiGLU Clamping lack theoretical justification.
Remaining performance gap – reasoning benchmarks still lag behind top‑tier closed‑source models by 3–6 months.
Open research directions include:
Whether CSA top‑k = 1024 suffices for all tasks.
Cost of 20‑step Sinkhorn normalisation in mHC at larger scales.
Stability boundaries of FP4 QAT during full pre‑training.
Potential loss of exploration when replacing RL with OPD.
Impact of the Δt hyper‑parameter in Anticipatory Routing.
Conclusion
DeepSeek‑V4’s most lasting contribution is the suite of modular components—CSA + HCA for long‑context efficiency, mHC for stable residuals, Muon for faster convergence, MegaMoE and TileLang for scalable infrastructure, and OPD as an alternative to mixed‑objective RL. By open‑sourcing both the paper and the implementation, DeepSeek provides a rare, reproducible foundation for the next wave of million‑token LLM applications.