DeepSeek‑V4 Deep Dive: Engineering Million‑Token Context Efficiency
The article provides a thorough technical analysis of DeepSeek‑V4, detailing how mixed sparse attention (CSA + HCA), manifold‑constrained hyper‑connections, the Muon optimizer, FP4 quantization, and a suite of infrastructure tricks enable stable training and inference with up to one‑million token contexts while achieving state‑of‑the‑art benchmark results.
Architecture: three upgrades over V3
DeepSeek‑V4 retains the Transformer + DeepSeekMoE + MTP backbone but introduces three key changes:
Attention: Replaces V3’s MLA/DSA with a hybrid of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA).
Residual connection: Swaps the standard residual for manifold‑constrained hyper‑connections (mHC).
Optimizer: Replaces AdamW with the Muon optimizer (embedding and head still use AdamW).
MoE routing: Moves from sigmoid + top‑k gating to sqrt(softplus) gating with no node‑routing limit (see the sketch after this list).
Early FFN layers: Change from dense to MoE + hash routing.
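The gating change is easy to picture in code. Below is a minimal sketch, assuming a plain linear router and a hypothetical top‑k of 8; only the sqrt(softplus) scoring and the absence of a node‑routing limit come from the article itself.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weight, top_k=8):
    """Hypothetical V4-style MoE routing: sqrt(softplus) scores, plain top-k.

    hidden:        [num_tokens, d_model] token representations
    router_weight: [num_experts, d_model] router projection (name assumed)
    """
    logits = hidden @ router_weight.t()          # [num_tokens, num_experts]
    # sqrt(softplus) keeps scores positive and unbounded above,
    # unlike V3's saturating sigmoid gate.
    scores = torch.sqrt(F.softplus(logits))
    top_scores, top_idx = scores.topk(top_k, dim=-1)
    # Normalise the selected gates into mixing weights.
    gates = top_scores / top_scores.sum(dim=-1, keepdim=True)
    # No node-routing limit: experts are taken exactly as ranked.
    return top_idx, gates
```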
CSA + HCA – compressing the KV cache
CSA performs mild compression: every m adjacent tokens (m=4 in V4‑Pro) are summed into one compressed entry using learned softmax weights and positional bias. A lightweight Lightning Indexer then selects the top‑k most relevant compressed entries (k=1024 in V4‑Pro) for full attention.
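To make the two‑stage mechanism concrete, here is a rough sketch. It simplifies aggressively: a single learned mixing path instead of the paper's two‑path scheme (described below), no positional bias, and the Lightning Indexer is modelled as a plain dot‑product scorer; all of these simplifications are assumptions.

```python
import torch

def csa_compress_and_select(K, V, q, mix_logits, m=4, top_k=1024):
    """Sketch of CSA: merge every m adjacent KV pairs into one compressed
    entry with learned softmax weights, then keep only the top_k entries
    most relevant to the query (a stand-in for the Lightning Indexer).

    K, V:       [seq_len, d] keys / values (seq_len divisible by m here)
    q:          [d] query used for relevance scoring
    mix_logits: [m] learned intra-block mixing logits (shape assumed)
    """
    seq_len, d = K.shape
    blocks = seq_len // m
    w = torch.softmax(mix_logits, dim=0)                     # [m]
    K_c = (K.view(blocks, m, d) * w[None, :, None]).sum(1)   # [blocks, d]
    V_c = (V.view(blocks, m, d) * w[None, :, None]).sum(1)
    scores = K_c @ q                                         # cheap relevance score
    idx = scores.topk(min(top_k, blocks)).indices
    return K_c[idx], V_c[idx]  # full attention then runs over these entries
```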
HCA applies aggressive compression (m′=128) without overlap, keeping dense attention but discarding the top‑k selection, thus providing a global summary channel.
The two attention types are interleaved (first two layers HCA, then alternating CSA/HCA) to achieve both local fine‑reading and global browsing.
For each group of m tokens' KV, two compression paths C^a and C^b are first computed along with their corresponding weights Z^a and Z^b; after softmax normalisation, a weighted sum yields a single compressed entry.
mHC – manifold‑constrained hyper‑connections
Instead of a standard residual, V4 uses mHC, where the residual matrix B is projected onto the doubly stochastic Birkhoff polytope via Sinkhorn‑Knopp iterations (20 steps). This guarantees each row and column sums to 1 and all elements are non‑negative, limiting the spectral norm to ≤ 1 and preventing gradient explosion in deep stacks.
The forward pass becomes
X_{l+1} = B_l X_l + C_l \mathcal{F}_l(A_l X_l)
where the implementation exponentiates the raw \tilde{B} and then normalises it (the Sinkhorn‑Knopp projection above), while A and C are bounded via sigmoid.
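A minimal sketch of that projection, assuming the exponentiate‑then‑normalise scheme just described (the eps guard is an added numerical‑safety assumption):

```python
import torch

def sinkhorn_project(B_raw, n_iters=20, eps=1e-8):
    """Project a raw residual-mixing matrix onto the Birkhoff polytope:
    non-negative entries, every row and every column summing to 1."""
    B = torch.exp(B_raw)                  # exponentiation => strictly positive
    for _ in range(n_iters):              # 20 Sinkhorn-Knopp steps per the paper
        B = B / (B.sum(dim=1, keepdim=True) + eps)   # normalise rows
        B = B / (B.sum(dim=0, keepdim=True) + eps)   # normalise columns
    return B  # approximately doubly stochastic, so spectral norm <= 1
```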
Muon optimizer
Muon replaces element‑wise second‑moment estimation with a Newton‑Schulz iteration that orthogonalises the momentum matrix before applying it to the gradient.
Algorithm: Muon for DeepSeek‑V4
G_t = ∇_W L # gradient
M_t = μ M_{t-1} + G_t # momentum
O'_t = HybridNewtonSchulz(μ M_t + G_t) # orthogonalisation
O_t = O'_t · √max(n,m) · γ # rescale RMS
W_t = W_{t-1}(1 - ηλ) − η O_t # decay + update
The hybrid Newton‑Schulz schedule uses aggressive coefficients for the first eight steps and conservative ones for the final two, stabilising singular values near 1. Muon requires full gradient matrices, so ZeRO‑style parameter sharding is replaced by a knapsack‑based bucket allocation with ≤ 10 % padding overhead.
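A compact sketch of one Muon step under these equations. The quintic Newton‑Schulz coefficients below are the commonly published ones, not V4's hybrid schedule (whose per‑step values are not given here), and all hyper‑parameter defaults are assumptions:

```python
import torch

def newton_schulz_orthogonalize(M, steps=10):
    """Drive the singular values of M toward 1 via a quintic Newton-Schulz
    iteration. V4's 'aggressive first eight / conservative last two' schedule
    would swap in different (a, b, c) per step; exact values are not given."""
    a, b, c = 3.4445, -4.7750, 2.0315   # widely used Muon coefficients (assumed)
    X = M / (M.norm() + 1e-7)           # normalise so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(W, G, M_buf, lr=0.02, mu=0.95, weight_decay=0.01, gamma=0.2):
    """One Muon update, following the algorithm listing above."""
    M_buf.mul_(mu).add_(G)                              # M_t = mu*M_{t-1} + G_t
    O = newton_schulz_orthogonalize(mu * M_buf + G)     # Nesterov-style lookahead
    O = O * (max(W.shape) ** 0.5) * gamma               # rescale: sqrt(max(n, m)) * gamma
    W.mul_(1 - lr * weight_decay).sub_(lr * O)          # decoupled decay + update
    return W, M_buf
```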
Infrastructure: making the design runnable
V4’s infrastructure tackles the dominant bottleneck of expert parallelism (EP), all‑to‑all communication, by batching experts into waves and fusing communication with computation in a single CUDA mega‑kernel (open‑sourced as DeepGEMM). This yields 1.5–1.73× speed‑ups in generic inference and up to 1.96× in RL rollout scenarios.
TileLang, a domain‑specific language, generates fused kernels and host code at compile time, cutting kernel‑launch overhead from tens of microseconds to sub‑microsecond levels. The Z3 SMT solver assists with formal verification of vectorisation, memory hazards, and boundary conditions.
Batch‑invariant and deterministic kernels ensure that token outputs are independent of batch position and that backward passes are repeatable, which is crucial for RL training.
FP4 quantisation‑aware training (QAT) halves the memory footprint of MoE expert weights and of the CSA indexer’s QK path. De‑quantising from FP4 to FP8 is lossless because FP8’s larger exponent range fully absorbs FP4’s fine‑grained scales.
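The losslessness claim is easy to sanity‑check numerically, assuming the common E2M1 layout for FP4 and PyTorch's float8_e4m3fn dtype (the paper's exact formats are not specified above):

```python
import torch

# Positive magnitudes representable in FP4 E2M1 form a small fixed grid:
fp4_e2m1_values = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

x = torch.tensor(fp4_e2m1_values)
roundtrip = x.to(torch.float8_e4m3fn).to(torch.float32)
# Every FP4 value survives the cast unchanged: FP8 E4M3 has strictly more
# exponent and mantissa bits, so the FP4 grid is a subset of the FP8 grid.
assert torch.equal(x, roundtrip)
```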
Training pipeline
Pre‑training scales to 32 T tokens with progressive sequence‑length expansion (4K → 16K → 64K → 1M) and a shift from document‑level to sample‑level attention masking. V4‑Flash (13 B activated parameters) and V4‑Pro (49 B activated, 1.6 T total) differ in depth, hidden size, and expert count.
Stability tricks include:
Anticipatory Routing: Uses routing indices computed Δt steps earlier with the parameters cached at that point, breaking the outlier‑routing feedback loop; activated only on loss spikes, adding ~20 % wall‑time (a sketch follows this list).
SwiGLU Clamping: Clips linear component to [‑10, 10] and gate component to ≤ 10, eliminating activation outliers.
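A minimal sketch of how Anticipatory Routing could work; the class, parameter names, and Δt value are invented for illustration, and only the idea of reusing routing computed with parameters cached Δt steps earlier comes from the text:

```python
import collections
import torch

class AnticipatoryRouter:
    """Illustrative sketch (names hypothetical): when enabled, route with the
    router weights snapshotted delta_t steps ago, so current-step outliers
    cannot immediately reinforce their own routing."""

    def __init__(self, router, delta_t=4, top_k=8):
        self.router = router                       # live routing module
        self.delta_t = delta_t                     # the paper's delta-t (value assumed)
        self.top_k = top_k
        self.cache = collections.deque(maxlen=delta_t)
        self.enabled = False                       # switched on only on loss spikes

    def __call__(self, hidden):
        if self.enabled and len(self.cache) == self.delta_t:
            # Evaluate the router with parameters from delta_t steps earlier.
            stale_params = self.cache[0]
            with torch.no_grad():
                logits = torch.func.functional_call(self.router, stale_params, (hidden,))
        else:
            logits = self.router(hidden)
        # Snapshot current parameters for reuse delta_t steps from now.
        self.cache.append({k: v.detach().clone()
                           for k, v in self.router.state_dict().items()})
        return logits.topk(self.top_k, dim=-1).indices
```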
Post‑training replaces mixed SFT + RL with an Offline‑Policy‑Distillation (OPD) pipeline: specialist models for each domain (math, code, agentic, instruction‑following) are first fine‑tuned with SFT + GRPO RL, then a student model distils all specialist logits via a full‑vocabulary KL loss:
\mathcal{L}_{\text{OPD}}(\theta) = \sum_i w_i \cdot D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{E_i}\right)
Teachers are off‑loaded to distributed storage; only the final hidden state is cached, and logits are reconstructed on the fly to avoid materialising a 100K‑plus vocabulary.
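A sketch of this loss, with teacher logits rebuilt from cached final hidden states as described; the tensor layouts and mixing weights are assumptions:

```python
import torch
import torch.nn.functional as F

def opd_loss(student_logits, teacher_hiddens, teacher_heads, weights):
    """Weighted full-vocabulary KL between the student and each specialist.

    student_logits:  [batch, vocab] student outputs
    teacher_hiddens: list of [batch, d_model] cached final hidden states
    teacher_heads:   list of [vocab, d_model] teacher output projections
    weights:         list of scalars w_i (per-domain mixing weights, assumed)
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    loss = student_logits.new_zeros(())
    for h, head, w in zip(teacher_hiddens, teacher_heads, weights):
        teacher_logits = h @ head.t()            # reconstruct logits on the fly
        log_p_teacher = F.log_softmax(teacher_logits, dim=-1)
        # KL(pi_theta || pi_E_i), matching the formula's argument order.
        kl = (log_p_student.exp() * (log_p_student - log_p_teacher)).sum(-1)
        loss = loss + w * kl.mean()
    return loss
```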
Performance and benchmarks
SimpleQA‑Verified 57.9 % (≈ 20 pp above all open‑source models).
Codeforces 3206 (human rank 23), matching GPT‑5.4.
HMMT 2026 Feb 95.2, IMOAnswerBench 89.8, Apex Shortlist 90.2.
PutnamBench full 120/120, on par with Axiom.
1M MRCR: retains a score of 0.59 at the full 1024K context, remaining fully stable across the window.
Agent benchmarks: Terminal Bench 2.0 67.9, SWE‑bench Verified 80.6, BrowseComp 83.4.
Chinese writing tasks: a 62.7 % win‑rate in head‑to‑head comparison against Gemini‑3.1‑Pro’s 34.1 %; V4‑Pro‑Max shows a clear advantage in functional and creative writing.
Limitations and open questions
The paper lists three main limitations:
Architectural complexity – many intertwined tricks may hinder reproducibility.
Unexplained mechanisms – Anticipatory Routing and SwiGLU Clamping lack theoretical justification.
Remaining performance gap – reasoning benchmarks still lag behind top‑tier closed‑source models by 3–6 months.
Open research directions include:
Whether CSA top‑k = 1024 suffices for all tasks.
Cost of 20‑step Sinkhorn normalisation in mHC at larger scales.
Stability boundaries of FP4 QAT during full pre‑training.
Potential loss of exploration when replacing RL with OPD.
Impact of the Δt hyper‑parameter in Anticipatory Routing.
Conclusion
DeepSeek‑V4’s most lasting contribution is the suite of modular components—CSA + HCA for long‑context efficiency, mHC for stable residuals, Muon for faster convergence, MegaMoE and TileLang for scalable infrastructure, and OPD as an alternative to mixed‑objective RL. By open‑sourcing both the paper and the implementation, DeepSeek provides a rare, reproducible foundation for the next wave of million‑token LLM applications.