DeepSeek V4: Comeback? 1.6 T Params, Million‑Token Context, Open‑Source Matches Closed‑Source

DeepSeek V4, released shortly after GPT‑5.5, ships two models: V4‑Pro (1.6 T parameters) and V4‑Flash (284 B parameters). A hybrid CSA/HCA attention architecture gives them an efficient million‑token context with dramatic FLOPs and KV‑cache savings; they post competitive programming and agent benchmark results, arrive with a disruptive pricing strategy, and openly document their training‑stability tricks, while the report stays candid about both strengths and remaining gaps.

1. Two Models, Two Positions

DeepSeek V4 launched two variants with distinct goals. V4‑Pro has 1.6 T total parameters (49 B activated), 61 Transformer layers, and is currently the largest open‑weight model, surpassing Kimi K2.6 (1.1 T) and GLM‑5.1 (754 B). V4‑Flash uses 284 B total parameters (13 B activated) and 43 layers, targeting a high cost‑performance ratio while still outperforming the V3.2‑Base baseline on many benchmarks. Both models are released under the MIT license with weights on HuggingFace.

2. CSA + HCA: The Secret of Million‑Token Context

Traditional Transformers incur quadratic attention cost and linear KV‑cache growth with sequence length, making million‑token contexts infeasible. DeepSeek redesigns the attention stack with two complementary mechanisms alternated across layers: the first two layers use pure sliding‑window attention or HCA, and later layers interleave CSA and HCA.

Compressed Sparse Attention (CSA)

Compression : Every m = 4 tokens are compressed into one entry using two learnable compression matrices C_a and C_b plus a learnable positional bias. Overlap is introduced so that each compressed entry covers 2m tokens, half of which are shared with the previous entry. V4‑Pro thus achieves a 4:1 compression ratio.

Sparse selection : A lightweight Lightning Indexer projects each query into a low‑dimensional space of size d_c, computes dot‑products with all compressed blocks, applies ReLU (instead of Softmax), and selects the top‑k blocks (k = 1024 for Pro, 512 for Flash).

Local supplement : An uncompressed sliding‑window KV of 128 tokens is kept for each layer to preserve fine‑grained local dependencies; this window KV is concatenated with the compressed KV before the core attention computation.

The key insight is that the sparsity pattern is trainable: the model learns which positions need dense attention and which can be safely compressed.
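Putting the three stages together, the flow is easy to sketch. The toy code below is a minimal single‑head illustration under loudly stated assumptions: the exact forms of C_a, C_b, the positional bias, and the Lightning Indexer projection W_idx are stand‑ins, the shared‑KV simplification collapses keys and values into one stream, and all sizes are tiny (the report's k is 1024 for Pro, 512 for Flash).

```python
import torch
import torch.nn.functional as F

d, d_c, m, top_k, window = 64, 16, 4, 8, 16   # toy sizes, not report values

def compress_kv(kv, C_a, C_b, pos_bias):
    """Compress overlapping 2m-token windows (stride m) into single entries."""
    chunks = kv.unfold(0, 2 * m, m).transpose(1, 2)      # [n_blocks, 2m, d]
    # Two learnable compression matrices plus a learnable positional bias.
    return (chunks @ C_a).mean(dim=1) @ C_b + pos_bias   # [n_blocks, d]

def csa_attend(q, kv, C_a, C_b, pos_bias, W_idx):
    blocks = compress_kv(kv, C_a, C_b, pos_bias)
    # Lightning Indexer: low-dimensional projection, ReLU scores, no softmax.
    scores = F.relu((q @ W_idx) @ (blocks @ W_idx).T)    # [n_blocks]
    idx = scores.topk(min(top_k, scores.numel())).indices
    # Local supplement: concatenate a short uncompressed sliding window.
    keys = torch.cat([blocks[idx], kv[-window:]], dim=0)
    attn = F.softmax(q @ keys.T / d ** 0.5, dim=-1)
    return attn @ keys                    # shared KV: keys double as values

q, kv = torch.randn(d), torch.randn(256, d)
C_a, C_b = torch.randn(d, d) / d ** 0.5, torch.randn(d, d) / d ** 0.5
out = csa_attend(q, kv, C_a, C_b, torch.zeros(d), torch.randn(d, d_c) / d ** 0.5)
```

Note that top‑k selection itself is non‑differentiable, so a real implementation needs a separate training signal for the indexer; the report's trainable‑sparsity claim implies such a mechanism without our sketch showing it.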

Heavily Compressed Attention (HCA)

HCA pursues an even more aggressive compression (128 tokens → 1 entry, 32× the CSA compression rate) and skips the sparse‑selection step, applying dense attention over the compressed sequence. This provides a global view for layers that need full‑sequence context, complementing CSA’s efficiency.
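Under the same toy conventions, HCA is even simpler; mean‑pooling is used below as a placeholder compressor, since the text only specifies the 128:1 ratio and the absence of a selection step.

```python
import torch
import torch.nn.functional as F

def hca_attend(q, kv, block=128):
    """Dense attention over a heavily compressed (128:1) sequence."""
    T, d = kv.shape
    blocks = kv[: T - T % block].reshape(-1, block, d).mean(dim=1)  # [T//128, d]
    attn = F.softmax(q @ blocks.T / d ** 0.5, dim=-1)   # dense, no top-k step
    return attn @ blocks     # a global view at 1/128 of the original length

out = hca_attend(torch.randn(64), torch.randn(4096, 64))
```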

Both CSA and HCA share a Shared Key‑Value Multi‑Query Attention (MQA) design and a Grouped Output Projection that splits 128 heads into 16 groups, reduces dimensionality per group, and then merges the results.
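The report gives the head and group counts but not the projection sizes; a plausible sketch of the grouped output projection, with assumed dimensions, looks like this:

```python
import torch

heads, groups, d_head, d_model = 128, 16, 64, 8192    # d_head/d_model assumed
per_group = heads // groups                            # 8 heads per group
W_g = torch.randn(groups, per_group * d_head, d_model // groups) * 0.02

def grouped_output_proj(head_out):
    # head_out: [heads, d_head] -> regroup into [groups, per_group * d_head]
    grouped = head_out.reshape(groups, per_group * d_head)
    # One reduced-dimensionality projection per group, then merge the results.
    return torch.einsum('gi,gio->go', grouped, W_g).reshape(-1)   # [d_model]

y = grouped_output_proj(torch.randn(heads, d_head))
```

Versus one full heads·d_head × d_model projection, this cuts the output‑projection parameters by a factor of the group count (67 M → 4.2 M at these assumed sizes), which is presumably the motivation.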

Performance impact reported in the technical report:

V4‑Pro token‑wise inference FLOPs ≈ 27 % of V3.2.

KV cache size ≈ 10 % of V3.2.

V4‑Flash FLOPs ≈ 10 % and KV cache ≈ 7 % of V3.2.

Compared with a BF16 GQA‑8 baseline, V4’s KV cache at 1 M context is only ~2 % of the baseline.
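A back‑of‑envelope check on that last figure: with guessed values for the GQA‑8 baseline (head dim, BF16) and a naïve reading of V4's layout (one shared K=V stream, 4:1 compression, a 128‑token local window per layer), the arithmetic lands in the same ballpark as the reported ~2 %.

```python
ctx, layers, kv_heads, d_head, bf16 = 1_000_000, 61, 8, 128, 2   # assumptions

# Baseline: BF16 GQA-8 with separate K and V caches.
baseline = ctx * layers * kv_heads * d_head * 2 * bf16           # bytes
# V4-style: one shared K=V stream, 4:1 CSA compression, plus a
# 128-token uncompressed local window per layer, still BF16.
v4 = (ctx // 4 + 128) * layers * d_head * bf16

print(f"baseline: {baseline / 2**30:.0f} GiB")                   # ~233 GiB
print(f"V4-style: {v4 / 2**30:.1f} GiB (~{v4 / baseline:.1%})")  # ~3.6 GiB, ~1.6%
```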

3. mHC and Muon: Training Stability

Training a 1.6 T model requires additional stability mechanisms.

mHC (Manifold‑Constrained Hyper‑Connection)

Standard residual connections can cause signal explosion or vanishing. DeepSeek constrains the residual mapping matrix to the Birkhoff polytope (the set of doubly‑stochastic matrices) using the Sinkhorn‑Knopp algorithm (20 iterations). This guarantees a spectral norm ≤ 1, making the mapping non‑expansive. The matrix is generated as a dynamic component (input‑dependent linear transform) plus a static bias, with Sigmoid bounds to keep values non‑negative.

In practice, mHC adds only ~6.7 % overhead to the pipeline stage time.
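The projection itself is a few lines. Below is a minimal sketch: the 20 iterations and the Sigmoid bound come from the report, while the shapes and the way the dynamic and static parts combine are illustrative assumptions.

```python
import torch

def sinkhorn_knopp(logits, iters=20):
    """Alternately normalize rows and columns toward a doubly stochastic matrix."""
    M = torch.sigmoid(logits)                 # Sigmoid keeps entries non-negative
    for _ in range(iters):
        M = M / M.sum(dim=1, keepdim=True)    # rows sum to 1
        M = M / M.sum(dim=0, keepdim=True)    # columns sum to 1
    return M                  # ~doubly stochastic, hence spectral norm <= 1

# Dynamic (input-dependent) component plus a static bias, then the projection.
x, W, bias = torch.randn(16), torch.randn(4 * 4, 16) * 0.1, torch.zeros(4, 4)
residual_map = sinkhorn_knopp((W @ x).reshape(4, 4) + bias)
```

Non‑expansiveness follows from Birkhoff's theorem: a doubly stochastic matrix is a convex combination of permutation matrices, each of norm 1.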

Muon Optimizer

Muon replaces AdamW in almost all modules; only the embedding, the prediction head, the static bias of mHC, and the RMSNorm weights retain AdamW. Muon updates parameters via 10 rounds of mixed Newton‑Schulz iterations: the first 8 rounds use coefficients (3.4445, ‑4.7750, 2.0315) for rapid convergence, and the final 2 rounds switch to (2, ‑1.5, 0.5) for a more stable final refinement. The report claims faster convergence and better stability in large‑scale training.
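The orthogonalization step is easy to sketch; the coefficient schedule below is the one the report describes, while the momentum and learning‑rate plumbing around it is omitted.

```python
import torch

def newton_schulz(G):
    """Approximately orthogonalize a gradient matrix via Newton-Schulz iterations."""
    X = G / (G.norm() + 1e-7)                 # normalize so the iteration converges
    coeffs = [(3.4445, -4.7750, 2.0315)] * 8 + [(2.0, -1.5, 0.5)] * 2
    for a, b, c in coeffs:                    # 8 fast rounds, then 2 stabilizing rounds
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # quintic polynomial update
    return X

update = newton_schulz(torch.randn(256, 128))   # applied per weight matrix
```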

Muon is combined with ZeRO’s knapsack allocation for dense parameters and expert‑wise partitioning for MoE parameters, avoiding the conflict between Muon’s need for full gradients and ZeRO’s sharding.

Additional Stability Tricks

Anticipatory Routing : At step t, routing indices are computed with parameters from step t‑Δt, breaking the feedback loop between router and backbone. Overhead ≈ 20 % but can be dynamically disabled when loss spikes are absent.
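One plausible wiring for this trick, as a sketch: routing indices come from a frozen snapshot of the router refreshed every Δt steps, while the live router still supplies differentiable gate weights. The refresh interval and gating details below are assumptions.

```python
import copy
import torch

class AnticipatoryRouter(torch.nn.Module):
    def __init__(self, d_model, n_experts, delta_t=100):   # delta_t assumed
        super().__init__()
        self.gate = torch.nn.Linear(d_model, n_experts, bias=False)
        self.stale = copy.deepcopy(self.gate).requires_grad_(False)
        self.delta_t, self.step = delta_t, 0

    def forward(self, x, k=2):
        self.step += 1
        if self.step % self.delta_t == 0:      # periodically refresh the snapshot
            self.stale.load_state_dict(self.gate.state_dict())
        idx = self.stale(x).topk(k, dim=-1).indices          # stale index selection
        weights = self.gate(x).gather(-1, idx).softmax(-1)   # fresh, differentiable gates
        return idx, weights

idx, w = AnticipatoryRouter(64, 8)(torch.randn(4, 64))
```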

SwiGLU Clamping : The linear branch of SwiGLU is clamped to [-10, 10] and the gated branch is capped at 10, suppressing activation outliers in MoE layers.
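Read literally, the clamping can be sketched as follows; exactly where the caps sit relative to the activation is our guess.

```python
import torch
import torch.nn.functional as F

def clamped_swiglu(x, W_gate, W_lin, W_out):
    gate = F.silu(x @ W_gate).clamp(max=10.0)   # cap the gated branch at 10
    lin = (x @ W_lin).clamp(-10.0, 10.0)        # bound the linear branch
    return (gate * lin) @ W_out                 # products can no longer blow up

d, d_ff = 64, 256
y = clamped_swiglu(torch.randn(8, d), torch.randn(d, d_ff),
                   torch.randn(d, d_ff), torch.randn(d_ff, d))
```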

4. Performance Evaluation

Programming and Math Reasoning

Key benchmark scores (higher is better):

Codeforces: V4‑Pro 3206 (roughly 23rd place among human competitors), GPT‑5.4 3168, Gemini 3.1 Pro 3052, V4‑Flash 3052 (matching Gemini).

Apex Shortlist: V4‑Pro 90.2 vs GPT‑5.4 78.1 (12‑point lead).

LiveCodeBench: V4‑Pro 93.5 (no closed‑source comparison provided).

HMMT Feb 2026: V4‑Flash 97.7 (best among listed models).

IMOAnswerBench: V4‑Pro 89.8 vs GPT‑5.4 91.4 (GPT ahead).

Agent Capabilities

SWE Verified: V4‑Pro 80.6, Claude Opus 4.6 80.8, Gemini 3.1 Pro 80.6.

Terminal Bench 2.0: V4‑Pro 67.9, Claude Opus 65.4, Gemini 3.1 Pro 68.5.

Toolathlon (tool‑calling generalisation): V4‑Pro 51.8, Claude Opus 47.2, Gemini 3.1 Pro 48.8.

MCPAtlas: V4‑Pro 73.6, Claude Opus 73.8, Gemini 3.1 Pro 69.2.

BrowseComp: V4‑Pro 83.4, Claude Opus 83.7, Gemini 3.1 Pro 85.9.

Internal R&D coding tests report a 67 % pass rate for V4‑Pro‑Max, exceeding Claude Sonnet 4.5 (47 %) and approaching Claude Opus 4.5 (70 %).

Knowledge & Scientific Reasoning

SimpleQA: V4‑Pro 57.9 vs GPT‑5.4 45.3, Gemini 3.1 Pro 75.6.

HLE (Humanity's Last Exam): V4‑Pro 37.7, GPT‑5.4 39.8, Gemini 3.1 Pro 44.4.

MMLU‑Pro: V4‑Pro 87.5, GPT‑5.4 87.5, Gemini 3.1 Pro 91.0.

GPQA Diamond: V4‑Pro 90.1, GPT‑5.4 93.0, Gemini 3.1 Pro 94.3.

The report admits a 3‑6 month gap in knowledge and scientific reasoning compared with the state‑of‑the‑art closed‑source models.

Long‑Context Performance

MRCR 1M (information retrieval): V4‑Pro 83.5, Claude Opus 92.9, Gemini 3.1 Pro 76.3.

CorpusQA 1M (long‑document QA): V4‑Pro 62.0, Claude Opus 71.7, Gemini 3.1 Pro 53.8.

MRCR remains stable up to 128 K tokens, degrades gradually beyond that, but stays usable at 1 M tokens. Independent testing by Huxiu (虎嗅) suggests the effective context may drop to ~100 K tokens in real‑world usage.

5. Agent Positioning and Pricing

DeepSeek positions V4 as a base model for the emerging Agent era rather than as an application ecosystem. Internally, V4 is the primary model for agentic coding and is rated as delivering a better user experience than Claude Sonnet 4.5. A survey of 85 engineers showed 52 % would adopt V4‑Pro as their default coding model.

Training incorporates On‑Policy Distillation (OPD) : domain‑specific expert models (math, coding, agent, instruction) are first trained with SFT plus RL (GRPO); a unified student model then learns from the experts' trajectories via KL minimisation. Over ten teacher models are used.
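At its core, the OPD objective is a per‑token KL between student and teacher distributions over tokens the student itself generated. A minimal sketch (the KL direction is our assumption):

```python
import torch
import torch.nn.functional as F

def opd_loss(student_logits, teacher_logits):
    """KL(student || teacher), averaged over the student's own sampled trajectory."""
    log_p = F.log_softmax(student_logits, dim=-1)        # student
    log_q = F.log_softmax(teacher_logits, dim=-1)        # domain-expert teacher
    return (log_p.exp() * (log_p - log_q)).sum(-1).mean()

loss = opd_loss(torch.randn(32, 50_000), torch.randn(32, 50_000))
```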

Three inference modes are offered:

Non‑think : fast response.

Think High : explicit reasoning chain.

Think Max : system prompt forces exhaustive reasoning, edge‑case handling, and adversarial testing.

Pricing (RMB per million tokens):

Input cache hit: Flash 0.2 ¥, Pro 1 ¥.

Input cache miss: Flash 1 ¥, Pro 12 ¥.

Output: Flash 2 ¥, Pro 24 ¥.

By contrast, GPT‑5.5 output costs ~218 ¥/M tokens, making Flash's output price roughly 1 % of GPT‑5.5's and placing Pro at the price floor among comparable models.

API usage is unchanged; only the model parameter switches between "deepseek‑v4‑pro" and "deepseek‑v4‑flash".
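Assuming DeepSeek keeps the OpenAI‑compatible interface its API has used until now (the base URL and call shape below are that assumption; only the model names come from the release), switching models really is a one‑parameter change:

```python
from openai import OpenAI

client = OpenAI(api_key="...", base_url="https://api.deepseek.com")
resp = client.chat.completions.create(
    model="deepseek-v4-flash",    # or "deepseek-v4-pro"
    messages=[{"role": "user", "content": "Summarize this diff."}],
)
print(resp.choices[0].message.content)
```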

6. Objective Assessment: Strengths and Weaknesses

Strengths

Massive‑context efficiency via CSA/HCA – a paradigm shift rather than incremental scaling.

Programming performance that matches or surpasses closed‑source models (the 3206 Codeforces rating is a historic first for an open‑source model).

Fully MIT‑licensed, weights directly downloadable from HuggingFace.

Transparent training tricks (Anticipatory Routing, SwiGLU Clamping) released for community exploration.

Weaknesses

Knowledge and scientific reasoning still lag 3‑6 months behind top closed‑source models.

Effective long‑context length in practice appears limited to ~100 K tokens.

No native multimodal (image/audio) support.

Pro version’s throughput is limited by hardware constraints, making large‑scale agent deployments challenging.

Integration issues with Claude Code reported by some users.

7. Third‑Party Perspectives

Vals AI's Vibe Code Benchmark ranks V4 first among open‑weight models, beating Gemini 3.1 Pro and delivering a ~10× jump over V3.2.

Arena.ai places V4‑Pro third among open‑source models in code‑arena and 14th overall.

Community feedback is mixed: some users feel Flash does not surpass V3.2, while others praise its 99 % cost advantage over Opus 4.7.

Overall, DeepSeek V4 delivers a genuine breakthrough in long‑context efficiency and brings open‑source programming capability to parity with leading closed‑source models, but it still trails in knowledge breadth, scientific reasoning, and multimodal support.

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.