DeepSeek V4 Unveiled: 1M‑Token Context and New Architecture Challenge Closed‑Source LLMs
DeepSeek V4 introduces two flagship models—V4‑Pro with 1.6 T parameters and V4‑Flash with 284 B parameters—offering million‑token context, mixed attention (CSA + HCA), manifold‑constrained residuals, and the Muon optimizer, delivering open‑source performance that rivals top closed‑source LLMs while cutting inference cost dramatically.
Model variants
The DeepSeek V4 series comprises two models: V4‑Pro, with 1.6 T total parameters and 49 B activated parameters, and V4‑Flash, with 284 B total parameters and 13 B activated parameters.
Performance benchmarks
Agentic coding ability surpasses Sonnet 4.5 and approaches Opus 4.6.
World‑knowledge scores narrow the gap to Gemini‑Pro‑3.1.
Logical reasoning on mathematics, STEM, and competitive coding outperforms other open‑source models.
Core architectural innovations
Mixed attention (CSA + HCA): Compressed Sparse Attention (CSA) compresses the KV cache by a factor of 4, merging every four tokens into one entry and applying DSA sparse attention; a Lightning Indexer scores the compressed entries in FP4 precision and selects the top 1024 per query token. Hierarchical Compressed Attention (HCA) compresses further, by 128×, without sparse selection, preserving a global view. This “long‑short” strategy reduces compute and memory for 1 M‑token contexts.
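To make the CSA mechanics concrete, here is a minimal PyTorch sketch of the two steps: block compression of the KV cache and indexer‑based top‑k selection. The function names, mean‑pooling merge, and full‑precision scoring are illustrative assumptions; the production indexer runs in FP4.

```python
import torch

def compress_kv(k, v, block=4):
    """Merge every `block` consecutive tokens into one compressed entry.
    Mean pooling is an assumption; the report may use a learned merge."""
    t = (k.shape[0] // block) * block           # drop any ragged tail block
    k_c = k[:t].reshape(-1, block, k.shape[-1]).mean(dim=1)
    v_c = v[:t].reshape(-1, block, v.shape[-1]).mean(dim=1)
    return k_c, v_c

def lightning_index(q, k_c, top_k=1024):
    """Score every compressed entry against each query token and keep
    the top_k indices, mirroring the Lightning Indexer's role."""
    scores = q @ k_c.T                          # (T, C) similarity scores
    k_eff = min(top_k, k_c.shape[0])            # short contexts keep everything
    return scores.topk(k_eff, dim=-1).indices   # (T, k_eff)
```

Attention then runs only over each query's selected entries, which is where the compute and memory savings at 1 M tokens come from.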
Manifold‑constrained residual connections (mHC): Residual matrices are projected onto the Birkhoff polytope using 20 Sinkhorn‑Knopp iterations, keeping the spectral norm ≤ 1. The projection adds ~6.7 % overhead.
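A minimal sketch of the projection, assuming the matrix is made positive with an elementwise exponential before the alternating row/column normalizations (the report's exact parameterization is not specified here):

```python
import torch

def birkhoff_project(w, n_iters=20, eps=1e-8):
    """Project a residual-mixing matrix toward the Birkhoff polytope
    (doubly stochastic matrices) via Sinkhorn-Knopp."""
    m = torch.exp(w)                                # ensure strict positivity
    for _ in range(n_iters):                        # 20 iterations per the report
        m = m / (m.sum(dim=1, keepdim=True) + eps)  # normalize rows
        m = m / (m.sum(dim=0, keepdim=True) + eps)  # normalize columns
    return m
```

The spectral‑norm bound follows from Birkhoff's theorem: a doubly stochastic matrix is a convex combination of permutation matrices, each of which has spectral norm 1.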
Muon optimizer: Orthogonalizes gradient momentum via Newton‑Schulz iterations (10 mixed iterations: 8 fast‑converging, 2 for fine stability).
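A sketch of the orthogonalization step, modeled on the public Muon implementation. Mapping the quintic polynomial to the 8 “fast” iterations and the plain cubic Newton‑Schulz step to the 2 “stability” iterations is an assumption:

```python
import torch

def muon_orthogonalize(m, fast_iters=8, stable_iters=2, eps=1e-7):
    """Drive the singular values of a momentum matrix toward 1,
    approximating its orthogonal polar factor."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients from public Muon
    x = m / (m.norm() + eps)            # scale singular values into range
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                          # iterate on the short side
    for _ in range(fast_iters):          # fast-converging phase
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    for _ in range(stable_iters):        # gentler cubic phase for stability
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x.T if transposed else x
```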
Training methodology
On‑Policy Distillation (OPD) replaces the previous RL mix: domain‑specific experts (math, code, agent) are trained separately, then a student model distills from dozens of experts at once. Only the experts' final hidden states are stored; their logits are recomputed on the fly with a TileLang kernel.
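The distillation objective itself is not spelled out above; a common choice for on‑policy distillation is the reverse KL between student and teacher on sequences the student sampled. The sketch below assumes that choice, with a single teacher for brevity:

```python
import torch.nn.functional as F

def opd_loss(student_logits, teacher_logits, temperature=1.0):
    """Reverse KL(student || teacher) per token, averaged over a batch
    of student-sampled sequences. The temperature is an assumption."""
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    # teacher_logits would be recomputed on the fly from stored final
    # hidden states (the TileLang kernel's job, per the report)
    return (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1).mean()
```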
Stability techniques:
Anticipatory Routing decouples routing‑index computation from the backbone and triggers on loss spikes.
SwiGLU Clamping clips the linear branch to [-10, 10] and caps gate values at 10 (see the sketch after this list).
Generative Reward Model lets the actor network serve as its own reward evaluator.
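A minimal sketch of the clamped SwiGLU block described above; the exact placement of the clamps relative to the activation is an assumption:

```python
import torch
import torch.nn.functional as F

def clamped_swiglu(x, w_gate, w_up, w_down, limit=10.0):
    """SwiGLU with the reported stability clamps: the gate branch is
    capped at `limit`, the linear branch clipped to [-limit, limit]."""
    gate = torch.clamp(x @ w_gate, max=limit)           # cap gate values at 10
    up = torch.clamp(x @ w_up, min=-limit, max=limit)   # clip linear branch
    return (F.silu(gate) * up) @ w_down
```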
FP4‑aware quantization is applied to the MoE expert weights and the CSA indexer; FP4→FP8 de‑quantization is lossless because every FP4 value is exactly representable in FP8.
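The losslessness claim is easy to verify: FP4 (E2M1) has only 16 code points, all of which are exact FP8 values. A sketch of codebook de‑quantization, with the block scaling factors used in practice omitted for brevity:

```python
import torch

# The eight non-negative E2M1 (FP4) values; with signs, 16 code points.
FP4_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_VALUES = torch.cat([FP4_VALUES, -FP4_VALUES])

def dequantize_fp4_to_fp8(codes):
    """Look up 4-bit codes and cast to FP8 E4M3; every FP4 value is
    exactly representable in FP8, so no rounding occurs."""
    return FP4_VALUES[codes.long()].to(torch.float8_e4m3fn)
```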
Inference efficiency
Per‑token compute for V4‑Pro is 27 % of its predecessor's. KV‑cache usage drops to ~2 % of a BF16 GQA8 baseline. Mixed‑precision storage keeps RoPE components in BF16 and other tensors in FP8, halving memory volume. The serving system separates compressed KV from sliding‑window KV, supports disk‑level caching, and avoids redundant prefills.
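For a sense of scale, a back‑of‑envelope sizing of the baseline cache; the head count, head dimension, and layer depth below are illustrative assumptions, not reported figures:

```python
def kv_bytes(tokens, n_kv_heads, head_dim, n_layers, bytes_per_elem=2):
    """Bytes for a KV cache; the leading 2 counts both K and V."""
    return 2 * tokens * n_kv_heads * head_dim * n_layers * bytes_per_elem

baseline = kv_bytes(1_000_000, n_kv_heads=8, head_dim=128, n_layers=61)
print(f"BF16 GQA8 baseline: {baseline / 2**30:.0f} GiB")         # ~233 GiB
print(f"at ~2% of baseline: {0.02 * baseline / 2**30:.1f} GiB")  # ~4.7 GiB
```

At ~2 %, a cache that would not fit on a single accelerator shrinks to a few gigabytes, which is what makes 1 M‑token serving practical.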
V4‑Flash trade‑offs
V4‑Flash retains comparable agentic performance at a lower parameter budget and API cost, but its world‑knowledge depth trails V4‑Pro slightly.
API changes
New model identifiers: deepseek‑v4‑pro (performance‑oriented) and deepseek‑v4‑flash (efficiency‑oriented). Legacy names deepseek‑chat and deepseek‑reasoner will be deprecated on 2026‑07‑24.
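DeepSeek's existing API is OpenAI‑compatible; assuming that carries over, a call against the new identifiers would look like the following sketch (the endpoint and client usage mirror the current API; only the model names come from this announcement):

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-v4-pro",  # or "deepseek-v4-flash" for lower cost
    messages=[{"role": "user", "content": "Explain CSA attention briefly."}],
)
print(resp.choices[0].message.content)
```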
Resources
Technical report: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
Model hubs: https://huggingface.co/collections/deepseek-ai/deepseek-v4 and https://modelscope.cn/collections/deepseek-ai/DeepSeek-V4
API documentation for “thinking mode”: https://api-docs.deepseek.com/zh-cn/guides/thinking_mode
