DeepSeek V4 Unveiled: Dual Versions with 1M Token Context and New Mixed‑Attention Architecture
DeepSeek V4 launches in two models, Flash and Pro, both supporting up to a 1 million token context and 384 K output tokens. The pair offers non‑thinking and thinking modes with a reasoning_effort parameter, and features mixed attention, manifold‑constrained hyper‑connections, the Muon optimizer, massive training corpora, and up to a 73% FLOPs reduction versus V3.
DeepSeek V4 has been officially released in two variants, DeepSeek‑V4‑Flash and DeepSeek‑V4‑Pro, each supporting a maximum context length of 1 million tokens and an output length of up to 384 K tokens.
The new API documentation is live, and the preview version has been open‑sourced on HuggingFace (https://huggingface.co/collections/deepseek-ai/deepseek-v4).
Both variants provide a "non‑thinking" mode and a "thinking" mode; the latter exposes a reasoning_effort parameter (high/max) that controls how much reasoning the model performs, and is the recommended setting for complex agent scenarios.
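For a rough sense of how this would be invoked, the sketch below assumes DeepSeek keeps its existing OpenAI‑compatible API surface; the model name deepseek-v4-pro and the exact spelling of the request field are placeholders based on the article's description, not confirmed identifiers.

```python
# Illustrative only: assumes DeepSeek keeps its OpenAI-compatible chat API.
# The model id "deepseek-v4-pro" and the reasoning_effort field name are
# placeholders inferred from the article, not confirmed identifiers.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-v4-pro",  # hypothetical model id
    messages=[{"role": "user", "content": "Plan a multi-step data-migration agent."}],
    extra_body={"reasoning_effort": "high"},  # "high" or "max" per the article
)
print(response.choices[0].message.content)
```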
A detailed technical report is available (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf).
DeepSeek‑V4‑Pro contains 1.6 T total parameters with 49 B activated parameters, while DeepSeek‑V4‑Flash has 284 B total parameters with 13 B activated; both models retain the 1 M token context capability.
The core innovations include a mixed‑attention architecture that combines Compressed Sparse Attention (CSA) and Highly Compressed Attention (HCA), dramatically lowering computational complexity for ultra‑long contexts.
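The report's CSA/HCA mechanisms aren't detailed here, but the general trick of letting queries attend to a compressed view of the keys and values can be sketched in a few lines; the block‑mean pooling below is a deliberately crude stand‑in chosen only to illustrate the complexity reduction, not DeepSeek's actual design.

```python
# Illustration of compressed attention: queries attend to block-pooled K/V,
# cutting attention cost from O(L^2) to roughly O(L^2 / block) for long inputs.
# This is NOT the CSA/HCA design from the technical report, just the core idea.
import torch
import torch.nn.functional as F

def compressed_attention(q, k, v, block: int = 64):
    # q, k, v: (batch, seq_len, dim)
    b, L, d = k.shape
    pad = (-L) % block
    if pad:
        k = F.pad(k, (0, 0, 0, pad))
        v = F.pad(v, (0, 0, 0, pad))
    # Mean-pool keys/values over fixed-size blocks -> L/block "summary" tokens.
    k_c = k.view(b, -1, block, d).mean(dim=2)
    v_c = v.view(b, -1, block, d).mean(dim=2)
    scores = q @ k_c.transpose(-2, -1) / d**0.5   # (b, L, L/block)
    return torch.softmax(scores, dim=-1) @ v_c

q = k = v = torch.randn(1, 1024, 128)
print(compressed_attention(q, k, v).shape)  # torch.Size([1, 1024, 128])
```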
Manifold‑Constrained HyperConnections (mHC) augment the traditional residual connections, improving signal stability across layers.
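As a purely illustrative toy (the real mHC formulation is in the technical report), a hyper‑connection‑style block keeps several parallel residual streams and mixes them with a learnable, constrained matrix; the row‑stochastic softmax constraint below is only a stand‑in for the manifold constraint.

```python
# Toy "hyper-connection"-style residual: instead of y = x + f(x), keep n parallel
# residual streams mixed by a learnable matrix. The row-stochastic (softmax)
# constraint is a stand-in for the manifold constraint, chosen for concreteness.
import torch
import torch.nn as nn

class ToyHyperConnection(nn.Module):
    def __init__(self, dim: int, n_streams: int = 4):
        super().__init__()
        self.mix_logits = nn.Parameter(torch.zeros(n_streams, n_streams))
        self.f = nn.Linear(dim, dim)          # stand-in for the layer body

    def forward(self, streams):               # streams: (n_streams, batch, dim)
        mix = torch.softmax(self.mix_logits, dim=-1)       # rows constrained to a simplex
        mixed = torch.einsum("ij,jbd->ibd", mix, streams)  # learnable stream mixing
        h = self.f(mixed[0])                   # layer operates on one combined stream
        return mixed + h.unsqueeze(0)          # broadcast the update back to all streams

streams = torch.zeros(4, 2, 8)
print(ToyHyperConnection(8)(streams).shape)  # torch.Size([4, 2, 8])
```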
The Muon optimizer is introduced to accelerate convergence and enhance training stability.
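Muon is a publicly documented optimizer; its core idea, per the public reference implementation, is to orthogonalize each 2‑D weight update with a few Newton‑Schulz iterations before applying it. Whether DeepSeek's variant matches this exactly is not stated, so treat the sketch below as the generic Muon update.

```python
# Sketch of Muon's core update (following the public reference implementation,
# not necessarily DeepSeek's exact variant): SGD momentum, then Newton-Schulz
# orthogonalization of the 2-D update before applying it.
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    momentum_buf.mul_(beta).add_(grad)        # classic momentum accumulation
    update = newton_schulz5(momentum_buf)     # orthogonalized update direction
    param.add_(update, alpha=-lr)
    return param

W = torch.randn(256, 128)
buf = torch.zeros_like(W)
muon_step(W, torch.randn_like(W), buf)
```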
Training employed massive datasets—32 T tokens for Flash and 33 T tokens for Pro—followed by specialized training and strategy distillation to boost performance on reasoning, programming, and world‑knowledge tasks.
Long‑context efficiency gains are significant: compared with DeepSeek‑V3, FLOPs are reduced by 73% and KV‑cache size by 90%, making million‑token inference feasible.
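A quick back‑of‑envelope calculation shows why the KV‑cache reduction matters at this scale; the hidden width, layer count, and fp16 storage below are illustrative placeholders, not published V4 figures.

```python
# Back-of-envelope: what a 90% KV-cache reduction means at 1M tokens.
# Layer count, per-layer K+V width, and fp16 storage are illustrative
# placeholders, not published DeepSeek-V4 numbers.
tokens     = 1_000_000
layers     = 60            # assumed
kv_dim     = 2 * 7168      # assumed K + V width per layer
bytes_elem = 2             # fp16

baseline_gb = tokens * layers * kv_dim * bytes_elem / 1024**3
reduced_gb  = baseline_gb * (1 - 0.90)   # 90% KV-cache reduction vs. V3, per the article
print(f"baseline KV cache ~ {baseline_gb:.0f} GiB, reduced ~ {reduced_gb:.0f} GiB")
```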
Evaluation shows the V4‑Pro‑Max version setting new records on reasoning and knowledge benchmarks, surpassing previous open‑source models and approaching proprietary model performance; V4‑Flash‑Max delivers comparable inference speed at a smaller parameter scale.
Hardware support includes a live demonstration on Huawei Ascend (scheduled for 4 PM) and Day 0 adaptation across Cambricon's software and hardware ecosystem via the vLLM inference framework, with the adaptation code open‑sourced on GitHub.
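Assuming the Day 0 support follows vLLM's standard offline‑inference interface, loading the open preview would look roughly like this; the repository name is taken from the HuggingFace collection above, and any V4‑specific engine flags are unknown.

```python
# Rough sketch of serving the preview through vLLM's standard offline API.
# The exact repo name and any V4-specific engine arguments are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Pro",  # from the HuggingFace collection; exact id unconfirmed
    tensor_parallel_size=8,               # illustrative multi-GPU setting
    trust_remote_code=True,
)
outputs = llm.generate(
    ["Summarize the DeepSeek-V4 technical report in three bullet points."],
    SamplingParams(temperature=0.6, max_tokens=512),
)
print(outputs[0].outputs[0].text)
```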
DeepSeek’s official statement concludes with a philosophical quote from Xunzi, emphasizing a calm, steadfast approach.
