DeepSeek‑V4 Open‑Sources Its Million‑Token Architecture and Calls Out Claude Opus 4.6
DeepSeek‑V4’s open‑source report reveals a hybrid CSA/HCA attention design, manifold‑constrained residuals, and the Muon optimizer, which together cut per‑token FLOPs to 27 % and KV‑cache cost to 10 % of V3.2’s at a 1 M‑token context. Benchmark results show it outperforming Claude Opus 4.6 on most tasks, while still lagging on complex instruction following and multi‑turn dialogue.
1. Million‑token context efficiency breakthrough
DeepSeek‑V4 (Pro/Flash) targets the efficiency bottleneck of ultra‑long context. By combining Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA), and applying manifold‑constrained hyper‑connections (mHC) and the Muon optimizer, V4‑Pro (1.6 T parameters, 49 B activated) and V4‑Flash (284 B parameters, 13 B activated) achieve only 27 % of the per‑token FLOPs and 10 % of the KV‑cache cost of V3.2 at a 1 M‑token context.
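The KV‑cache side of these savings can be sanity‑checked with back‑of‑envelope arithmetic. The compression ratios below (`m = 8`, `m' = 64`) are illustrative assumptions, not figures from the report, which only states the end result of roughly 10 % of V3.2's cost:

```python
# Back-of-envelope KV-cache size at a 1M-token context.
# Block sizes m=8 and m'=64 are assumed for illustration only.

def kv_cache_entries(seq_len: int, m: int) -> int:
    """KV entries kept when every m tokens are compressed into one."""
    return -(-seq_len // m)  # ceiling division

seq_len = 1_000_000
baseline = kv_cache_entries(seq_len, 1)    # one entry per token
csa = kv_cache_entries(seq_len, 8)         # assumed CSA block size m
hca = kv_cache_entries(seq_len, 64)        # assumed HCA block size m'

print(f"baseline entries: {baseline:,}")
print(f"CSA entries:      {csa:,}  ({csa / baseline:.1%} of baseline)")
print(f"HCA entries:      {hca:,}  ({hca / baseline:.2%} of baseline)")
```

Any such scheme trades cache size for retrieval fidelity, which is why the design pairs compression with a selection mechanism (Section 6.1) rather than compressing uniformly.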
2. Direct jab at Claude
In the “White‑Collar Task” evaluation, DeepSeek‑V4‑Pro‑Max was pitted against Claude Opus 4.6‑Max. The report quotes a human‑evaluation comment: “It also excels in long‑form generation, delivering in‑depth, coherent narratives rather than relying on the overly simplistic bullet points frequently produced by Opus‑4.6‑Max.”
Figure 11 shows win rates (DeepSeek vs Claude): analysis 55.0 % vs 37.0 %, generation 52.0 % vs 38.0 %, editing 47.0 % vs 35.0 %, overall 53.0 % vs 37.0 %.
Figure 12 provides detailed dimension scores, where DeepSeek leads in Task Completion, Content Quality and Formatting Aesthetics, but Claude slightly edges it in Instruction Following.
3. Benchmark standing beyond the jab
On public benchmarks, V4‑Pro‑Max ranks among the top open‑source models: SimpleQA‑Verified 57.9 (highest open‑source), Codeforces Rating 3206 (top 23 humans), Apex Shortlist 90.2 (surpassing GPT‑5.4 78.1 and Gemini‑3.1‑Pro 89.1), HMMT 2026 Feb 95.2, IMOAnswerBench 89.8.
The report notes that on knowledge‑heavy benchmarks (MMLU‑Pro, GPQA, HLE) V4‑Pro‑Max still trails Gemini‑3.1‑Pro, and on agent tasks it remains behind Claude Opus 4.6 and GPT‑5.4, but it is the first open‑source model to match frontier closed‑source performance in reasoning and code‑competition tasks.
4. Chinese‑language dominance
In Chinese writing evaluations, DeepSeek‑V4‑Pro outperforms Gemini‑3.1‑Pro with an overall win‑rate of 62.7 % vs 34.1 %, excelling in technical text (75.86 %), email (73.29 %) and personal reflections (75.56 %).
Agentic Search also shows a qualitative leap over traditional RAG, as illustrated by the accompanying figures.
5. Acknowledged gaps
Table 14 shows that on complex instruction following and multi‑turn writing, Claude Opus 4.5 still leads (DeepSeek vs Claude: 46.9 % vs 53.1 % and 45.6 % vs 51.7 % respectively), giving Claude an overall advantage of 52.0 % to 45.9 %.
6. Technical foundations behind the confidence
6.1 Hybrid attention: CSA + HCA
CSA compresses every m tokens into one KV entry and performs sparse Top‑k selection via a Lightning Indexer. HCA applies a much higher compression ratio m′ while retaining dense attention. Together they reduce KV‑Cache to roughly 2 % of traditional GQA at 1 M tokens.
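The CSA half of the mechanism can be sketched in a few lines. Mean pooling for block compression and a dot‑product scorer standing in for the Lightning Indexer are assumptions for illustration; the report's indexer is a learned module:

```python
import numpy as np

# Sketch of block-compressed sparse attention in the spirit of CSA:
# every m tokens are pooled into one KV entry, a lightweight indexer
# scores the compressed entries, and only the top-k blocks receive
# dense attention. Mean pooling and dot-product scoring are assumed.

rng = np.random.default_rng(0)
d, m, k = 16, 4, 2                 # head dim, block size, top-k blocks
keys = rng.normal(size=(32, d))    # 32 cached key vectors
query = rng.normal(size=(d,))

# 1) Compress every m keys into one entry (mean pooling, assumed).
blocks = keys.reshape(-1, m, d).mean(axis=1)           # (8, d)

# 2) Indexer: score each compressed entry against the query.
scores = blocks @ query                                # (8,)

# 3) Keep only the top-k blocks; attend densely inside them.
top = np.argsort(scores)[-k:]
selected = keys.reshape(-1, m, d)[top].reshape(-1, d)  # (k*m, d)

attn_logits = selected @ query / np.sqrt(d)
weights = np.exp(attn_logits - attn_logits.max())
weights /= weights.sum()
print("selected blocks:", sorted(top.tolist()))
```

The cache holds one entry per block for scoring plus the raw keys of only the selected blocks for dense attention, which is where the quoted ~2 % figure relative to GQA comes from.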
6.2 Muon optimizer
The Muon optimizer is introduced for trillion‑parameter MoE training. It orthogonalizes gradient updates via a hybrid Newton‑Schulz iteration (8 fast‑converging steps followed by 2 fine‑tuning steps) and uses FP4 quantization‑aware training to reduce memory consumption.
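The core of such an optimizer is the Newton‑Schulz step itself. The sketch below uses the quintic coefficients popularized by the open‑source Muon implementation and a single fixed coefficient set, which is an assumption; the report's "8 fast + 2 fine‑tuning" split implies two coefficient regimes that are not reproduced here:

```python
import numpy as np

# Newton-Schulz orthogonalization as used in Muon-style optimizers:
# a polynomial iteration pushes a matrix's singular values toward 1,
# approximating the nearest orthogonal matrix to the gradient.
# Coefficients are from the public Muon implementation (assumed).

def newton_schulz(G: np.ndarray, steps: int = 10) -> np.ndarray:
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # scale so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

rng = np.random.default_rng(0)
G = rng.normal(size=(6, 6))
O = newton_schulz(G)
# After iteration, O's singular values cluster in a band around 1.
print("singular values:", np.round(np.linalg.svd(O, compute_uv=False), 2))
```

The iteration uses only matrix multiplications, which is what makes it attractive at trillion‑parameter scale: no SVD, and every step maps cleanly onto accelerator GEMMs.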
6.3 Post‑training: OPD unified expert
Two‑stage post‑training first trains specialist experts for math, code, agent and instruction domains (SFT + GRPO), then merges them into a single model via On‑Policy Distillation (OPD) using full‑vocabulary reverse KL distillation.
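The distillation objective named here, full‑vocabulary reverse KL, can be illustrated on toy logits. The distributions below are made up for illustration, and the sketch shows only the per‑position loss, not the on‑policy sampling loop:

```python
import numpy as np

# Full-vocabulary reverse KL, the objective the report attributes to
# OPD: the student is trained to minimize KL(student || teacher),
# summed over the entire vocabulary at each position. Logits are toy.

def reverse_kl(student_logits: np.ndarray, teacher_logits: np.ndarray) -> float:
    """KL(p_student || p_teacher) over the full vocabulary."""
    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()
    p, q = softmax(student_logits), softmax(teacher_logits)
    return float(np.sum(p * (np.log(p) - np.log(q))))

student = np.array([2.0, 1.0, 0.1, -1.0])
teacher = np.array([2.2, 0.8, 0.0, -1.2])
print("reverse KL:", round(reverse_kl(student, teacher), 4))
```

Reverse KL is mode‑seeking: it heavily penalizes the student for placing mass where the teacher has little, which suits merging specialist experts into one model without averaging away their sharp, domain‑specific behavior.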
https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
