DeepSeek V4’s Silent Launch: 1.6 T Parameters, Triple Innovation, and Redefined Accessibility
DeepSeek V4 debuted quietly: a 1.6‑trillion‑parameter MoE model built on three headline innovations — CSA+HCA compressed attention, mHC manifold‑constrained hyperconnections, and the Muon optimizer. It offers a 1M‑token context at roughly a quarter of V3's inference cost, top Codeforces and LiveCodeBench scores, a price about 1/7 of Claude Opus, MIT open‑source licensing, and dual‑stack Ascend NPU/NVIDIA GPU support.
Release and core question
On 24 April DeepSeek released V4, a 1.6‑trillion‑parameter mixture‑of‑experts model with a 1‑million‑token context window, under an MIT license.
Metrics beyond parameter count
Performance is evaluated on three axes: capability ceiling (hardest tasks), inference efficiency (compute per token), and accessibility (affordability for users and enterprises). V4 aims to raise the capability ceiling while keeping inference cost growth sub‑linear in context length.
Architectural innovations
Compressed Sequence Attention (CSA) and Heavily Compressed Attention (HCA)
Standard attention scales as O(L²) in sequence length L. CSA compresses the token sequence to strip redundancy before attention; HCA applies a further, heavier compression aimed at long‑range dependencies. At a 1M‑token context, V4‑Pro's per‑token FLOPs are 27 % of V3.2's and its KV cache occupies 10 % of the memory, cutting inference cost to roughly one quarter of V3's on the same hardware.
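The CSA/HCA compression schemes themselves are unpublished, but the cost mechanism — attend over a compressed sequence so both FLOPs and KV cache shrink — can be illustrated with a toy sketch. Here keys/values are mean‑pooled in blocks before attention; the block size and pooling choice are assumptions, not DeepSeek's method:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compressed_attention(q, kv_tokens, block=8):
    """Toy 'compressed attention': mean-pool every `block` tokens so
    the KV cache and attention FLOPs shrink by ~`block`x. Illustrative
    only -- not the actual CSA/HCA algorithm."""
    L, d = kv_tokens.shape
    pooled = kv_tokens[: L - L % block].reshape(-1, block, d).mean(axis=1)
    scores = softmax(q @ pooled.T / np.sqrt(d))  # length L/block, not L
    return scores @ pooled, pooled.shape[0]

rng = np.random.default_rng(0)
seq = rng.standard_normal((1024, 64))  # 1024 cached tokens, head dim 64
q = rng.standard_normal(64)            # one query vector
out, kv_len = compressed_attention(q, seq, block=8)
# kv_len is 128: an 8x smaller KV cache for this query
```

The same idea applied more aggressively to distant context is the intuition behind the "heavily compressed" variant.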
Manifold‑constrained Hyperconnection (mHC)
Scaling MoE from 671 B to 1.6 T parameters caused gradient explosion and routing collapse. mHC imposes a geometric constraint that forces signals to travel on a structured manifold, suppressing gradient dispersion and enabling stable training of >1 T‑parameter MoE models.
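The mHC manifold is not publicly specified. As one toy instance of "constrain the weights to a manifold so signals cannot blow up across layers", the sketch below row‑normalizes a hyperconnection‑style mixing matrix onto the unit sphere; the stream count n=4 and the choice of sphere are assumptions for illustration:

```python
import numpy as np

def project_rows_to_sphere(w, eps=1e-8):
    """Project each row of a mixing matrix onto the unit sphere.
    A stand-in for mHC's (unpublished) structured manifold."""
    return w / np.maximum(np.linalg.norm(w, axis=1, keepdims=True), eps)

def mix_streams(streams, w):
    """Hyperconnection-style mixing of n parallel residual streams
    under the manifold constraint."""
    return project_rows_to_sphere(w) @ streams

rng = np.random.default_rng(0)
streams = rng.standard_normal((4, 64))   # n=4 residual streams (assumed)
w = 100.0 * rng.standard_normal((4, 4))  # deliberately badly scaled weights
out = mix_streams(streams, w)
# Each mixed stream's norm stays bounded by the spectral norm of
# `streams`, no matter how large `w` grows -- the constraint caps gain.
```

The point of the sketch: with unconstrained mixing weights, repeated layer‑to‑layer mixing can amplify activations geometrically; a manifold constraint caps the per‑layer gain, which is the stability property the article attributes to mHC.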
Muon optimizer
Muon replaces AdamW with an update rule that orthogonalizes the momentum matrix before each step. Under equal compute it converges faster and reaches a lower final loss, demonstrated on more than 32 T tokens of pre‑training data.
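The publicly documented Muon optimizer (Keller Jordan's reference code) orthogonalizes the momentum matrix via a quintic Newton–Schulz iteration; whether V4's variant matches it exactly is an assumption. A minimal numpy sketch:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a matrix with the quintic
    Newton-Schulz iteration used by the public Muon optimizer
    (coefficients from the reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius normalization
    tall = x.shape[0] > x.shape[1]
    if tall:
        x = x.T                          # iterate on the smaller Gram matrix
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if tall else x

def muon_step(w, grad, momentum, lr=0.02, beta=0.95):
    """One Muon update: accumulate momentum, then step in the
    direction of its (approximately) orthogonalized value."""
    momentum = beta * momentum + grad
    return w - lr * newton_schulz_orthogonalize(momentum), momentum
```

After five iterations the singular values of the update cluster near 1, so every direction in the weight matrix receives a similarly sized step — the property usually credited for Muon's faster convergence at equal compute.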
Benchmark results
Public evaluations show:
- Codeforces rating: 3206 (highest among compared models).
- LiveCodeBench: 93.5 %.
- SWE-bench Verified: 80.6 % (self‑reported; not directly comparable to Claude Opus 4.6's 87.6 %).
- Terminal Bench: 67.9 %.
On the FundaAI 38‑task suite:
- Weighted average: Claude Opus 4.6 (think) 8.72 > V4‑Pro 8.27 > V4‑Flash 8.01.
- Financial research: V4‑Pro ties with Opus 4.7 (7:7).
- Game theory (NVDA task): V4‑Pro scores 10/10.
- Cost per task: V4‑Flash $0.007, far cheaper than Claude Opus.
In knowledge and reasoning benchmarks V4‑Pro trails Opus 4.6 and Gemini 3.1 Pro on MMLU‑Pro (87.5 % vs 89.1 %/91.0 %) and GPQA Diamond (90.1 % vs 91.3 %/94.3 %). On IMOAnswerBench (89.8 vs 75.3/91.4) and SimpleQA‑Verified (57.9 vs 46.2/75.6) it clearly beats Opus 4.6 but still trails Gemini 3.1 Pro.
Hardware validation
V4 is the first frontier model verified on both Huawei Ascend NPU and NVIDIA GPU. Reported numbers:
- Ascend 950: 20 ms latency for V4‑Pro, 10 ms for V4‑Flash.
- Ascend A3 super‑node: reference training implementation provided.
- Cambricon: Day‑0 vLLM support completed.
Pricing and licensing
MIT‑licensed on HuggingFace. Output cost per million tokens:
- V4‑Flash: $0.28 (baseline, 1×).
- V4‑Pro: $3.48 (≈ 12× baseline; ≈ 1/4 of GPT‑5.4's $15 and ≈ 1/7 of Claude Opus 4.6's $25).
- Night‑time (23:00–07:00 CST) cache‑hit price: ¥0.2 per million tokens.
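The quoted multipliers follow directly from the listed prices; a quick arithmetic check:

```python
# Output prices per million tokens as quoted in the article
prices = {
    "V4-Flash": 0.28,
    "V4-Pro": 3.48,
    "GPT-5.4": 15.00,
    "Claude Opus 4.6": 25.00,
}

pro = prices["V4-Pro"]
pro_vs_flash = pro / prices["V4-Flash"]        # ~12.4x the Flash baseline
gpt_vs_pro = prices["GPT-5.4"] / pro           # GPT-5.4 costs ~4.3x more
opus_vs_pro = prices["Claude Opus 4.6"] / pro  # Opus 4.6 costs ~7.2x more
```

Rounded, these reproduce the article's "≈ 12×", "≈ 1/4", and "≈ 1/7" figures.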
Limitations
Thought‑mode performance lags behind Claude Opus 4.6, with occasional time‑outs on complex reasoning.
Inference throughput for V4‑Pro currently depends on NVIDIA GPUs; Ascend production capacity is a short‑term bottleneck.
Long‑term sustainability of ultra‑low pricing depends on scaling and ecosystem value.