DeepSeek V4 Preview: Key Technical Highlights, Benchmarks, and Pricing
The DeepSeek‑V4 preview details two model variants, Pro and Flash, at trillion‑ and hundred‑billion‑parameter scale; reports benchmark scores that match or surpass leading overseas models on code generation, real‑world repository fixes, engineering tasks, and world knowledge; and covers the core technical innovations, pricing, API endpoints, and open‑source licensing.
Model Versions
DeepSeek‑V4 is released in two parallel versions. The flagship DeepSeek‑V4‑Pro has 1.6 trillion total parameters and 49 billion activated parameters, offering three inference modes (Non‑think, Think High, Think Max) whose performance on complex agentic tasks approaches that of top‑tier closed‑source models. The cost‑effective DeepSeek‑V4‑Flash contains 284 billion total parameters and 13 billion activated parameters, inherits the Pro training pipeline, provides a 1‑million‑token context window, and is positioned as a foundation for enterprise‑scale deployment and cost reduction, with early Coding Agent capabilities.
Benchmark Highlights (selected)
Complex Code Generation (LiveCodeBench): V4‑Pro scores 93.5, surpassing Gemini‑3.1‑Pro High (91.7) and Opus‑4.6 Max (88.8).
Real‑world Repository Fixes (SWE Verified): V4‑Pro scores 80.6, comparable to Gemini‑3.1‑Pro High and slightly below Opus‑4.6 Max (80.8).
Complex Engineering Tasks (SWE Pro): V4‑Pro scores 55.4, near the top tier but below K2.6 (58.6) and GLM‑5.1 (58.4).
Agent Terminal Tasks (Terminal 2.0): V4‑Pro scores 67.9, below GPT‑5.4 xHigh (75.1) and close to Gemini‑3.1‑Pro High (68.5).
General World Knowledge (MMLU‑Pro): V4‑Pro achieves 87.5 in Think Max mode, matching GPT‑5.4 xHigh and trailing Gemini‑3.1‑Pro High (91.0).
Commercial Deployment Cost: V4‑Pro is described as offering extremely high cost‑effectiveness, with an API pricing advantage over most overseas models.
Core Technical Innovations
Hybrid Attention: Combines Compressed Sparse Attention (CSA) and Heavy‑Compression Attention (HCA), reducing inference compute to 27% and KV‑cache demand to 10% at 1M context; see the memory sketch after this list.
Manifold‑Constrained Hyper‑Connection (mHC): Enhances residual connections to improve signal fidelity across layers, stabilizing inference in very deep models.
Muon Optimizer: Replaces the traditional Adam family of optimizers, delivering faster convergence and greater training stability at ultra‑large scale; see the update‑rule sketch below.
Modular Post‑Training Pipeline: A two‑stage paradigm in which SFT and GRPO first train independent experts, which are then distilled losslessly into the base model; see the GRPO sketch below.
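To make the 10% KV‑cache claim concrete, the following back‑of‑the‑envelope sketch compares full and compressed cache footprints at 1M‑token context. The preview does not publish V4's layer count, head count, or head dimension, so the architecture numbers below are hypothetical placeholders, not real V4 figures.

```python
# Back-of-the-envelope KV-cache sizing for a single 1M-token sequence.
# layers, kv_heads, and head_dim are hypothetical, not published V4 figures.
context_len = 1_000_000
layers, kv_heads, head_dim = 60, 8, 128
bytes_per_elem = 2  # bf16

# A full cache stores both K and V (hence the factor of 2) for every layer.
full_bytes = 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem
compressed_bytes = 0.10 * full_bytes  # the claimed 10% footprint

print(f"full:       {full_bytes / 2**30:7.1f} GiB")        # ~228.9 GiB
print(f"compressed: {compressed_bytes / 2**30:7.1f} GiB")  # ~22.9 GiB
```

Even under these made‑up dimensions, the difference between roughly 229 GiB and 23 GiB per sequence is what separates multi‑GPU serving from single‑node serving at 1M context.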
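The preview does not spell out the optimizer's update rule, but Muon is a published algorithm whose core step replaces Adam's per‑coordinate scaling with an approximate orthogonalization of the momentum matrix via a Newton‑Schulz iteration. A simplified single‑matrix sketch, with iteration coefficients taken from the public reference implementation:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G onto the nearest semi-orthogonal matrix (Muon's core)."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients from the reference impl
    X = G / (G.norm() + 1e-7)           # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the short side for cheaper matmuls
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One Muon update for a 2-D weight: momentum accumulation, then an orthogonalized step."""
    momentum.mul_(beta).add_(grad)
    weight.add_(newton_schulz_orthogonalize(momentum), alpha=-lr)
```

The orthogonalized update gives every singular direction of the step roughly equal magnitude, which is the usual explanation for Muon's stability on large 2‑D weight matrices; whether V4 modifies this recipe is not stated in the preview.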
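GRPO itself predates V4 (it was introduced for DeepSeekMath): instead of training a value model, it samples several responses per prompt and normalizes each response's reward against the rest of its group. A toy sketch of that group‑relative advantage, with made‑up reward values:

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages: z-score each sampled response's reward
    against the other samples for the same prompt (no learned value model)."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in group_rewards]

# Hypothetical rewards for 4 responses sampled from one prompt.
print(grpo_advantages([0.2, 0.9, 0.4, 0.5]))
```

How the independently trained SFT/GRPO experts are then distilled back into the base model is not detailed in the preview beyond the "lossless" claim.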
Pricing (per 1 M tokens)
V4‑Flash: $0.028 input (cache hit), $0.14 input (cache miss), $0.28 output; approximately ¥0.2 / ¥1.0 / ¥2.0.
V4‑Pro: $0.145 input (cache hit), $1.74 input (cache miss), $3.48 output; approximately ¥1.0 / ¥12 / ¥24.
The note emphasizes that V4‑Flash's API cost is significantly lower than that of most overseas models, though the exact multiple depends on the comparison model and the cache hit rate.
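Because input pricing splits on cache hits, the effective cost of a workload depends on its hit rate. A minimal sketch using the V4‑Flash numbers above; the request sizes and hit rate are made up for illustration:

```python
# Effective cost of a hypothetical DeepSeek-V4-Flash workload, using the
# prices from the table above (USD per 1M tokens).
PRICE_HIT, PRICE_MISS, PRICE_OUT = 0.028, 0.14, 0.28

def cost_usd(input_tokens: int, output_tokens: int, cache_hit_rate: float) -> float:
    """Blend hit/miss input pricing by the fraction of input tokens served from cache."""
    hit = input_tokens * cache_hit_rate * PRICE_HIT / 1e6
    miss = input_tokens * (1 - cache_hit_rate) * PRICE_MISS / 1e6
    out = output_tokens * PRICE_OUT / 1e6
    return hit + miss + out

# e.g. 800k input tokens at a 70% cache hit rate, plus 100k output tokens
print(f"${cost_usd(800_000, 100_000, 0.70):.4f}")  # ≈ $0.0773
```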
API Usage
Two compatible endpoints are provided: an OpenAI‑compatible endpoint (https://api.deepseek.com) and an Anthropic‑compatible endpoint (https://api.deepseek.com/anthropic). Model names are deepseek-v4-pro and deepseek-v4-flash. The service supports native JSON output, tool calls, prefix completion, and fill‑in‑the‑middle (FIM) code completion. The legacy deepseek-chat / deepseek-reasoner endpoints will be retired on 2026‑07‑24 and will automatically map to V4‑Flash modes during the transition.
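A minimal request against the OpenAI‑compatible endpoint might look like the following; the base URL and model names come from the article, while the API key handling and prompt are illustrative (a sketch, not official sample code):

```python
# Minimal chat request via the OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",      # issued by the DeepSeek platform
    base_url="https://api.deepseek.com",  # OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",            # or "deepseek-v4-pro"
    messages=[{"role": "user", "content": "Summarize the V4 pricing table."}],
)
print(response.choices[0].message.content)
```

Per the article, the Anthropic‑compatible endpoint (https://api.deepseek.com/anthropic) should work analogously with an Anthropic SDK client pointed at that base URL.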
Ecosystem and Open Source
The model weights (all 1.6 trillion parameters) are released under the MIT license, allowing commercial deployment and downstream development. DeepSeek‑V4 is also described as the first model to achieve native adaptation to Huawei's Ascend AI chips, reducing reliance on CUDA and promoting a domestic compute ecosystem. The authors argue that the model's scale, benchmark performance, and open licensing give enterprises and researchers a more accessible path to advanced AI capabilities.