DeepSeek-V4 Launches: Million-Token Context Becomes Affordable for All
DeepSeek-V4 introduces a hybrid attention architecture, manifold‑constrained hyper‑connections, and the Muon optimizer to cut inference FLOPs and KV cache dramatically, enabling open‑source models to handle million‑token contexts at a fraction of the cost of leading closed‑source services while matching their performance.
DeepSeek-V4 has been released, marking a watershed moment for the open‑source AI community by delivering efficient million‑token context capability that was previously limited to a few tech giants.
Key Architectural and Optimization Breakthroughs
The model series incorporates three major upgrades:
Hybrid attention architecture that combines Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA), reducing single‑token inference FLOPs to 27% of DeepSeek‑V3.2's and KV‑cache usage to 10% in million‑token scenarios (see the sketch after this list).
Manifold‑Constrained Hyper‑Connections (mHC) that enhance traditional residual links, preserving model expressiveness while stabilising cross‑layer signal propagation.
Muon optimizer, which accelerates convergence and improves training stability (sketched after the next paragraph).
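The internals of CSA and HCA are not public beyond this summary, but the general pattern behind such designs can be illustrated: store keys and values as a small shared latent per position (shrinking the KV cache), and let each query attend only to its top‑scoring positions (shrinking per‑token FLOPs). The sketch below is a minimal PyTorch illustration of that pattern; every shape, name, and parameter in it is a hypothetical stand‑in, not DeepSeek's actual implementation.

```python
# Minimal sketch: sparse attention over a compressed KV cache.
# All names (compressed_sparse_attention, W_k, W_v, k_top) and shapes
# are illustrative assumptions, not DeepSeek's published design.
import torch
import torch.nn.functional as F

def compressed_sparse_attention(q, kv_compressed, W_k, W_v, k_top=64):
    """q: (d,) query; kv_compressed: (T, r) cached latent per position;
    W_k, W_v: (r, d) decompression maps. Attend to top-k positions only."""
    # Score against the cheap compressed cache first (T x r work, not T x d).
    coarse_scores = kv_compressed @ (W_k @ q)            # (T,)
    k_top = min(k_top, kv_compressed.shape[0])
    top_scores, top_idx = coarse_scores.topk(k_top)      # keep k positions
    # Decompress values only for the selected positions.
    v_sel = kv_compressed[top_idx] @ W_v                 # (k, d)
    weights = F.softmax(top_scores / q.shape[0] ** 0.5, dim=0)
    return weights @ v_sel                               # (d,)

d, r, T = 128, 16, 1_000       # full dim, latent dim, cached tokens
q = torch.randn(d)
kv_cache = torch.randn(T, r)   # one small latent per token instead of full K and V
W_k, W_v = torch.randn(r, d), torch.randn(r, d)
out = compressed_sparse_attention(q, kv_cache, W_k, W_v)
print(out.shape)               # torch.Size([128])
```

Here the cache stores r floats per token instead of 2·d for full keys and values, and each new token does attention work proportional to the selected k rather than the full context length, which is the lever behind the FLOPs and cache reductions claimed above.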
These innovations let the model weigh the importance of different information and compute selectively, dramatically lowering both compute and KV‑cache demands when processing ultra‑long texts.
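On the training side, Muon is a published optimizer: it applies momentum to the gradient of a 2‑D weight matrix and then approximately orthogonalizes the update with a Newton‑Schulz iteration. The sketch below follows the publicly described recipe; the hyperparameters, and whether DeepSeek‑V4 uses exactly this variant, are assumptions.

```python
# Sketch of a Muon-style update step. Coefficients follow the public
# reference implementation; DeepSeek-V4's exact variant is not specified.
import torch

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315    # quintic iteration coefficients
    X = G / (G.norm() + eps)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                           # iterate on the short side
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon update for a 2-D weight matrix (in place)."""
    momentum_buf.mul_(beta).add_(grad)                   # heavy-ball momentum
    update = newton_schulz(grad + beta * momentum_buf)   # Nesterov-style blend
    # Scale so the update magnitude is roughly shape-independent.
    scale = max(1.0, param.shape[0] / param.shape[1]) ** 0.5
    param.add_(update, alpha=-lr * scale)

W = torch.randn(256, 128)
buf = torch.zeros_like(W)
g = torch.randn_like(W)   # stand-in gradient
muon_step(W, g, buf)
```

Orthogonalizing the update equalizes its singular values so no single direction dominates a step, which is the property usually credited for the faster, more stable convergence described above.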
Cost and Accessibility Impact
With resource requirements slashed, developers can feed an entire code repository or a massive document collection to the model at low cost. The DeepSeek‑V4‑Flash API is priced at 2 CNY (≈ 0.3 USD) per million tokens, versus the $30 per million tokens quoted for OpenAI's GPT‑5.5, effectively breaking the cost barrier for individuals and small businesses.
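For concreteness, here is the arithmetic behind that claim as a small Python snippet; the 5‑million‑token example workload and the CNY‑to‑USD conversion are illustrative assumptions, while the per‑million‑token prices are the ones quoted above.

```python
# Back-of-envelope cost comparison using the quoted prices:
# 2 CNY (~$0.30) per million tokens for V4-Flash vs. $30 per million
# tokens for GPT-5.5. Workload size is an illustrative assumption.
TOKENS = 5_000_000                 # e.g., a large code repository
FLASH_USD_PER_M = 0.30
GPT_USD_PER_M = 30.00

flash_cost = TOKENS / 1_000_000 * FLASH_USD_PER_M
gpt_cost = TOKENS / 1_000_000 * GPT_USD_PER_M
print(f"V4-Flash: ${flash_cost:.2f}  GPT-5.5: ${gpt_cost:.2f}  "
      f"ratio: {gpt_cost / flash_cost:.0f}x")
# V4-Flash: $1.50  GPT-5.5: $150.00  ratio: 100x
```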
Model Variants and Parameter Configurations
Two main variants are offered:
DeepSeek‑V4‑Pro with 1.6 T total parameters and 49 B activated parameters.
DeepSeek‑V4‑Flash with 284 B total parameters and 13 B activated parameters.
Both support a context length of one million tokens. The base version runs at FP8 precision, while the instruction‑tuned version uses an FP4 + FP8 mixture (FP4 for MoE expert parameters, FP8 for the rest).
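A checkpoint exporter for such a split might assign dtypes by parameter name, as in the hypothetical sketch below; the parameter names and the matching rule are illustrative assumptions, since the actual checkpoint layout is not specified here.

```python
# Hypothetical mixed-precision plan: FP4 for MoE expert weights,
# FP8 for everything else, matching the split described above.
def assign_precision(param_names):
    plan = {}
    for name in param_names:
        # ".experts." as a marker for MoE expert tensors is an assumption.
        plan[name] = "fp4" if ".experts." in name else "fp8"
    return plan

names = [
    "layers.0.attn.q_proj.weight",
    "layers.0.mlp.experts.17.up_proj.weight",
    "layers.0.mlp.experts.17.down_proj.weight",
    "layers.0.mlp.router.weight",
]
for name, dtype in assign_precision(names).items():
    print(f"{dtype}: {name}")
```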
Performance and Inference Modes
Both variants support three inference‑intensity modes. At the same mode setting, DeepSeek‑V4‑Flash delivers performance comparable to V4‑Pro at an output cost more than ten times lower, making it an exceptional value for everyday tasks.
Benchmark Results: Matching Top Closed‑Source Models
DeepSeek‑V4‑Pro‑Max’s performance gap to the world’s leading closed‑source models is now minimal. In Agentic Coding evaluations, V4‑Pro achieves the best open‑source score and performs strongly on other agent benchmarks, surpassing Sonnet 4.5 and approaching Opus 4.6 in non‑thinking mode.
In world‑knowledge tests, V4‑Pro leads all other open models and trails only the top closed model Gemini‑Pro‑3.1. It also outperforms all publicly evaluated open models on mathematics, STEM, and competitive coding tasks, delivering results comparable to the best closed‑source systems.
Compatibility and Ecosystem Integration
DeepSeek‑V4 has been adapted and optimised for major agent products such as Claude Code, OpenClaw, OpenCode, and CodeBuddy. It also fully supports Chinese domestic chips, reducing reliance on Nvidia's CUDA ecosystem and opening a path toward diversified, independently controlled compute resources.
Conclusion
With its architectural innovations, dramatically reduced resource requirements, and competitive benchmark performance, DeepSeek‑V4 positions the open‑source community as a strong driver of two coming trends, million‑token context and high‑performance agents, lowering the entry barrier without sacrificing capability.