DeepSeek Slashes Prices Permanently, Cutting Model Costs to Near‑Zero
In April‑May 2026, DeepSeek permanently cut API prices for V4‑Pro and V4‑Flash by up to 97.5%, crediting a hybrid‑attention architecture and a smaller KV cache. The move reshapes large‑model pricing, promises large cost savings for users, and signals a broader industry shift.
DeepSeek announced that the V4‑Pro API price would drop to one‑quarter of its original level and that the cut would be permanent; Xiaomi followed with a permanent reduction for its MiMoV2.5 series APIs.
The permanent revision covers the two main V4 variants: V4‑Pro (a 1.6‑trillion‑parameter MoE model with a million‑token context window) and V4‑Flash (a lightweight general‑purpose version). Pricing changes are:
Uncached input: from 12 ¥/M tokens to 3 ¥/M tokens (‑75%).
Inference output: from 24 ¥/M tokens to 6 ¥/M tokens (‑75%).
Cached input: from 1 ¥/M tokens to 0.025 ¥/M tokens (‑97.5%), i.e., 2.5 fen per million tokens (one fen is 0.01 ¥).
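The percentages in the list above can be checked with a few lines of arithmetic. The following is a minimal sketch, not an official calculator: the dictionary keys, the function names, and the sample workload are illustrative assumptions; only the six prices come from the article.

```python
# Prices in CNY per million tokens, copied from the list above.
OLD = {"uncached_input": 12.0, "output": 24.0, "cached_input": 1.0}
NEW = {"uncached_input": 3.0, "output": 6.0, "cached_input": 0.025}


def percent_cut(old: float, new: float) -> float:
    """Price reduction expressed as a percentage of the old price."""
    return (old - new) / old * 100


def workload_cost(prices: dict, tokens: dict) -> float:
    """Cost in CNY of a workload, given token counts per price category."""
    return sum(prices[kind] * count for kind, count in tokens.items()) / 1_000_000


# Hypothetical workload (illustrative numbers, not taken from the article).
workload = {
    "uncached_input": 2_000_000,
    "cached_input": 8_000_000,
    "output": 1_000_000,
}

for kind in OLD:
    print(f"{kind}: -{percent_cut(OLD[kind], NEW[kind]):.1f}%")

print(f"old cost: {workload_cost(OLD, workload):.2f} CNY")
print(f"new cost: {workload_cost(NEW, workload):.2f} CNY")
```

For this made-up workload the bill falls from about 56 CNY to about 12.2 CNY. The overall saving depends on the mix of cached input, uncached input, and output tokens, so real workloads will differ.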
These rates make V4‑Pro’s standard input price roughly 1/72 of GPT‑5.5 Pro’s. In scenarios with a high cache‑hit rate, enterprise usage costs can drop by more than 90%. OpenRouter data shows that V4‑Flash has ranked first in call volume since May.
The price cut does not sacrifice model capability. DeepSeek attributes the reduction to architectural optimisations: a hybrid‑attention design and multi‑token prediction lower per‑token floating‑point operations to 27% of the previous generation’s, and the KV cache shrinks to 10% of its predecessor’s size.
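The article does not say how the "more than 90%" figure for high cache‑hit scenarios was derived. One possible reading, sketched below, compares the pre‑cut and post‑cut price lists on input tokens only and ignores output tokens. The function names and the scan over hit rates are illustrative assumptions, not DeepSeek's method.

```python
from typing import Optional

# Input prices in CNY per million tokens, from the price list in the article.
OLD_UNCACHED, OLD_CACHED = 12.0, 1.0
NEW_UNCACHED, NEW_CACHED = 3.0, 0.025


def blended_input_price(uncached: float, cached: float, hit_rate: float) -> float:
    """Average input price when a fraction `hit_rate` of tokens hits the cache."""
    return (1 - hit_rate) * uncached + hit_rate * cached


def input_saving(hit_rate: float) -> float:
    """Fractional saving on input cost, new prices versus old, at one hit rate."""
    old = blended_input_price(OLD_UNCACHED, OLD_CACHED, hit_rate)
    new = blended_input_price(NEW_UNCACHED, NEW_CACHED, hit_rate)
    return 1 - new / old


def min_hit_rate_for_saving(target: float, steps: int = 1000) -> Optional[float]:
    """Smallest hit rate on a uniform grid whose saving reaches `target`."""
    for i in range(steps + 1):
        hit_rate = i / steps
        if input_saving(hit_rate) >= target:
            return hit_rate
    return None  # target is not reachable even with a 100% hit rate


for rate in (0.0, 0.5, 0.9, 0.99, 1.0):
    print(f"hit rate {rate:.0%}: input saving {input_saving(rate):.1%}")

print("hit rate needed for a 90% saving:", min_hit_rate_for_saving(0.90))
```

Under these assumptions the input‑side saving ranges from 75% with no cache hits to 97.5% when every token hits the cache, and it crosses 90% at a hit rate of roughly 96%. Output tokens are cut by only 75%, so output‑heavy workloads would see a smaller overall saving.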
DeepSeek has also completed deep integration with Huawei’s Ascend processors, providing a mature domestic compute ecosystem and supply‑chain support.
Gartner predicts that by 2030 large‑model inference costs will be more than 90% lower than in 2025. DeepSeek’s permanent price cut exemplifies this long‑term trend and disrupts the existing competitive balance, ushering in a K‑shaped divergence among base models. The reduction is expected to trigger a surge in total model calls, benefit cloud providers, accelerate the domestic AI hardware ecosystem, and enable a “token‑free” era in which long‑document analysis, code generation, and other token‑intensive workloads can run at scale.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
