DeepSeek Slashes Prices Permanently, Cutting Model Costs to Near‑Zero

In April‑May 2026 DeepSeek permanently reduced its V4‑Pro and V4‑Flash API prices by up to 97.5%, citing hybrid‑attention architecture and tighter KV cache, a move that reshapes large‑model pricing, drives massive cost savings, and signals a broader industry shift.

Architects' Tech Alliance

In April‑May 2026 DeepSeek announced that the V4‑Pro API price would be cut to one‑quarter of its original level and applied permanently; Xiaomi followed with a permanent reduction for its MiMoV2.5 series APIs.

The permanent revision covers the two main V4 variants: V4‑Pro (a 1.6‑trillion‑parameter MoE model with a million‑token context window) and V4‑Flash (a lightweight general‑purpose version). Pricing changes are:

Uncached input: from 12 ¥/M tokens to 3 ¥/M tokens (‑75%).

Inference output: from 24 ¥/M tokens to 6 ¥/M tokens (‑75%).

Cached input: from 1 ¥/M tokens to 0.025 ¥/M tokens (‑97.5%), i.e., 2.5 fen per million tokens.

These rates make V4‑Pro’s standard input price roughly 1/72 of GPT‑5.5 Pro. Because cached input falls by 97.5% while uncached input and output fall by 75%, savings grow with the cache‑hit ratio: in high cache‑hit scenarios enterprise usage costs can drop by more than 90%.

OpenRouter data shows that V4‑Flash has ranked first in call volume since May.

The price cut does not come at the expense of model capability. DeepSeek attributes the reduction to architectural optimisations: a hybrid‑attention design and multi‑token prediction lower per‑token floating‑point operations to 27% of the previous generation’s, and the KV cache shrinks to 10% of its predecessor’s size.
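To make the cache‑hit claim above concrete, the following is a minimal sketch that blends the listed per‑million‑token prices into a single workload cost. It assumes the old and new prices quoted in this article, the same cache‑hit ratio under both price schedules, and billing that is linear in tokens; the names `Prices`, `workload_cost`, and `saving_fraction` are hypothetical and not part of any DeepSeek SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Prices in yuan per million tokens, as quoted in this article."""
    uncached_input: float
    cached_input: float
    output: float


# Assumption: these are the only billed components.
OLD = Prices(uncached_input=12.0, cached_input=1.0, output=24.0)
NEW = Prices(uncached_input=3.0, cached_input=0.025, output=6.0)


def workload_cost(prices: Prices, input_m: float, output_m: float,
                  cache_hit_ratio: float) -> float:
    """Cost in yuan for input_m / output_m million tokens.

    cache_hit_ratio is the fraction of input tokens billed at the
    cached rate (a hypothetical workload parameter, 0.0 to 1.0).
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    cached_m = input_m * cache_hit_ratio
    uncached_m = input_m - cached_m
    return (uncached_m * prices.uncached_input
            + cached_m * prices.cached_input
            + output_m * prices.output)


def saving_fraction(input_m: float, output_m: float,
                    cache_hit_ratio: float) -> float:
    """Fraction of the old bill saved under the new prices."""
    old = workload_cost(OLD, input_m, output_m, cache_hit_ratio)
    new = workload_cost(NEW, input_m, output_m, cache_hit_ratio)
    if old == 0:
        raise ValueError("workload has no billable tokens")
    return 1.0 - new / old


if __name__ == "__main__":
    # Input-only workload of 100M tokens at several cache-hit ratios.
    for ratio in (0.0, 0.5, 0.96, 1.0):
        print(f"hit ratio {ratio:.2f}: saving {saving_fraction(100, 0, ratio):.1%}")
```

Under these assumptions, an input‑only workload saves 75% with no cache hits, reaches 90% at a hit ratio of about 0.96, and tops out at 97.5% when every input token is cached. Output tokens are always discounted by 75%, so output‑heavy workloads stay closer to that figure regardless of caching.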

DeepSeek has also completed deep integration with Huawei’s Ascend processors, providing a mature domestic compute ecosystem and supply‑chain support.

Gartner predicts that by 2030 large‑model inference costs will be more than 90% lower than in 2025. DeepSeek’s permanent price cut exemplifies this long‑term trend and disrupts the existing competitive balance, ushering in a K‑shaped divergence among base models. The reduction is expected to trigger a surge in total model calls, benefit cloud providers, and accelerate the domestic AI hardware ecosystem. It also points toward a “token‑free” era in which token cost is no longer the limiting factor, so that long‑document analysis, code generation, and other token‑intensive workloads can run at scale.

