DeepSeek‑V4 Inference Cost Showdown: NVIDIA H100 vs Ascend 950PR vs 910C
DeepSeek‑V4, a 1.6‑trillion‑parameter MoE model with mixed‑precision attention, is benchmarked on three accelerators—NVIDIA H100, Huawei Ascend 910C, and Ascend 950PR—showing that the 950PR delivers the lowest per‑token cost in both Prefill and Decode phases, while the H100 offers the highest raw performance at a far greater price.
1. Why DeepSeek‑V4 is demanding
DeepSeek‑V4 has a total parameter count of 1.6 trillion but uses a Mixture‑of‑Experts (MoE) architecture that activates only about 49 B parameters per token, so the model looks heavyweight on paper yet runs comparatively cheaply at inference time. It also incorporates CSA+HCA hybrid attention, which reduces inference FLOPs by 27 % compared with the previous V3.2 generation and cuts KV‑cache memory by 10 %, and it stores weights in FP4 while computing in FP8, further lowering memory footprint and compute requirements.
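To put the MoE numbers in perspective, here is a rough back‑of‑the‑envelope sketch, assuming the common rule of thumb of ~2 FLOPs per active parameter per generated token (attention and routing overhead are ignored, so treat the figures as order‑of‑magnitude only):

```python
# Rough per-token compute estimate for DeepSeek-V4 (illustrative only).
# Assumption: ~2 FLOPs per *active* parameter per token, a common rule of thumb
# for transformer forward passes; attention/routing overhead is ignored.

TOTAL_PARAMS  = 1.6e12   # 1.6 trillion total parameters
ACTIVE_PARAMS = 49e9     # ~49 B parameters activated per token (MoE routing)

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
flops_per_token = 2 * ACTIVE_PARAMS

print(f"Active fraction per token: {active_fraction:.1%}")            # ~3.1%
print(f"Approx. FLOPs per token:   {flops_per_token/1e9:.0f} GFLOPs")  # ~98 GFLOPs

# A dense 1.6 T model would need ~2 * 1.6e12 = 3.2 PFLOPs per token,
# i.e. roughly 33x more compute than the MoE configuration.
```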
2. The three contenders
NVIDIA H100 – the high‑price powerhouse
FP8 performance: 1979 TFLOPS (the strongest in the set)
Memory: 80 GB HBM3, bandwidth 3.35 TB/s
Price: $30 000–$40 000 per card; hourly rental $3.50
Position: industry leader, ecosystem‑rich, but extremely expensive and hard to procure
Huawei Ascend 910C – the awkward middle
FP8 support: none (explicitly not supported)
FP16 performance: roughly 800 TFLOPS
Memory: 64–96 GB, bandwidth varies
Price: cheap, but fundamentally mismatched with DeepSeek‑V4
Huawei Ascend 950PR – the cost‑effective dark horse
FP8 performance: 1000 TFLOPS (native support)
Memory: 128 GB HiBL 1.0 (largest in the group)
Interconnect bandwidth: 2 TB/s (more than double the H100's 900 GB/s NVLink)
Price: $13 700 per card; hourly rental $1.20
Position: domestically produced, optimized for MoE and long‑context workloads, and aggressively priced
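Putting the quoted specs side by side, a quick performance‑per‑dollar comparison can be sketched as below; the H100 is priced at the $35,000 midpoint of its quoted range (an assumption), and the 910C's FP16 figure is listed only for reference since it is not directly comparable to FP8 throughput:

```python
# Quick perf-per-dollar comparison from the figures quoted above (illustrative).
# Assumption: H100 priced at the $35,000 midpoint of the quoted $30k-$40k range;
# the 910C FP16 number is shown for reference but is not FP8-comparable.

cards = {
    #            peak TFLOPS, precision, price (USD)
    "H100":  (1979, "FP8",  35_000),
    "950PR": (1000, "FP8",  13_700),
    "910C":  ( 800, "FP16", None),   # per-card price not quoted in the article
}

for name, (tflops, prec, price) in cards.items():
    if price is None:
        print(f"{name:6s}: {tflops} TFLOPS ({prec}), price not quoted")
        continue
    gflops_per_dollar = tflops * 1000 / price
    print(f"{name:6s}: {tflops} TFLOPS ({prec}), {gflops_per_dollar:.0f} GFLOPS per dollar")

# H100 : ~57 GFLOPS/$    950PR: ~73 GFLOPS/$
# On raw FP8 compute per dollar the 950PR already comes out ahead,
# before its 128 GB memory and interconnect are considered.
```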
3. Real‑world benchmark: Prefill vs Decode
Prefill (reading the prompt) – compute‑bound
Token throughput:
H100: 70 700 tok/s (FP8 full‑throttle)
950PR: 35 700 tok/s (steady FP8)
910C: 28 600 tok/s (FP16 only)
Cost per million tokens (Prefill):
H100: $1.38
910C: $1.94 (41 % more than H100)
950PR: $0.93 (cheapest)
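One way to sanity‑check these figures is a simple rate‑over‑throughput cost model, sketched below. This is not necessarily the article's exact methodology, and the absolute dollar result depends on whether the quoted throughput and hourly rental refer to a single card or a multi‑card node, but the relative comparison between accelerators holds:

```python
# Simple cost model: dollars per million tokens = hourly cost / tokens produced per hour.
# Illustrative sketch -- the absolute result depends on whether the quoted
# throughput and rental rate refer to a single card or a multi-card node.

def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1e6

# Relative comparison using the quoted rental rates and prefill throughput:
h100 = cost_per_million_tokens(3.5, 70_700)
p950 = cost_per_million_tokens(1.2, 35_700)
print(f"950PR / H100 prefill cost ratio: {p950 / h100:.2f}")
# ~0.68, consistent with the $0.93 vs $1.38 per-million-token figures above.
```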
Decode (generating output) – memory‑bandwidth‑bound
Token throughput:
H100: 3 400 tok/s
950PR: 2 000 tok/s
910C: 1 200–3 600 tok/s (highly variable)
Cost per million tokens (Decode):
H100: $28.6
910C: $30.9 (more expensive)
950PR: $16.7 (42 % cheaper than H100)
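That decode is memory‑bandwidth‑bound can be checked with a rough roofline estimate. The sketch below assumes FP4 weights (~0.5 bytes per active parameter) and an illustrative concurrent batch of 32 sequences, and it ignores KV‑cache reads, so it is an order‑of‑magnitude check rather than a benchmark:

```python
# Rough roofline check for decode throughput (memory-bandwidth-bound regime).
# Assumptions (not from the article): FP4 weights at ~0.5 bytes per active
# parameter, 32 concurrent sequences, and KV-cache reads ignored.

ACTIVE_PARAMS   = 49e9   # active parameters per token
BYTES_PER_PARAM = 0.5    # FP4 weight storage
BATCH           = 32     # concurrent sequences sharing one weight read per step

def decode_tokens_per_second(hbm_bandwidth_bytes_per_s: float) -> float:
    weight_bytes_per_step = ACTIVE_PARAMS * BYTES_PER_PARAM   # ~24.5 GB read per step
    steps_per_second = hbm_bandwidth_bytes_per_s / weight_bytes_per_step
    return steps_per_second * BATCH                           # one token per sequence per step

print(f"H100 (3.35 TB/s): ~{decode_tokens_per_second(3.35e12):,.0f} tok/s")
# ~4,400 tok/s under these assumptions; the measured 3,400 tok/s is in the same
# ballpark, with KV-cache traffic and scheduling overhead accounting for the gap.
# Throughput here scales with memory bandwidth, not peak FLOPS, which is why
# decode is the bandwidth-bound phase.
```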
Conclusion from the two phases: the 950PR beats both competitors on cost, coming in roughly a third cheaper than the H100 in Prefill and more than 40 % cheaper in Decode.
4. Scaling to a 10 k QPS inference cluster
H100: requires 3 000 cards, procurement cost $105 million, annual operating cost $92 million
910C: requires 4 500 cards, procurement cost $112 million, annual operating cost $79 million
950PR: requires 3 500 cards, procurement cost $48 million, annual operating cost $37 million
Thus, the 950PR’s procurement cost is roughly half of the H100 and even less than half of the 910C, freeing a substantial budget for other operational expenses.
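The procurement figures follow directly from card count times unit price. A quick check, again assuming the H100 at the $35,000 midpoint of its quoted range (the 910C per‑card price is not quoted, so it is omitted):

```python
# Procurement cost check: cards needed for the 10k QPS cluster x per-card price.
# Assumption: H100 priced at the $35,000 midpoint of the quoted range;
# the 910C per-card price is not quoted in the article, so it is left out.

clusters = {
    "H100":  (3_000, 35_000),
    "950PR": (3_500, 13_700),
}

for name, (cards, unit_price) in clusters.items():
    total = cards * unit_price
    print(f"{name:6s}: {cards:,} cards x ${unit_price:,} = ${total/1e6:.0f} M")

# H100 : 3,000 x $35,000 = $105 M    950PR: 3,500 x $13,700 = $48 M
# -- matching the quoted procurement figures above.
```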
With DeepSeek‑V4’s API pricing of $12 per million input tokens and $24 per million output tokens, hardware costs on a 950PR deployment account for only 5 %–12 % of the total price, leaving the majority as profit margin.
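That margin claim can be sanity‑checked by blending the 950PR's prefill and decode costs against the API prices. The input:output token mix behind the 5 %–12 % figure is not stated, so the sketch below tries a few input‑heavy mixes typical of long‑context workloads:

```python
# Hardware cost as a share of API revenue on a 950PR deployment (illustrative).
# Assumption: the input:output token mix is not stated in the article, so a few
# input-heavy mixes typical of long-context workloads are tried here.

HW_COST_INPUT  = 0.93   # $ per million input tokens (950PR prefill)
HW_COST_OUTPUT = 16.7   # $ per million output tokens (950PR decode)
API_PRICE_IN   = 12.0   # $ per million input tokens
API_PRICE_OUT  = 24.0   # $ per million output tokens

for input_to_output_ratio in (10, 30, 100):
    hw_cost = HW_COST_INPUT * input_to_output_ratio + HW_COST_OUTPUT
    revenue = API_PRICE_IN * input_to_output_ratio + API_PRICE_OUT
    print(f"{input_to_output_ratio:3d}:1 input:output -> hardware is "
          f"{hw_cost / revenue:.0%} of revenue")

# 10:1 -> ~18%, 30:1 -> ~12%, 100:1 -> ~9%
# The more input-heavy (long-context) the traffic, the closer the hardware share
# falls toward the low range quoted above.
```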
5. Bottom line
Ascend 950PR: price ≈ 1/3 of H100, performance ≈ 50 %–60 % of H100, 128 GB memory, native FP8, high interconnect bandwidth – the clear cost‑performance champion.
NVIDIA H100: unmatched raw performance, but 1.5×–1.7× the per‑token inference cost of the 950PR and subject to export controls and supply risk; suited for well‑funded users who must stay in the CUDA ecosystem.
Ascend 910C: lacks FP8, slower, and ends up the most expensive per token; best relegated to training or non‑V4 workloads.
Overall, for DeepSeek‑V4 inference the Ascend 950PR provides the best value, delivering sufficient performance at a fraction of the cost of the high‑end H100.