DeepSeek‑V4 Inference Cost Showdown: NVIDIA H100 vs Ascend 950PR vs 910C
DeepSeek‑V4, a 1.6‑trillion‑parameter MoE model with mixed‑precision attention, is benchmarked on three accelerators—NVIDIA H100, Huawei Ascend 910C, and Ascend 950PR—showing that the 950PR delivers the lowest per‑token cost in both Prefill and Decode phases, while the H100 offers the highest raw performance at a far greater price.
1. Why DeepSeek‑V4 is demanding
DeepSeek‑V4 has a total parameter count of 1.6 trillion but uses a Mixture‑of‑Experts (MoE) architecture that activates only about 49 B parameters per token, so the model looks heavyweight on paper yet runs comparatively cheaply at inference time. It also incorporates CSA+HCA hybrid attention, which reduces inference FLOPs by 27 % compared with the previous V3.2 generation and cuts KV‑cache memory by 10 %, and it stores weights in FP4 while computing in FP8, further lowering memory footprint and compute requirements.
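To put the MoE numbers in perspective, here is a rough back‑of‑the‑envelope sketch, assuming the common rule of thumb of ~2 FLOPs per active parameter per generated token (attention and routing overhead are ignored, so treat the figures as order‑of‑magnitude only):

```python
# Rough per-token compute estimate for DeepSeek-V4 (illustrative only).
# Assumption: ~2 FLOPs per *active* parameter per token, a common rule of thumb
# for transformer forward passes; attention/routing overhead is ignored.

TOTAL_PARAMS  = 1.6e12   # 1.6 trillion total parameters
ACTIVE_PARAMS = 49e9     # ~49 B parameters activated per token (MoE routing)

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
flops_per_token = 2 * ACTIVE_PARAMS

print(f"Active fraction per token: {active_fraction:.1%}")            # ~3.1%
print(f"Approx. FLOPs per token:   {flops_per_token/1e9:.0f} GFLOPs")  # ~98 GFLOPs

# A dense 1.6 T model would need ~2 * 1.6e12 = 3.2 PFLOPs per token,
# i.e. roughly 33x more compute than the MoE configuration.
```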
2. The three contenders
NVIDIA H100 – the high‑price powerhouse
FP8 performance: 1979 TFLOPS (the strongest in the set)
Memory: 80 GB HBM3, bandwidth 3.35 TB/s
Price: $30 000–$40 000 per card; hourly rental $3.50
Position: industry leader, ecosystem‑rich, but extremely expensive and hard to procure
Huawei Ascend 910C – the awkward middle
FP8 support: none (explicitly not supported)
FP16 performance: roughly 800 TFLOPS
Memory: 64–96 GB, bandwidth varies
Price: cheap, but fundamentally mismatched with DeepSeek‑V4
Huawei Ascend 950PR – the cost‑effective dark horse
FP8 performance: 1000 TFLOPS (native support)
Memory: 128 GB HiBL 1.0 (largest in the group)
Interconnect bandwidth: 2 TB/s (more than double the H100's 900 GB/s NVLink)
Price: $13 700 per card; hourly rental $1.20
Position: domestically produced, optimized for MoE and long‑context workloads, and aggressively priced
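Putting the quoted specs side by side, a quick performance‑per‑dollar comparison can be sketched as below; the H100 is priced at the $35,000 midpoint of its quoted range (an assumption), and the 910C's FP16 figure is listed only for reference since it is not directly comparable to FP8 throughput:

```python
# Quick perf-per-dollar comparison from the figures quoted above (illustrative).
# Assumption: H100 priced at the $35,000 midpoint of the quoted $30k-$40k range;
# the 910C FP16 number is shown for reference but is not FP8-comparable.

cards = {
    #            peak TFLOPS, precision, price (USD)
    "H100":  (1979, "FP8",  35_000),
    "950PR": (1000, "FP8",  13_700),
    "910C":  ( 800, "FP16", None),   # per-card price not quoted in the article
}

for name, (tflops, prec, price) in cards.items():
    if price is None:
        print(f"{name:6s}: {tflops} TFLOPS ({prec}), price not quoted")
        continue
    gflops_per_dollar = tflops * 1000 / price
    print(f"{name:6s}: {tflops} TFLOPS ({prec}), {gflops_per_dollar:.0f} GFLOPS per dollar")

# H100 : ~57 GFLOPS/$    950PR: ~73 GFLOPS/$
# On raw FP8 compute per dollar the 950PR already comes out ahead,
# before its 128 GB memory and interconnect are considered.
```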
3. Real‑world benchmark: Prefill vs Decode
Prefill (reading the prompt) – compute‑bound
Token throughput:
H100: 70 700 tok/s (FP8 full‑throttle)
950PR: 35 700 tok/s (steady FP8)
910C: 28 600 tok/s (FP16 only)
Cost per million tokens (Prefill):
H100: $1.38
910C: $1.94 (41 % more than H100)
950PR: $0.93 (cheapest)
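One way to sanity‑check these figures is a simple rate‑over‑throughput cost model, sketched below. This is not necessarily the article's exact methodology, and the absolute dollar result depends on whether the quoted throughput and hourly rental refer to a single card or a multi‑card node, but the relative comparison between accelerators holds:

```python
# Simple cost model: dollars per million tokens = hourly cost / tokens produced per hour.
# Illustrative sketch -- the absolute result depends on whether the quoted
# throughput and rental rate refer to a single card or a multi-card node.

def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate_usd / tokens_per_hour * 1e6

# Relative comparison using the quoted rental rates and prefill throughput:
h100 = cost_per_million_tokens(3.5, 70_700)
p950 = cost_per_million_tokens(1.2, 35_700)
print(f"950PR / H100 prefill cost ratio: {p950 / h100:.2f}")
# ~0.68, consistent with the $0.93 vs $1.38 per-million-token figures above.
```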
Decode (generating output) – memory‑bandwidth‑bound
Token throughput:
H100: 3 400 tok/s
950PR: 2 000 tok/s
910C: 1 200–3 600 tok/s (highly variable)
Cost per million tokens (Decode):
H100: $28.6
910C: $30.9 (more expensive)
950PR: $16.7 (42 % cheaper than H100)
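That decode is memory‑bandwidth‑bound can be checked with a rough roofline estimate. The sketch below assumes FP4 weights (~0.5 bytes per active parameter) and an illustrative concurrent batch of 32 sequences, and it ignores KV‑cache reads, so it is an order‑of‑magnitude check rather than a benchmark:

```python
# Rough roofline check for decode throughput (memory-bandwidth-bound regime).
# Assumptions (not from the article): FP4 weights at ~0.5 bytes per active
# parameter, 32 concurrent sequences, and KV-cache reads ignored.

ACTIVE_PARAMS   = 49e9   # active parameters per token
BYTES_PER_PARAM = 0.5    # FP4 weight storage
BATCH           = 32     # concurrent sequences sharing one weight read per step

def decode_tokens_per_second(hbm_bandwidth_bytes_per_s: float) -> float:
    weight_bytes_per_step = ACTIVE_PARAMS * BYTES_PER_PARAM   # ~24.5 GB read per step
    steps_per_second = hbm_bandwidth_bytes_per_s / weight_bytes_per_step
    return steps_per_second * BATCH                           # one token per sequence per step

print(f"H100 (3.35 TB/s): ~{decode_tokens_per_second(3.35e12):,.0f} tok/s")
# ~4,400 tok/s under these assumptions; the measured 3,400 tok/s is in the same
# ballpark, with KV-cache traffic and scheduling overhead accounting for the gap.
# Throughput here scales with memory bandwidth, not peak FLOPS, which is why
# decode is the bandwidth-bound phase.
```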
Conclusion from the two phases: the 950PR beats both competitors on cost, coming in roughly a third cheaper than the H100 in Prefill and more than 40 % cheaper in Decode.
4. Scaling to a 10 k QPS inference cluster
H100: requires 3 000 cards, procurement cost $105 million, annual operating cost $92 million
910C: requires 4 500 cards, procurement cost $112 million, annual operating cost $79 million
950PR: requires 3 500 cards, procurement cost $48 million, annual operating cost $37 million
Thus, the 950PR’s procurement cost is roughly half of the H100 and even less than half of the 910C, freeing a substantial budget for other operational expenses.
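The procurement figures follow directly from card count times unit price. A quick check, again assuming the H100 at the $35,000 midpoint of its quoted range (the 910C per‑card price is not quoted, so it is omitted):

```python
# Procurement cost check: cards needed for the 10k QPS cluster x per-card price.
# Assumption: H100 priced at the $35,000 midpoint of the quoted range;
# the 910C per-card price is not quoted in the article, so it is left out.

clusters = {
    "H100":  (3_000, 35_000),
    "950PR": (3_500, 13_700),
}

for name, (cards, unit_price) in clusters.items():
    total = cards * unit_price
    print(f"{name:6s}: {cards:,} cards x ${unit_price:,} = ${total/1e6:.0f} M")

# H100 : 3,000 x $35,000 = $105 M    950PR: 3,500 x $13,700 = $48 M
# -- matching the quoted procurement figures above.
```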
With DeepSeek‑V4’s API pricing of $12 per million input tokens and $24 per million output tokens, hardware costs on a 950PR deployment account for only 5 %–12 % of the total price, leaving the majority as profit margin.
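That margin claim can be sanity‑checked by blending the 950PR's prefill and decode costs against the API prices. The input:output token mix behind the 5 %–12 % figure is not stated, so the sketch below tries a few input‑heavy mixes typical of long‑context workloads:

```python
# Hardware cost as a share of API revenue on a 950PR deployment (illustrative).
# Assumption: the input:output token mix is not stated in the article, so a few
# input-heavy mixes typical of long-context workloads are tried here.

HW_COST_INPUT  = 0.93   # $ per million input tokens (950PR prefill)
HW_COST_OUTPUT = 16.7   # $ per million output tokens (950PR decode)
API_PRICE_IN   = 12.0   # $ per million input tokens
API_PRICE_OUT  = 24.0   # $ per million output tokens

for input_to_output_ratio in (10, 30, 100):
    hw_cost = HW_COST_INPUT * input_to_output_ratio + HW_COST_OUTPUT
    revenue = API_PRICE_IN * input_to_output_ratio + API_PRICE_OUT
    print(f"{input_to_output_ratio:3d}:1 input:output -> hardware is "
          f"{hw_cost / revenue:.0%} of revenue")

# 10:1 -> ~18%, 30:1 -> ~12%, 100:1 -> ~9%
# The more input-heavy (long-context) the traffic, the closer the hardware share
# falls toward the low range quoted above.
```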
5. Bottom line
Ascend 950PR: price ≈ 1/3 of H100, performance ≈ 50 %–60 % of H100, 128 GB memory, native FP8, high interconnect bandwidth – the clear cost‑performance champion.
NVIDIA H100: unmatched raw performance, but 1.5×–1.7× the per‑token inference cost of the 950PR and subject to export controls and supply risk; suited for well‑funded users who must stay in the CUDA ecosystem.
Ascend 910C: lacks FP8, slower, and ends up the most expensive per token; best relegated to training or non‑V4 workloads.
Overall, for DeepSeek‑V4 inference the Ascend 950PR provides the best value, delivering sufficient performance at a fraction of the cost of the high‑end H100.