Why DeepSeek V3 Achieves Low Training Costs: Inside Its AI Innovations
This article provides a comprehensive analysis of DeepSeek's large‑language‑model technology, covering the company's background, model capabilities, remarkably low training and inference costs, and the core architectural and algorithmic innovations such as MoE, MLA attention, FP8 mixed‑precision, and the DualPipe pipeline that enable efficient large‑scale AI deployment.
1. About DeepSeek Company and Its Large Model
1.1 Company Overview
DeepSeek was founded in Hangzhou in July 2023 as a subsidiary of the quantitative fund High-Flyer (Huanfang Quant); its registered name is Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd.
1.2 Model Capabilities
DeepSeek benchmarks its models against Qwen at home and against Llama and GPT-4o abroad. According to the published benchmark results, DeepSeek-V3 ranks first among open-source models and rivals the most advanced closed-source models.
1.3 Training and Inference Cost
Inference cost (API pricing): 1 CNY per million input tokens.
Training cost: DeepSeek trained V3 on a cluster of 2,048 NVIDIA H800 GPUs, with total cost not exceeding 6 million USD.
1. Pre-training phase: each trillion tokens required 180 K H800 GPU-hours, about 3.7 days on the 2,048-GPU cluster.
2. Full pre-training consumed 2,664 K GPU-hours (just under two months); context-length extension and fine-tuning bring the total to roughly 2,788 K GPU-hours.
3. At an assumed rental rate of $2 per H800 GPU-hour, that works out to about $5.58 M, below 6 million USD.
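A quick back-of-the-envelope check of those figures (the $2 per GPU-hour rate is the report's assumption, not a market quote):

```python
# Sanity-check the training-cost arithmetic quoted above.
gpus = 2048                      # H800 cluster size
pretrain_gpu_hours = 2_664_000   # full pre-training
total_gpu_hours = 2_788_000      # + context extension and fine-tuning
rate_usd = 2.0                   # assumed $ per H800 GPU-hour

print(pretrain_gpu_hours / gpus / 24)    # ~54 days of pre-training
print(total_gpu_hours * rate_usd / 1e6)  # ~5.58 M USD, under the 6 M ceiling
```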
2. DeepSeek Training and Inference Core Technologies
2.1 DeepSeek‑V3 Model Network Architecture
DeepSeek-V3 was pre-trained on 14.8 trillion high-quality tokens, followed by SFT and RL. The model has 671 B total parameters, of which only 37 B are activated per token. For efficient inference and training, DeepSeek-V3 adopts the MLA attention mechanism and the DeepSeekMoE architecture with an auxiliary-loss-free load-balancing strategy.
2.1.1 DeepSeekMoE
The MoE layer replaces the traditional Feed‑Forward Network (FFN) in a Transformer. It consists of a gating network and several expert sub‑networks (e.g., 8 experts). Experts are typically FFNs but can be more complex, allowing hierarchical MoE structures.
● Sparse MoE layer: replaces the standard FFN with multiple expert networks.
● Gating network: decides which tokens are routed to which experts.
Compared with classic MoE, DeepSeekMoE uses finer-grained experts and isolates some of them as shared experts to reduce knowledge redundancy. For routing, DeepSeek-V3 computes token-to-expert affinity scores with a sigmoid (rather than a softmax), selects the top-K scores per token, and normalizes the selected scores to produce the gating values, as sketched below.
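A minimal PyTorch sketch of this style of router (illustrative only: the name SigmoidTopKRouter is mine, and the shared experts and V3's bias-based, auxiliary-loss-free load balancing are omitted):

```python
import torch
import torch.nn as nn

class SigmoidTopKRouter(nn.Module):
    """Toy top-K router: sigmoid affinities, top-K selection, renormalized gates."""
    def __init__(self, d_model: int, n_experts: int, top_k: int):
        super().__init__()
        self.affinity = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, h: torch.Tensor):
        scores = torch.sigmoid(self.affinity(h))              # (tokens, n_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1) # keep the K best experts
        gates = top_scores / top_scores.sum(-1, keepdim=True) # normalize selected scores
        return gates, top_idx                                 # weights + expert ids

router = SigmoidTopKRouter(d_model=16, n_experts=8, top_k=2)
gates, idx = router(torch.randn(4, 16))                       # 4 tokens
print(gates.shape, idx.shape)                                 # (4, 2), (4, 2)
```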
2.1.2 MLA Multi‑Head Latent Attention
Standard Transformer MHA generates large KV caches, limiting inference efficiency. MLA compresses keys and values jointly via low‑rank projection, drastically reducing KV cache size while maintaining performance.
Low‑rank compression process:
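In the notation of the DeepSeek-V2/V3 reports (a sketch; the decoupled RoPE key path is omitted), the attention input h_t is down-projected to a small latent c_t^KV, and the keys and values are reconstructed from it, so only the latent needs to be cached:

```latex
\begin{aligned}
c_t^{KV} &= W^{DKV} h_t       && \text{joint low-rank latent, } d_c \ll d_h n_h \\
k_t^{C}  &= W^{UK}  c_t^{KV}  && \text{keys up-projected from the latent} \\
v_t^{C}  &= W^{UV}  c_t^{KV}  && \text{values up-projected from the latent}
\end{aligned}
```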
2.2 Training and Inference Core Techniques
2.2.1 Training Framework HAI‑LLM
DeepSeek‑V3 was trained on a cluster with 2 048 NVIDIA H800 GPUs using the proprietary HAI‑LLM framework, which supports four parallelism strategies: ZeRO data parallelism, pipeline parallelism, tensor‑slice model parallelism, and sequence parallelism.
2.2.2 DualPipe Innovative Pipeline Parallel Algorithm
DeepSeek-V3 employs 16-way pipeline parallelism, 64-way expert parallelism spanning 8 nodes, and ZeRO-1 data parallelism. DualPipe reduces pipeline bubbles and overlaps forward/backward computation with communication, mitigating the heavy all-to-all overhead of expert parallelism.
Key idea: split each pipeline chunk into four components (attention, all-to-all dispatch, MLP, all-to-all combine) and pair chunks so that the compute phases of one overlap the communication phases of the other.
Example: while block A performs forward computation, block B can simultaneously handle backward communication.
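A toy Python rendering of that pairing (purely illustrative: in each slot one chunk computes while the other runs an all-to-all, which is the property the real, far more intricate DualPipe schedule exploits):

```python
# Pair the four forward components of chunk A with the four backward
# components of chunk B; each slot overlaps one compute with one comm phase.
FWD = [("attention", "compute"), ("all-to-all dispatch", "comm"),
       ("MLP", "compute"), ("all-to-all combine", "comm")]
BWD = [("all-to-all combine (bwd)", "comm"), ("MLP (bwd)", "compute"),
       ("all-to-all dispatch (bwd)", "comm"), ("attention (bwd)", "compute")]

for (a, a_kind), (b, b_kind) in zip(FWD, BWD):
    assert {a_kind, b_kind} == {"compute", "comm"}  # comm always hidden by compute
    print(f"A: {a:<25} ({a_kind:7}) || B: {b:<25} ({b_kind})")
```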
2.2.3 Mixed‑Precision FP8 Training Framework
Most core matrix-multiply (GEMM) operations run in FP8, while precision-sensitive operations retain higher precision (BF16 or FP32), balancing efficiency and numerical stability:
● Core GEMMs in FP8.
● Higher precision for the embedding layer, output head, MoE gating, normalization, and attention operators.
Fine‑grained quantization: activations use group‑wise 1×128 quantization; weights use 128×128 block‑wise quantization.
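A minimal PyTorch sketch of the two granularities (an illustration, not DeepSeek's CUDA kernels; assumes PyTorch >= 2.1 for the float8_e4m3fn dtype, whose largest normal value, 448, sets the scale):

```python
import torch

FP8_MAX = 448.0  # largest normal magnitude of float8_e4m3fn

def quant_activation_1x128(x: torch.Tensor):
    """Group-wise 1x128 quantization along the last dimension."""
    g = x.reshape(-1, 128)
    scale = g.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / FP8_MAX
    return (g / scale).to(torch.float8_e4m3fn).reshape(x.shape), scale

def quant_weight_128x128(w: torch.Tensor):
    """Block-wise 128x128 quantization over tiles of the weight matrix."""
    o, i = w.shape
    blocks = w.reshape(o // 128, 128, i // 128, 128)
    scale = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp_min(1e-12) / FP8_MAX
    return (blocks / scale).to(torch.float8_e4m3fn).reshape(o, i), scale

xq, xs = quant_activation_1x128(torch.randn(4, 256))   # activations
wq, ws = quant_weight_128x128(torch.randn(256, 256))   # weights
# The scales (xs, ws) stay in FP32 and are reapplied when accumulating the GEMM.
```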
2.2.4 Multi‑Token Prediction (MTP) Training Objective
DeepSeek-V3 is trained with a multi-token prediction (MTP) objective, which improves performance on most benchmarks, and the extra prediction modules can be repurposed for speculative decoding to accelerate inference.
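A toy version of the objective (my own simplification with a single extra prediction head and an illustrative weight lam; V3's MTP modules are sequential Transformer blocks rather than plain linear heads):

```python
import torch
import torch.nn.functional as F

def mtp_loss(hidden, main_head, mtp_head, tokens, lam=0.3):
    """hidden: (B, T, d) final states; tokens: (B, T) input ids.
    The main head predicts token t+1; the extra head predicts token t+2."""
    logits1 = main_head(hidden[:, :-1])   # (B, T-1, V), targets tokens[:, 1:]
    logits2 = mtp_head(hidden[:, :-2])    # (B, T-2, V), targets tokens[:, 2:]
    loss1 = F.cross_entropy(logits1.transpose(1, 2), tokens[:, 1:])
    loss2 = F.cross_entropy(logits2.transpose(1, 2), tokens[:, 2:])
    return loss1 + lam * loss2            # auxiliary MTP loss is down-weighted

B, T, d, V = 2, 16, 32, 100
print(mtp_loss(torch.randn(B, T, d), torch.nn.Linear(d, V),
               torch.nn.Linear(d, V), torch.randint(V, (B, T))))
```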
2.2.5 Inference Deployment Scheme
With 671 B total parameters, DeepSeek-V3 splits inference into two stages, prefill (prompt processing) and decode (autoregressive token generation), each deployed across multiple machines with its own parallelism layout.
Prefill stage: a 4-node unit with 32 GPUs in total; attention uses 4-way tensor parallelism (TP4) plus sequence parallelism (SP) and 8-way data parallelism (DP8), while the MoE layers use 32-way expert parallelism (EP32).
Decode stage: a 40-node unit with 320 GPUs; attention uses TP4 plus SP and 80-way data parallelism (DP80), while the MoE layers use 320-way expert parallelism (EP320), so each GPU hosts a single expert, with a subset of GPUs additionally hosting redundant and shared experts.
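The parallel degrees multiply out to the quoted unit sizes; a quick check (8 GPUs per H800 node is my assumption, consistent with the totals above):

```python
# GPUs per deployment unit (8 GPUs per node assumed).
prefill_gpus = 4 * 8                 # 4-node prefill unit -> 32 GPUs
decode_gpus = 40 * 8                 # 40-node decode unit -> 320 GPUs

assert prefill_gpus == 4 * 8         # attention: TP4 x DP8 spans the unit
assert prefill_gpus == 32            # MoE: EP32, one expert-parallel rank per GPU
assert decode_gpus == 4 * 80         # attention: TP4 x DP80 spans the unit
assert decode_gpus == 320            # MoE: EP320, roughly one expert per GPU
print(prefill_gpus, decode_gpus)     # 32 320
```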
3. Why DeepSeek V3 Training Cost Is So Low
The low cost stems from three main innovations:
MLA mechanism: joint low-rank compression of keys and values dramatically reduces KV-cache memory usage.
FP8 training: Mixed‑precision FP8 cuts GPU memory and compute overhead while maintaining accuracy.
MoE architecture: Sparse activation reduces overall FLOPs, and custom communication optimizations mitigate expert‑parallel overhead.
These techniques together enable DeepSeek to train a 671 B‑parameter model for under 6 million USD.
4. Why DeepSeek?
DeepSeek demonstrates that a Chinese AI startup can match or surpass leading Western models, highlighting strong talent organization, solid engineering fundamentals, and a focus on research over immediate commercialization.
5. Personal Reflections
Future AI‑specific chips may be designed for Transformer architectures, similar to ASICs for convolutions.
Multi‑token prediction and MoE will remain hot research topics for large models.
In China, AI applications often outpace fundamental research, but the gap with overseas research is narrowing.
Hardware‑software co‑design, as shown by DeepSeek, will accelerate low‑cost AI iteration industry‑wide.
The field is developing so rapidly that many details deserve deeper study than this article can offer; any errors are inevitable and my own.
References
Better & Faster Large Language Models via Multi-token Prediction: https://arxiv.org/pdf/2404.19737v1
DeepSeek-V3 Technical Report: https://arxiv.org/pdf/2412.19437
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model: https://arxiv.org/pdf/2405.04434
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism: https://arxiv.org/pdf/1811.06965
Su Jianlin's blog (kexue.fm): https://kexue.fm/archives/10091
Zhihu discussion: https://www.zhihu.com/question/8423473404
