Why DeepSeek R1 Rivals ChatGPT o1: Architecture, Training, and Cost Insights
This article provides a detailed technical analysis of DeepSeek's R1 large language model, covering its background, architecture, training methods, hardware optimizations, performance claims, user impressions, deployment options, and the challenges of reproducing its results.
Introduction
On 20 January 2025 DeepSeek announced the R1 large language model (LLM), putting its training cost at under $6 M and claiming performance comparable to OpenAI’s ChatGPT o1. The model weights and inference code are publicly available, while the training code and hardware‑optimisation code remain closed.
Model Families and Releases
DeepSeek released two model families:
V3 – the third‑generation general‑purpose LLM.
R1 – an inference‑tuned version built on V3‑Base.
R1 additionally ships as distilled dense models based on Meta’s Llama and Alibaba’s Qwen architectures, small enough to run on consumer‑grade hardware. Weights and inference code are hosted on Hugging Face and GitHub.
Architecture of V3‑Base
V3‑Base employs a Mixture‑of‑Experts (MoE) design similar to Mixtral but with higher efficiency:
671 billion total parameters, of which roughly 37 billion are activated per token.
FP8 mixed‑precision training.
128 K context window.
Trained on 14.8 trillion tokens.
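The MoE idea behind these numbers — only a few experts fire per token, so active compute is far below total parameter count — can be sketched in a few lines. Everything below (names, sizes, the toy experts) is illustrative only; DeepSeek’s actual router adds shared experts and load‑balancing machinery:

```python
import numpy as np

def moe_forward(x, gate_W, experts, top_k=2):
    """Route each token to its top-k experts and mix their outputs.

    Sketch only: real MoE layers (Mixtral, DeepSeek-V3) use fused kernels,
    shared experts, and load-balancing terms.  All names here are made up.
    """
    logits = x @ gate_W                            # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of top-k experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = top[t]
        weights = np.exp(logits[t, sel])
        weights /= weights.sum()                   # softmax over selected experts
        for w, e in zip(weights, sel):
            out[t] += w * experts[e](x[t])         # weighted sum of expert outputs
    return out

# Toy usage: 4 tokens of dim 8, 4 experts that just rescale their input.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
gate_W = rng.standard_normal((8, 4))
experts = [lambda v, s=s: v * s for s in (0.5, 1.0, 1.5, 2.0)]
y = moe_forward(x, gate_W, experts)
```

With `top_k=2`, each token touches only 2 of the 4 experts — the same reason V3’s per‑token compute is a small fraction of its 671 B parameters.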
Training was performed on NVIDIA H800 GPUs (80 GB memory, ~400 GB/s NVLink bandwidth) using several custom optimisations:
FP8 mixed‑precision to reduce memory footprint.
Custom cross‑node all‑to‑all kernels for efficient InfiniBand/NVLink utilisation.
DualPipe pipeline‑parallel algorithm that minimises pipeline bubbles and overlaps communication with computation.
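The FP8 point is easy to build intuition for with a quick simulation. The sketch below fakes per‑tensor E4M3 quantisation in NumPy; DeepSeek’s real recipe uses fine‑grained block‑wise scaling and keeps sensitive operations in higher precision, and the function name is mine:

```python
import numpy as np

def quantize_fp8_e4m3(x):
    """Simulate per-tensor FP8 (E4M3) quantisation with a dynamic scale.

    A sketch of the idea only -- not DeepSeek's actual kernels, which use
    tile/block-wise scales and high-precision accumulation.
    """
    FP8_MAX = 448.0                                # largest normal E4M3 value
    scale = FP8_MAX / max(np.abs(x).max(), 1e-12)  # stretch tensor to FP8 range
    x_scaled = np.clip(x * scale, -FP8_MAX, FP8_MAX)
    # Emulate a 3-bit mantissa: snap to the nearest representable step
    # within each power-of-two bucket.
    exp = np.floor(np.log2(np.maximum(np.abs(x_scaled), 2.0 ** -6)))
    step = 2.0 ** (exp - 3)
    x_q = np.round(x_scaled / step) * step
    return x_q / scale, scale                      # dequantised values + scale

x = np.random.default_rng(1).standard_normal(1024).astype(np.float32)
x_dq, s = quantize_fp8_e4m3(x)
rel_err = np.abs(x - x_dq).mean() / np.abs(x).mean()
# rel_err lands at a few percent -- the price paid for halving memory traffic.
```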
R1 Construction
R1 is derived from V3‑Base through:
Supervised fine‑tuning (SFT).
Reinforcement learning (RL) via Group‑Relative Policy Optimisation (GRPO), driven largely by rule‑based rewards (accuracy and format checks).
Long‑chain reasoning mode during inference.
Subsequent distillation into smaller dense models (e.g., 32‑B version runnable via ollama run deepseek-r1:32b).
Training Cost Estimates
DeepSeek reports a training cost of about $5.58 M, derived from the V3 technical report’s figure of 2.788 million H800 GPU‑hours at roughly $2 per GPU‑hour. This covers only the final training run of V3‑Base; the cumulative cost for R1, which builds on V3‑Base (and excludes earlier research, ablations, and data work), is expected to be higher. Reported hardware configurations vary, with mentions of up to 50 000 A100‑class GPUs (compared to the roughly 25 000 A100s attributed to OpenAI for GPT‑4). At a market rate of $1.35 per GPU‑hour, 50 000 GPUs would cost roughly $1.35 × 24 × 7 × 50 000 ≈ $11.3 M per week.
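A quick sanity check of the arithmetic in this section, using only the numbers cited above:

```python
# Cost of the reported V3 final run: GPU-hours x rental rate.
gpu_hours_v3 = 2.788e6   # H800 GPU-hours reported for the V3 run
rate_v3 = 2.0            # $/GPU-hour assumed in the V3 technical report
print(f"V3 run: ${gpu_hours_v3 * rate_v3 / 1e6:.2f} M")            # -> $5.58 M

# Weekly burn of a 50k-GPU cluster at the market rate cited above.
gpus = 50_000
rate_market = 1.35       # $/GPU-hour
weekly = gpus * rate_market * 24 * 7
print(f"50k-GPU cluster, one week: ${weekly / 1e6:.2f} M")         # -> $11.34 M
```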
Unique Engineering Choices
Hardware‑Aware Optimisation
Because the export‑restricted H800 offers roughly half the NVLink inter‑GPU bandwidth of the H100, DeepSeek combined FP8 training, custom communication kernels, and the DualPipe algorithm to achieve high training efficiency without resorting to expensive tensor‑parallelism.
Reinforcement Learning (GRPO)
GRPO estimates advantages by normalising rewards within a sampled group instead of training a separate critic model, cutting RL memory overhead. Combined with rule‑based rewards, it improves performance on objectively verifiable tasks (e.g., coding, mathematics) but shows limited gains on subjective or open‑ended reasoning.
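The core of GRPO’s memory saving is that the advantage signal comes from normalising rewards within a sampled group, so no value network is needed. A minimal sketch (function name and toy rewards are mine):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantage estimate used in GRPO (sketch).

    Sample a group of responses for one prompt, score each with a
    rule-based reward, and normalise within the group -- no learned
    critic/value model is required.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Toy group: 4 sampled answers to one maths problem;
# rule-based reward = 1 if the final answer checks out, else 0.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# Correct answers receive positive advantage, incorrect ones negative;
# these advantages then weight a clipped policy-gradient update.
```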
Multi‑Head Latent Attention (MLA)
Introduced in the V2 paper, MLA is a variant of multi‑head attention that compresses keys and values into a low‑rank latent vector, sharply shrinking the KV cache at inference time while maintaining quality — a complement to MoE scaling.
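Conceptually, MLA caches one small latent per token and reconstructs keys and values from it on the fly. The sketch below shows only that low‑rank KV path; it omits RoPE decoupling and multi‑head splitting, and the weight names are illustrative, not DeepSeek’s:

```python
import numpy as np

def mla_kv(x, W_down, W_up_k, W_up_v):
    """Multi-Head Latent Attention KV path (simplified sketch).

    Keys and values are reconstructed from a shared low-dimensional
    latent, so only the latent needs to be cached per token.
    """
    c = x @ W_down   # compress to latent: (tokens, d_latent) -- this is cached
    k = c @ W_up_k   # reconstruct keys:   (tokens, d_model)
    v = c @ W_up_v   # reconstruct values: (tokens, d_model)
    return c, k, v

rng = np.random.default_rng(0)
d_model, d_latent, tokens = 64, 8, 5
c, k, v = mla_kv(rng.standard_normal((tokens, d_model)),
                 rng.standard_normal((d_model, d_latent)),
                 rng.standard_normal((d_latent, d_model)),
                 rng.standard_normal((d_latent, d_model)))
# Cache holds 8 floats per token (c) instead of 128 (K plus V).
```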
Distillation vs. RL
Distilling a stronger teacher model into a smaller student yields better downstream performance than running large‑scale RL directly on the small model.
Further improvements may still require a more capable base model and larger RL budgets.
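The classic logit‑level form of distillation makes the teacher‑to‑student idea concrete. Note that DeepSeek’s “distillation” is in practice SFT on R1‑generated samples rather than logit matching; the sketch below is the textbook version, with all names my own:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label distillation loss (sketch): KL(teacher || student).

    Temperature T softens both distributions so the student also learns
    the teacher's ranking over wrong answers, not just its argmax.
    """
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1).mean()
    return float(kl * T * T)                # T^2 keeps gradient scale comparable

rng = np.random.default_rng(0)
t = rng.standard_normal((4, 10))            # 4 positions, 10-way vocab
loss_far = distill_loss(rng.standard_normal((4, 10)), t)
loss_close = distill_loss(t + 0.01 * rng.standard_normal((4, 10)), t)
# A student that nearly matches the teacher incurs a much smaller loss.
```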
Lessons from Failed Approaches
Process‑reward models (PRMs) did not scale as an RL training signal at V3‑Base scale, though they remain useful for re‑ranking top‑N responses.
Monte‑Carlo Tree Search (MCTS) does not generalise to open‑ended reasoning tasks because the problem space is far less constrained than games like Go or Chess.
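Re‑ranking is the setting where a PRM still earns its keep: sample N responses, score each, keep the best. A toy sketch with a stand‑in scorer (everything here is illustrative):

```python
import numpy as np

def best_of_n(candidates, reward_fn):
    """Best-of-N re-ranking (sketch): score each sampled response with a
    reward model and return the highest-scoring one.  reward_fn is a
    stand-in for a learned (process-)reward model."""
    scores = [reward_fn(c) for c in candidates]
    return candidates[int(np.argmax(scores))], scores

# Toy usage: score drafts by length, a fake proxy for "verified steps".
answers = ["draft A", "draft BB", "draft CCC"]
best, scores = best_of_n(answers, reward_fn=len)
# -> best == "draft CCC"
```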
Reproducing R1 Results
Open‑source attempts (e.g., Hugging Face’s Open R1) indicate that reproducing the reported performance would require:
Approximately 2 048 GPUs for a full training run.
Access to the proprietary training code (not released).
The massive, undisclosed training dataset (the largest missing component).
Community projects such as OpenThoughts are assembling synthetic datasets to approximate the missing data.
Reference
arXiv preprint: https://arxiv.org/abs/2501.12948
JavaEdge
First‑line development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.