DeepSeek‑R1 Costs $294K to Train, Hits Nature Cover as First Peer‑Reviewed Large Model
DeepSeek‑R1, the first mainstream large language model to pass peer review in Nature, was trained for $294,000 using 648 H800 GPUs. Its pure‑RL precursor, DeepSeek‑R1‑Zero, reached 86.7% pass@1 on AIME 2024 with self‑consistency decoding, surpassing the average human contestant, and the models also performed strongly on coding and graduate‑level science tasks.
DeepSeek‑R1 on Nature Cover and Peer‑Reviewed Milestone
On September 17, DeepSeek‑R1’s research was featured on the cover of Nature, becoming the first mainstream large language model to undergo independent peer review in a top scientific journal. The peer‑review process involved external experts questioning the authors and requesting additional information under editorial supervision, marking a first for LLM research.
Beyond the scientific contribution, the paper disclosed the model’s training cost: $294,000 in total. Training DeepSeek‑R1‑Zero used 648 H800 GPUs for about 198 hours, while DeepSeek‑R1 itself required another 648 H800 GPUs for roughly 80 hours (just over 3 days). An additional 5,000 GPU‑hours were spent building the SFT dataset.
Why Reinforcement Learning Was Chosen Over Conventional Supervised Fine‑Tuning
Strong reasoning capability in LLMs traditionally demands massive compute during pre‑training. While Chain‑of‑Thought prompting and manually annotated reasoning traces can improve performance, they suffer from limited scalability and human bias. To avoid these constraints, DeepSeek adopted a reinforcement‑learning (RL) framework called Group Relative Policy Optimization (GRPO), skipping the usual supervised fine‑tuning (SFT) stage based on the hypothesis that unrestricted RL can foster emergent reasoning abilities.
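The core idea of GRPO's baseline estimation can be illustrated with a minimal sketch: instead of training a separate value model, each sampled completion's reward is normalized against the statistics of its own sampling group. This is a simplified illustration, not DeepSeek's implementation.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled completion's reward
    against the mean and std of its own group, replacing PPO's learned
    critic with cheap group statistics. (Illustrative simplification.)"""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: rewards for 4 completions sampled for the same prompt
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions scoring above the group mean get positive advantages and are reinforced; those below get negative advantages, with no critic network to train or store.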
DeepSeek‑R1‑Zero: RL‑Driven Reasoning Enhancements
The RL‑trained variant, DeepSeek‑R1‑Zero, generates longer, self‑reflective answers. It first outputs a reasoning segment under a “Think” tag, then provides the final answer under an “Answer” tag. A rule‑based reward system evaluates accuracy and format, guiding stable and scalable training. GRPO reduces the resource overhead of Proximal Policy Optimization (PPO) by estimating baselines from group scores instead of training a separate critic (value) model.
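A rule‑based reward of this kind might look like the sketch below, which scores format (the think/answer structure) and accuracy (an exact‑match final answer) separately. The tag names and weights are assumptions for illustration, not DeepSeek's exact scheme.

```python
import re

def rule_based_reward(output: str, gold_answer: str) -> float:
    """Illustrative rule-based reward: a format component for the
    <think>...</think><answer>...</answer> structure plus an accuracy
    component for matching the reference answer. (Tag names and the
    0.5/0.5 weighting are assumptions, not DeepSeek's exact scheme.)"""
    m = re.fullmatch(r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*",
                     output, flags=re.DOTALL)
    if m is None:
        return 0.0                      # malformed output earns nothing
    format_reward = 0.5
    accuracy_reward = 0.5 if m.group(1).strip() == gold_answer else 0.0
    return format_reward + accuracy_reward

print(rule_based_reward("<think>2+2 is 4</think><answer>4</answer>", "4"))
```

Because both checks are deterministic rules rather than a learned reward model, the signal is cheap to compute and hard for the policy to game, which is what makes large‑scale RL training stable.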
Benchmark Performance
On the AIME 2024 mathematics competition, DeepSeek‑R1‑Zero’s pass@1 score rose from an initial 15.6% to 77.9%, and with self‑consistency decoding it reached 86.7%, surpassing the average human contestant. The model also performed strongly on programming contests and graduate‑level biology, physics, and chemistry problems, confirming the effectiveness of RL‑based reasoning improvement.
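The two metrics mentioned above can be sketched in a few lines: pass@1 averages per‑sample success rates across problems, while self‑consistency decoding samples several answers and takes a majority vote. This is a minimal illustration of the metrics, not the paper's evaluation code.

```python
from collections import Counter

def pass_at_1(samples_per_problem):
    """pass@1 estimated as the mean per-sample success rate, averaged
    over problems. Each inner list holds 1 (correct) / 0 (incorrect)
    flags for independent samples of one problem."""
    rates = [sum(s) / len(s) for s in samples_per_problem]
    return sum(rates) / len(rates)

def self_consistency(answers):
    """Self-consistency decoding: sample multiple answers for the same
    problem and return the most frequent (majority-vote) answer."""
    return Counter(answers).most_common(1)[0][0]

# Majority voting rescues a problem where single samples often fail
print(self_consistency(["172", "96", "172"]))
```

Majority voting is why the self‑consistency score (86.7%) exceeds pass@1 (77.9%): even when any single sample is wrong a third of the time, the modal answer across samples is right more often.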
Training Dynamics and Self‑Evolution
During RL training, the model’s average reasoning length continuously grew, and it learned to pause, inspect, and revise its own steps, exhibiting reflective reasoning and systematic exploration of alternative solutions.
Multi‑Stage Pipeline for DeepSeek‑R1
1. Collect dialogue‑aligned cold‑start data with DeepSeek‑V3 and use it to fine‑tune DeepSeek‑R1 Dev1.
2. Perform RL and sampling on Dev1; incorporate the sampled reasoning data plus non‑reasoning data into SFT to produce Dev2.
3. Run a second RL phase on Dev2 to produce Dev3, enhancing helpfulness and harmlessness before the final release.
Comparative Evaluation Across Development Stages
Compared with DeepSeek‑R1‑Zero, the Dev1 checkpoint shows significant gains on instruction‑following metrics, and the final DeepSeek‑R1 scores higher still on the IF‑Eval and Arena‑Hard benchmarks.
Significance of Peer Review and Transparency
Nature’s editorial highlighted peer review as a safeguard against AI hype. Subbarao Kambhampati, former AAAI president, participated in the review and praised the trend, encouraging other frontier‑model developers to submit their technical details for peer review. Technology outlet Wind Info noted that the paper adds transparency to training processes and addresses earlier questions about distillation, setting a precedent for future AI research.
References
https://www.nature.com/articles/d41586-025-03015-6
https://www.nature.com/articles/d41586-025-02979-9
https://www.nature.com/articles/s41586-025-09422
HyperAI Super Neural