DeepSeek‑R1 Costs $294K to Train, Hits Nature Cover as First Peer‑Reviewed Large Model
DeepSeek‑R1, the first mainstream large language model to pass peer review in Nature, was trained for $294,000 using 648 H800 GPUs. Its pure‑RL precursor, DeepSeek‑R1‑Zero, reached 86.7% pass@1 on AIME 2024 with self‑consistency decoding, surpassing the average human contestant, and the models also performed strongly on coding and graduate‑level science tasks.
DeepSeek‑R1 on Nature Cover and Peer‑Reviewed Milestone
On September 17, DeepSeek‑R1’s research was featured on the cover of Nature, becoming the first mainstream large language model to undergo independent peer review in a top scientific journal. The peer‑review process involved external experts questioning the authors and requesting additional information under editorial supervision, marking a first for LLM research.
Beyond the scientific contribution, the paper disclosed the model’s training cost: $294,000 in total. Training DeepSeek‑R1‑Zero used 648 H800 GPUs for about 198 hours, while DeepSeek‑R1 itself required another 648 H800 GPUs for roughly 80 hours (just over 3 days). An additional 5,000 GPU‑hours were spent building the SFT dataset.
Why Reinforcement Learning Was Chosen Over Conventional Supervised Fine‑Tuning
Strong reasoning capability in LLMs traditionally demands massive compute during pre‑training. While Chain‑of‑Thought prompting and manually annotated reasoning traces can improve performance, they suffer from limited scalability and human bias. To avoid these constraints, DeepSeek adopted a reinforcement‑learning (RL) framework called Group Relative Policy Optimization (GRPO), skipping the usual supervised fine‑tuning (SFT) stage based on the hypothesis that unrestricted RL can foster emergent reasoning abilities.
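The core idea of GRPO's baseline estimation can be illustrated with a minimal sketch: instead of training a separate value model, each sampled completion's reward is normalized against the statistics of its own sampling group. This is a simplified illustration, not DeepSeek's implementation.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled completion's reward
    against the mean and std of its own group, replacing PPO's learned
    critic with cheap group statistics. (Illustrative simplification.)"""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: rewards for 4 completions sampled for the same prompt
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions scoring above the group mean get positive advantages and are reinforced; those below get negative advantages, with no critic network to train or store.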
DeepSeek‑R1‑Zero: RL‑Driven Reasoning Enhancements
The RL‑trained variant, DeepSeek‑R1‑Zero, generates longer, self‑reflective answers. It first outputs a reasoning segment under a “Think” tag, then provides the final answer under an “Answer” tag. A rule‑based reward system evaluates accuracy and format, guiding stable and scalable training. GRPO reduces the resource overhead of Proximal Policy Optimization (PPO) by estimating baselines from group scores instead of training a separate critic (value) model.
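A rule‑based reward of this kind might look like the sketch below, which scores format (the think/answer structure) and accuracy (an exact‑match final answer) separately. The tag names and weights are assumptions for illustration, not DeepSeek's exact scheme.

```python
import re

def rule_based_reward(output: str, gold_answer: str) -> float:
    """Illustrative rule-based reward: a format component for the
    <think>...</think><answer>...</answer> structure plus an accuracy
    component for matching the reference answer. (Tag names and the
    0.5/0.5 weighting are assumptions, not DeepSeek's exact scheme.)"""
    m = re.fullmatch(r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*",
                     output, flags=re.DOTALL)
    if m is None:
        return 0.0                      # malformed output earns nothing
    format_reward = 0.5
    accuracy_reward = 0.5 if m.group(1).strip() == gold_answer else 0.0
    return format_reward + accuracy_reward

print(rule_based_reward("<think>2+2 is 4</think><answer>4</answer>", "4"))
```

Because both checks are deterministic rules rather than a learned reward model, the signal is cheap to compute and hard for the policy to game, which is what makes large‑scale RL training stable.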
Benchmark Performance
On the AIME 2024 mathematics competition, DeepSeek‑R1‑Zero’s pass@1 score rose from an initial 15.6% to 77.9%, and with self‑consistency decoding it reached 86.7%, surpassing the average human contestant. The model also performed strongly on programming contests and graduate‑level biology, physics, and chemistry problems, confirming the effectiveness of RL‑based reasoning improvement.
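The two metrics mentioned above can be sketched in a few lines: pass@1 averages per‑sample success rates across problems, while self‑consistency decoding samples several answers and takes a majority vote. This is a minimal illustration of the metrics, not the paper's evaluation code.

```python
from collections import Counter

def pass_at_1(samples_per_problem):
    """pass@1 estimated as the mean per-sample success rate, averaged
    over problems. Each inner list holds 1 (correct) / 0 (incorrect)
    flags for independent samples of one problem."""
    rates = [sum(s) / len(s) for s in samples_per_problem]
    return sum(rates) / len(rates)

def self_consistency(answers):
    """Self-consistency decoding: sample multiple answers for the same
    problem and return the most frequent (majority-vote) answer."""
    return Counter(answers).most_common(1)[0][0]

# Majority voting rescues a problem where single samples often fail
print(self_consistency(["172", "96", "172"]))
```

Majority voting is why the self‑consistency score (86.7%) exceeds pass@1 (77.9%): even when any single sample is wrong a third of the time, the modal answer across samples is right more often.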
Training Dynamics and Self‑Evolution
During RL training, the model’s average reasoning length continuously grew, and it learned to pause, inspect, and revise its own steps, exhibiting reflective reasoning and systematic exploration of alternative solutions.
Multi‑Stage Pipeline for DeepSeek‑R1
1. Collect dialogue‑aligned cold‑start data with DeepSeek‑V3 and use it to fine‑tune DeepSeek‑R1 Dev1.
2. Perform RL and sampling on Dev1; incorporate the sampled reasoning data plus non‑reasoning data into SFT to produce Dev2.
3. Run a second RL phase on Dev2 to produce Dev3, enhancing helpfulness and harmlessness before the final release.
Comparative Evaluation Across Development Stages
Compared with DeepSeek‑R1‑Zero, the Dev1 checkpoint shows significant gains on instruction‑following metrics, and the final DeepSeek‑R1 scores higher still on the IF‑Eval and Arena‑Hard benchmarks.
Significance of Peer Review and Transparency
Nature’s editorial highlighted peer review as a safeguard against AI hype. Subbarao Kambhampati, former AAAI president, participated in the review and praised the trend, encouraging other frontier‑model developers to submit their technical details for peer review. Technology outlet Wind Info noted that the paper adds transparency to training processes and addresses earlier questions about distillation, setting a precedent for future AI research.
References
https://www.nature.com/articles/d41586-025-03015-6
https://www.nature.com/articles/d41586-025-02979-9
https://www.nature.com/articles/s41586-025-09422
HyperAI Super Neural