How DeepSeek‑R1’s Reinforcement Learning Redefined LLM Reasoning (Nature Cover Story)
DeepSeek‑R1, the first major large language model to undergo independent peer review, landed on Nature’s cover thanks to a reinforcement‑learning‑centred training pipeline that dramatically boosted reasoning performance while keeping training costs surprisingly low.
Overview
DeepSeek‑R1 is a large language model built on the DeepSeek‑V3 Base backbone. The authors replaced the conventional supervised fine‑tuning (SFT) stage with a minimalist reinforcement‑learning (RL) framework that defines only a task format and a reward signal.
Training Framework
The RL framework requires two elements:
Task format
Structure: each answer must contain a <think> block with the reasoning process and an <answer> block with the final answer (see the sketch after this list).
Reward signal
Correctness reward: the model receives a reward solely based on whether the final answer is correct, regardless of the reasoning path.
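A minimal sketch of what this task format might look like in code; the prompt wording and the helper name `parse_completion` are illustrative assumptions, and only the <think>/<answer> tags come from the paper.

```python
import re

# Illustrative prompt template; the authors' exact wording is not reproduced here.
PROMPT_TEMPLATE = (
    "Answer the question below. Put your step-by-step reasoning inside "
    "<think> ... </think> and only the final answer inside <answer> ... </answer>.\n\n"
    "Question: {question}"
)

TAGGED_OUTPUT = re.compile(
    r"<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>",
    re.DOTALL,
)

def parse_completion(completion: str) -> dict | None:
    """Split a completion into reasoning and answer; return None if the format is violated."""
    match = TAGGED_OUTPUT.search(completion)
    if match is None:
        return None
    return {"think": match.group("think").strip(), "answer": match.group("answer").strip()}
```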
GRPO Algorithm
DeepSeek introduced Group Relative Policy Optimization (GRPO) in place of the resource‑intensive Proximal Policy Optimization (PPO). For each query the model samples a group of answer candidates (e.g., 16), scores them, and computes each answer’s advantage relative to the group’s mean reward (normalised by the group’s standard deviation), so no separate value network is needed. This reduces computational overhead while preserving training stability.
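The group‑relative advantage that gives GRPO its name fits in a few lines. The sketch below follows the published GRPO formulation (rewards normalised against the group mean and standard deviation); the group size of 16 mirrors the text, and the variable names are illustrative.

```python
import numpy as np

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """Advantage of each sampled answer relative to its own group.

    For a group of G answers to the same query with rewards r_1..r_G:
        A_i = (r_i - mean(r)) / (std(r) + eps)
    so, unlike PPO, no separate value network has to be trained.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: 16 sampled answers to one query, 5 of which earn the correctness reward.
group_rewards = [1.0] * 5 + [0.0] * 11
advantages = grpo_advantages(group_rewards)
# Correct answers receive a positive advantage and incorrect ones a negative advantage;
# each answer's token log-probabilities are then weighted by its advantage inside the
# clipped policy-gradient objective.
```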
Reward Design
Rule‑based rewards
Accuracy: the final answer must exactly match the ground‑truth for math and coding tasks.
Format: the reasoning must be wrapped in <think> tags and the answer in <answer> tags.
Model‑based rewards
Usefulness model: a learned reward model evaluates whether the final answer is helpful and on‑topic. It is trained on pairs of good and bad responses generated by DeepSeek‑V3.
Safety model: another model checks the entire output, including the reasoning chain, for harmful, biased, or dangerous content; a combined reward sketch follows this list.
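A sketch of how these signals could be turned into per‑completion scalars. The format and accuracy logic mirrors the rule‑based rewards above; `usefulness` and `safety` stand in for the learned reward models, and the specific return values are assumptions rather than the paper’s exact recipe.

```python
import re
from typing import Callable

TAGS = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def rule_based_reward(completion: str, ground_truth: str) -> float:
    """Format check plus exact-match accuracy, as used for math and coding tasks."""
    match = TAGS.search(completion)
    if match is None:
        return -1.0  # format violated: missing or malformed <think>/<answer> blocks
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def model_based_reward(completion: str,
                       usefulness: Callable[[str], float],
                       safety: Callable[[str], float]) -> float:
    """Learned scores for open-ended tasks.

    `usefulness` scores only whether the final answer is helpful and on-topic,
    while `safety` scores the entire output, including the reasoning chain.
    """
    return usefulness(completion) + safety(completion)
```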
Training Stages
Cold start: thousands of high‑quality dialogue examples teach basic conversational behavior.
First RL stage: focuses on boosting reasoning ability and adds a language‑consistency incentive for Chinese inputs (a sketch of this bonus follows the list).
Large‑scale SFT: mixes reasoning data with massive non‑reasoning data (writing, general QA, code) to broaden knowledge and general capabilities.
Second RL stage: re‑applies RL with both rule‑based and model‑based rewards, lowers the sampling temperature to 0.7, and introduces model‑based rewards only in the final 400 steps to avoid reward exploitation.
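One plausible way to implement the first stage’s language‑consistency incentive is to reward the fraction of reasoning tokens written in the query’s language; the weight of 0.1 and the `detect_lang` helper are illustrative assumptions, not the paper’s coefficients.

```python
from typing import Callable

def language_consistency_bonus(reasoning_tokens: list[str],
                               target_lang: str,
                               detect_lang: Callable[[str], str],
                               weight: float = 0.1) -> float:
    """Bonus proportional to the share of reasoning tokens in the target language.

    `detect_lang` stands in for any per-token language detector; the bonus is simply
    added to the correctness reward during the first RL stage.
    """
    if not reasoning_tokens:
        return 0.0
    matching = sum(1 for tok in reasoning_tokens if detect_lang(tok) == target_lang)
    return weight * matching / len(reasoning_tokens)
```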
Key Hyper‑parameters
Learning rate: 3 × 10⁻⁶
KL‑divergence coefficient: 0.001
GRPO clipping ratio ε: 10
Inference temperature: 1 (reduced to 0.7 in the second RL stage)
Batch size: 512 (32 questions per step, each with 16 answer candidates)
Maximum token length increased from 32,768 to 65,536 at step 8,200, yielding a noticeable jump in answer length and quality.
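The hyper‑parameters listed above can be collected into a single configuration sketch (the field names are illustrative; the values are copied from the list):

```python
from dataclasses import dataclass

@dataclass
class GRPOTrainingConfig:
    learning_rate: float = 3e-6
    kl_coefficient: float = 0.001      # weight of the KL penalty against the reference policy
    clip_ratio_epsilon: float = 10.0   # GRPO clipping ratio, as reported above
    temperature: float = 1.0           # rollout sampling temperature (0.7 in the second RL stage)
    questions_per_step: int = 32
    samples_per_question: int = 16     # GRPO group size
    batch_size: int = 32 * 16          # 512 completions per optimisation step
    max_tokens: int = 32_768           # raised to 65,536 at step 8,200
```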
Performance
On the AIME 2024 benchmark, pass@1 accuracy rose from 15.6 % at the start of training to 77.9 % after the first RL stage, and reached 86.7 % when combined with self‑consistency decoding (majority voting over multiple sampled answers), surpassing average human performance. Additional gains of 17‑25 % were observed on AlpacaEval 2.0 and Arena‑Hard.
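Self‑consistency decoding amounts to sampling several complete answers and keeping the most frequent final answer. A minimal sketch, where `sample_answer` stands in for one model call that returns only the text inside the <answer> tags and the sample count is an arbitrary choice:

```python
from collections import Counter
from typing import Callable

def self_consistency(sample_answer: Callable[[str], str],
                     question: str,
                     n_samples: int = 64) -> str:
    """Majority vote over independently sampled final answers.

    pass@1 corresponds to n_samples == 1; with more samples, an answer the model
    reaches via many different reasoning paths wins the vote.
    """
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```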
Emergent Reasoning Behaviors
Longer thinking chains: the length of the <think> segment grew to hundreds of tokens as the model iteratively refined its solution.
Advanced strategies: the model began to self‑reflect, explore alternative solutions, and perform systematic “what‑if” analyses, demonstrating capabilities beyond linear step‑by‑step solving.
Challenges and Future Work
Capability limits: the model still struggles with structured output and tool use (e.g., calculators, search engines), and it is highly sensitive to prompt phrasing; few‑shot prompts tend to degrade its results, so zero‑shot prompting works best.
Reward hacking: pure RL success hinges on reliable reward signals; designing robust rewards for subjective tasks such as poetry remains an open problem.
References
Nature paper: https://www.nature.com/articles/s41586-025-09422
Nature commentary: https://www.nature.com/articles/d41586-025-03015-6