How DeepSeek‑R1’s Reinforcement Learning Redefined LLM Reasoning (Nature Cover Story)

DeepSeek‑R1, the first major large language model to undergo formal peer review, landed on Nature's cover thanks to a reinforcement‑learning‑driven training pipeline that dramatically boosted reasoning performance while keeping training costs surprisingly low.

Data Party THU

Overview

DeepSeek‑R1 is a large language model built on the DeepSeek‑V3 Base backbone. The authors replaced the conventional supervised fine‑tuning (SFT) stage with a minimalist reinforcement‑learning (RL) framework that defines only a task format and a reward signal.

Training Framework

The RL framework requires two elements:

Task format

Structure: each answer must contain a <think> block with the reasoning process and an <answer> block with the final answer.

Reward signal

Correctness reward: the model receives a reward solely based on whether the final answer is correct, regardless of the reasoning path.
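The two elements above can be sketched together: a parser that enforces the <think>/<answer> structure, and a reward that pays out only for a correct final answer. This is an illustrative sketch, not the paper's implementation; the function name and exact-match comparison are assumptions.

```python
import re

# Matches the required output structure: a <think> block followed by an
# <answer> block (closing tags are assumed here for parseability).
TEMPLATE = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def correctness_reward(completion: str, ground_truth: str) -> float:
    """Reward 1.0 only when the <answer> block matches the ground truth;
    the reasoning path inside <think> is never scored directly."""
    match = TEMPLATE.search(completion)
    if match is None:
        return 0.0  # malformed output earns nothing
    answer = match.group(2).strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

score = correctness_reward("<think>7 * 6 = 42</think><answer>42</answer>", "42")  # 1.0
```

Because the reward inspects only the final answer, the model is free to discover whatever reasoning style inside <think> best leads it there.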

GRPO Algorithm

DeepSeek introduced Group Relative Policy Optimization (GRPO) in place of the resource‑intensive Proximal Policy Optimization (PPO). For each query the model generates a group of answer candidates (e.g., 16), then normalizes each answer's reward against the group's mean to obtain its advantage, so no separate value (critic) model is needed. This reduces computational overhead while preserving training stability.
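The group‑relative advantage at the heart of GRPO can be sketched in a few lines, assuming the common mean‑and‑standard‑deviation normalization (the function name is hypothetical):

```python
from statistics import mean, stdev

def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize each candidate's reward against its own group:
    advantage_i = (r_i - mean(group)) / std(group)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        return [0.0 for _ in rewards]  # all answers tied: no learning signal
    return [(r - mu) / sigma for r in rewards]

# 16 candidates for one query: 4 correct (reward 1.0), 12 wrong (reward 0.0).
advantages = group_advantages([1.0] * 4 + [0.0] * 12)
```

Correct answers end up with positive advantages and wrong ones with negative advantages, and the advantages of each group sum to zero, which is what replaces PPO's learned value baseline.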

Reward Design

Rule‑based rewards

Accuracy: the final answer must exactly match the ground‑truth for math and coding tasks.

Format: the reasoning must be wrapped in <think> tags and the answer in <answer> tags.

Model‑based rewards

Usefulness model: a learned reward model evaluates whether the final answer is helpful and on‑topic. It is trained on pairs of good and bad responses generated by DeepSeek‑V3.

Safety model: another model checks the entire output, including the reasoning chain, for harmful, biased, or dangerous content.
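A reward model trained on pairs of good and bad responses, like the usefulness model above, is commonly fit with a Bradley–Terry pairwise loss. The following is a sketch of that loss only, under the assumption that this standard recipe applies; it is not the paper's exact training procedure.

```python
import math

def pairwise_loss(score_good: float, score_bad: float) -> float:
    """Bradley–Terry objective: -log sigmoid(score_good - score_bad).
    Minimizing it pushes the reward model to score the preferred
    response above the rejected one."""
    margin = score_good - score_bad
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A well-separated pair incurs little loss; a reversed pair incurs a lot.
well_ordered = pairwise_loss(2.0, 0.0)
reversed_pair = pairwise_loss(0.0, 2.0)
```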

Training Stages

Cold start: a few thousand high‑quality examples teach the model basic conversational behavior and a readable output style.

First RL stage: focuses on boosting reasoning ability and adds a language‑consistency incentive for Chinese inputs.

Large‑scale SFT: mixes reasoning data with massive non‑reasoning data (writing, general QA, code) to broaden knowledge and general capabilities.

Second RL stage: re‑applies RL with both rule‑based and model‑based rewards, lowers the sampling temperature to 0.7, and introduces the model‑based rewards only in the final 400 steps to limit reward hacking.

Key Hyper‑parameters

Learning rate: 3 × 10⁻⁶

KL‑divergence coefficient: 0.001

GRPO clipping ratio ε: 10

Inference temperature: 1 (reduced to 0.7 in the second RL stage)

Batch size: 512 (32 questions per step, each with 16 answer candidates)

Maximum token length: increased from 32,768 to 65,536 at step 8,200, yielding a noticeable jump in answer length and quality.
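Collected as a single configuration sketch (the key names are illustrative assumptions; the values are the ones listed above):

```python
# Hyper-parameters reported for the RL training runs.
RL_CONFIG = {
    "learning_rate": 3e-6,
    "kl_coeff": 0.001,            # KL-divergence coefficient
    "clip_ratio": 10,             # GRPO clipping ratio epsilon, as reported
    "temperature": 1.0,           # lowered to 0.7 in the second RL stage
    "questions_per_step": 32,
    "candidates_per_question": 16,
    "batch_size": 32 * 16,        # 512 sampled answers per step
    "max_tokens": 32_768,         # raised to 65_536 at step 8,200
}
```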

Performance

On the AIME 2024 benchmark, pass@1 accuracy rose from 15.6% at the start of training to 77.9% after the first RL stage, and reached 86.7% when combined with self‑consistency decoding, surpassing average human performance. Additional improvements of 17–25% were observed on AlpacaEval 2.0 and Arena‑Hard.
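Self‑consistency decoding is the standard majority‑vote scheme: sample several independent answers and return the most common one. A minimal sketch (the sample count and helper name are assumptions):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Self-consistency decoding: return the most frequent final answer
    among independently sampled completions."""
    return Counter(answers).most_common(1)[0][0]

# Five sampled final answers for one problem; the vote picks "42".
consensus = majority_vote(["42", "41", "42", "42", "39"])
```

Because a wrong reasoning path rarely lands on the same wrong answer twice, voting across samples filters out much of the noise in any single completion.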

Emergent Reasoning Behaviors

Longer thinking chains: the length of the <think> segment grew to hundreds of tokens as the model iteratively refined its solution.

Advanced strategies: the model began to self‑reflect, explore alternative solutions, and perform systematic “what‑if” analyses, demonstrating capabilities beyond linear step‑by‑step solving.

Challenges and Future Work

Capability limits: the model still struggles with structured output, tool use (e.g., calculators, search engines), and is highly sensitive to prompt phrasing; it performs best in zero‑shot settings.

Reward hacking: pure RL success hinges on reliable reward signals; designing robust rewards for subjective tasks such as poetry remains an open problem.

References

Nature paper: https://www.nature.com/articles/s41586-025-09422

Nature commentary: https://www.nature.com/articles/d41586-025-03015-6

Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
