How DeepSeek‑R1 Uses Pure Reinforcement Learning to Match OpenAI’s o1
This article presents DeepSeek‑R1 and DeepSeek‑R1‑Zero, two next‑generation LLMs: DeepSeek‑R1‑Zero is trained with pure reinforcement learning, while DeepSeek‑R1 combines reinforcement learning with multi‑stage fine‑tuning. It details the GRPO training framework, the model‑distillation pipeline, the open‑source release, and evaluation results that rival OpenAI’s o1‑1217 across reasoning, knowledge, and coding benchmarks.
Introduction
The paper introduces DeepSeek‑R1 and its RL‑only variant DeepSeek‑R1‑Zero, a new generation of large language models (LLMs). DeepSeek‑R1‑Zero’s reasoning ability is enhanced solely through reinforcement learning (RL), without any supervised fine‑tuning, while DeepSeek‑R1 adds supervised stages around RL. Both models are built on the DeepSeek‑V3‑Base backbone and aim to close the gap with OpenAI’s o1 series.
Method
Reinforcement Learning with GRPO
Training uses Group Relative Policy Optimization (GRPO), which replaces a separate critic model with a group‑based baseline estimated from a set of sampled outputs. For each query q, GRPO samples a group {o₁, …, o_G} from the old policy π_θold and updates the policy π_θ by maximizing a clipped, KL‑regularized objective in which each output’s advantage is its reward normalized against the group.
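Up to notational details, the GRPO objective from the paper maximizes a clipped, importance‑weighted advantage with a KL penalty toward a reference policy:

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{q\sim P(Q),\;\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(O\mid q)}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_{\theta}(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}\,A_i,\;\operatorname{clip}\left(\frac{\pi_{\theta}(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)},\,1-\varepsilon,\,1+\varepsilon\right)A_i\right)-\beta\,\mathbb{D}_{\mathrm{KL}}\left(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\right)\right)\right]
$$

Each output’s advantage is its reward standardized against the group, which is what removes the need for a separate value network:

$$
A_i=\frac{r_i-\operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}
$$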
Reward Modeling
Accuracy reward: evaluates the correctness of the model’s final answer, e.g., by checking a math result against the reference or running test cases for code.
Format reward: requires the model to wrap its reasoning steps in special tags such as <think>…</think> to improve readability. Both rewards are rule‑based rather than learned reward models (a sketch follows this list).
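Here is a minimal Python sketch of what such rule‑based rewards might look like. The <think>/<answer> tag template follows the paper’s prompt format, but the exact matching logic and the equal weighting are illustrative assumptions:

```python
import re

# Template from the paper's prompt format: reasoning inside <think>…</think>,
# final answer inside <answer>…</answer>. The strictness of this regex is an assumption.
TEMPLATE = re.compile(r"<think>(.+?)</think>\s*<answer>(.+?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the tag template, else 0.0."""
    return 1.0 if TEMPLATE.fullmatch(completion.strip()) else 0.0

def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted answer matches the reference. Real checkers
    normalize math expressions or run unit tests for code; plain string
    comparison here is a simplification."""
    match = TEMPLATE.fullmatch(completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0

def total_reward(completion: str, reference: str) -> float:
    # Equal weighting is an illustrative assumption, not the paper's recipe.
    return accuracy_reward(completion, reference) + format_reward(completion)

# Example: a well-formed, correct completion earns both rewards.
out = "<think>2+2 is 4</think> <answer>4</answer>"
print(total_reward(out, "4"))  # 2.0
```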
Model Variants
DeepSeek‑R1‑Zero is trained purely with RL on the base model, achieving strong reasoning behavior but showing weaknesses in readability and language mixing. DeepSeek‑R1 adds a cold‑start data phase and alternates two rounds of supervised fine‑tuning (SFT) with two rounds of RL, which mitigates those issues and further boosts performance (see the sketch below).
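A hypothetical outline of that alternation, following the staging described in the paper; the function names below are placeholder stubs, not a real training API:

```python
# Placeholder stubs standing in for the real training stages.
def sft(model: str, data: str) -> str:
    return f"{model} -> SFT[{data}]"

def grpo_rl(model: str, prompts: str) -> str:
    return f"{model} -> GRPO-RL[{prompts}]"

def rejection_sample(model: str) -> str:
    # Keep only high-quality RL outputs, then mix in general SFT data.
    return f"curated outputs of ({model}) + general SFT data"

base = "DeepSeek-V3-Base"
m1 = sft(base, "cold-start CoT data")              # stage 1: cold start
m2 = grpo_rl(m1, "reasoning prompts")              # stage 2: reasoning-oriented RL
m3 = sft(base, rejection_sample(m2))               # stage 3: SFT on rejection-sampled data
deepseek_r1 = grpo_rl(m3, "all-scenario prompts")  # stage 4: RL across all scenarios
print(deepseek_r1)
```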
Distillation
The authors distilled DeepSeek‑R1 into six smaller dense models (1.5B, 7B, 8B, 14B, 32B, 70B) based on Qwen and Llama by fine‑tuning them on reasoning samples generated by the large teacher, enabling compact models to inherit its strong reasoning capabilities.
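Notably, the distillation here is plain supervised fine‑tuning on teacher‑generated samples (roughly 800K in the paper) rather than logit matching. A hypothetical sketch with placeholder function names:

```python
# Placeholder stubs: distillation as SFT on (prompt, teacher response) pairs.
def generate_with_teacher(teacher: str, prompts: list[str]) -> list[tuple[str, str]]:
    """Sample reasoning traces from the teacher model (stub)."""
    return [(p, f"<think>…</think> <answer>answer to {p}</answer>") for p in prompts]

def finetune(student: str, pairs: list[tuple[str, str]]) -> str:
    """Standard supervised fine-tuning on the pairs (stub)."""
    return f"{student} fine-tuned on {len(pairs)} samples"

prompts = ["AIME problem …", "Codeforces problem …"]  # the paper uses ~800K curated samples
pairs = generate_with_teacher("DeepSeek-R1", prompts)
student = finetune("Qwen2.5-7B", pairs)               # student bases are Qwen and Llama models
print(student)
```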
Evaluation Results
Reasoning tasks: DeepSeek‑R1 achieves 79.8% Pass@1 on AIME 2024 (slightly above OpenAI‑o1‑1217) and 97.3% Pass@1 on MATH‑500, outperforming most competitors.
Knowledge tasks: competitive scores on MMLU, MMLU‑Pro, and GPQA Diamond, slightly below o1‑1217 but above other closed‑source models.
Programming & other tasks: a 2,029 Elo rating on Codeforces (top 3.7%), an 87.6% win rate on AlpacaEval 2.0, 92.3% on ArenaHard, and strong performance on creative writing, QA, and summarization.
Long‑context tasks: DeepSeek‑R1 substantially outperforms DeepSeek‑V3 on benchmarks that require long‑context understanding.
Conclusion and Future Work
The study demonstrates that pure RL can substantially improve LLM reasoning without supervised data, and that distillation can transfer this ability to much smaller models. Future research will focus on reducing language mixing, extending to multi‑turn dialogue, improving performance on software‑engineering tasks, and further optimizing efficiency.
All models and code are released publicly for research use.
