How DeepSeek‑R1 Uses Pure Reinforcement Learning to Match OpenAI’s o1

This article presents DeepSeek‑R1 and DeepSeek‑R1‑Zero, two next‑generation LLMs trained with pure reinforcement learning and multi‑stage fine‑tuning, details their GRPO training framework, model‑distillation pipeline, open‑source release, and evaluation results that rival OpenAI’s o1‑1217 across reasoning, knowledge, and coding benchmarks.


Introduction

The paper introduces DeepSeek‑R1 and DeepSeek‑R1‑Zero, a new generation of large language models (LLMs). DeepSeek‑R1‑Zero's reasoning ability is enhanced solely through reinforcement learning (RL), without any supervised fine‑tuning, while DeepSeek‑R1 adds a small cold‑start phase before RL. Both models are built on the DeepSeek‑V3‑Base backbone and aim to close the gap with OpenAI's o1 series.

Method

Reinforcement Learning with GRPO

Training uses Group Relative Policy Optimization (GRPO), which replaces a separate critic model with a group‑based baseline estimated from a set of sampled outputs. For each query q, GRPO samples a group {o₁,…,o_G} from the old policy π_θ_old and maximizes a reward‑adjusted objective to update the policy π_θ.
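The core idea can be illustrated with a minimal sketch: rewards within each sampled group are normalized to form advantages (the group‑based baseline), which then weight a PPO‑style clipped surrogate objective. This is a simplification, assuming per‑sequence log‑probabilities and omitting the KL penalty against a reference policy that GRPO also includes.

```python
import math

def group_relative_advantages(rewards):
    """GRPO baseline: normalize each reward against the group's mean and std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # small epsilon avoids division by zero
    return [(r - mean) / std for r in rewards]

def grpo_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective averaged over the sampled group.

    logp_new / logp_old: log-probabilities of each sampled output under the
    current policy pi_theta and the old policy pi_theta_old, respectively.
    """
    total = 0.0
    for lp_n, lp_o, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_n - lp_o)                      # importance ratio
        clipped = max(min(ratio, 1 + eps), 1 - eps)        # clamp to [1-eps, 1+eps]
        total += min(ratio * adv, clipped * adv)           # pessimistic bound
    return total / len(advantages)
```

Because the baseline is the group mean, no separate critic network is trained; a correct answer in a mostly wrong group gets a large positive advantage, and vice versa.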

Reward Modeling

Accuracy reward: evaluates the correctness of the model's answer.

Format reward: requires the model to wrap its reasoning steps in special tags such as <think>…</think> to improve readability.
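Both rewards are rule‑based rather than learned. A minimal sketch of such checks, assuming the completion follows a `<think>…</think> <answer>…</answer>` template and that accuracy can be judged by exact string match (real rule‑based checkers for math or code would be more elaborate):

```python
import re

# Full template: reasoning in <think> tags, then the answer in <answer> tags.
TEMPLATE_RE = re.compile(r"<think>.+?</think>\s*<answer>.+?</answer>", re.DOTALL)

def format_reward(completion):
    """1.0 if the completion follows the think/answer template, else 0.0."""
    return 1.0 if TEMPLATE_RE.fullmatch(completion.strip()) else 0.0

def accuracy_reward(completion, gold):
    """Rule-based check: extract the <answer> block and compare to the reference."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0
```

Because both signals are deterministic rules, there is no neural reward model to train and no reward‑hacking surface beyond the rules themselves.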

Model Variants

DeepSeek‑R1‑Zero is trained purely with RL on the base model, achieving strong reasoning behavior but showing weaknesses in readability and language mixing. DeepSeek‑R1 adds a cold‑start data phase and two rounds of supervised fine‑tuning (SFT) before and after RL, which mitigates those issues and further boosts performance.

Distillation

The authors distilled DeepSeek‑R1 into six smaller dense models (1.5B, 7B, 8B, 14B, 32B, and 70B parameters) by fine‑tuning them on reasoning data generated by the large teacher, enabling compact models to inherit its strong reasoning capabilities.

Evaluation Results

Reasoning tasks : DeepSeek‑R1 achieves 79.8% Pass@1 on AIME 2024 (slightly above OpenAI‑o1‑1217) and 97.3% Pass@1 on MATH‑500, outperforming most competitors.

Knowledge tasks : Competitive scores on MMLU, MMLU‑Pro, and GPQA Diamond, slightly below o1‑1217 but above other closed‑source models.

Programming & other tasks : 2,029 Elo on Codeforces (top 3.7%), 87.6% win rate on AlpacaEval 2.0, 92.3% on ArenaHard, and strong performance on creative writing, QA, and summarization.

Long‑context tasks : Superior results on AlpacaEval 2.0 and LiveCodeBench compared with DeepSeek‑V3.

Conclusion and Future Work

The study demonstrates that pure RL can substantially improve LLM reasoning without supervised data, and that distillation can transfer this ability to much smaller models. Future research will focus on reducing language‑mixing, extending to multi‑turn dialogue, improving performance on software‑engineering tasks, and further optimizing efficiency.

All models and code are released publicly for research use.

Tags: large language models, DeepSeek, reinforcement learning, model distillation, LLM evaluation, OpenAI o1
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
