DeepSeek‑R1: Reinforcement‑Learning‑Driven Long‑Chain Reasoning for Large Language Models
The article reviews DeepSeek‑R1, detailing its reinforcement‑learning‑based training pipeline that uses minimal supervised data, cold‑start fine‑tuning, multi‑stage RL, rejection‑sampling SFT, and distillation to achieve reasoning performance comparable to OpenAI‑o1‑1217, while also discussing successful contributions and failed experiments.
DeepSeek‑R1 presents a practical approach for achieving long‑chain and complex reasoning in large language models (LLMs) by relying almost entirely on reinforcement learning (RL) with very limited supervised data.
As the authors put it, the goal is "to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process."
The training pipeline consists of six main stages:
1. Collect a few thousand high-quality cold-start examples and fine-tune the DeepSeek-V3-Base model (model A).
2. Apply GRPO-based RL to model A to induce reasoning abilities, yielding model B.
3. Generate high-quality SFT data with model B via rejection sampling and combine it with existing DeepSeek-V3 data to form a large, high-quality dataset.
4. Fine-tune the original DeepSeek-V3-Base model on this dataset, producing model C.
5. Run a second round of RL on model C, using the expanded dataset across all domains, resulting in model D, which is the final DeepSeek-R1.
6. Distill the knowledge from model C into smaller models, achieving strong performance without additional RL.
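The GRPO steps in the pipeline above dispense with a learned value model: each sampled completion's reward is normalized against the other completions drawn for the same prompt. A minimal sketch of that group-relative advantage computation, in plain Python with hypothetical reward values:

```python
def grpo_advantages(rewards):
    """Group-relative advantages as used by GRPO: normalize each
    completion's reward by the mean and std of its sampling group,
    so no separate critic network is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: four completions sampled for one prompt, scored 1.0 if the
# final answer was correct and 0.0 otherwise (illustrative values).
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])  # -> [1.0, -1.0, -1.0, 1.0]
```

Correct completions in the group get positive advantages and incorrect ones negative, which is what pushes the policy toward answers that score well under the rule-based reward.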
The authors highlight several key contributions:
Applying GRPO-based RL directly to the base model, with no SFT stage at all (the DeepSeek-R1-Zero experiment), already elicits strong reasoning, suggesting RL can replace SFT in many scenarios.
The RL‑SFT‑RL‑distillation pipeline offers a useful blueprint for training other LLMs.
High‑quality distilled data is crucial; better datasets lead to superior model performance, emphasizing the importance of model‑generated data.
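The data-quality point above rests on the rejection-sampling SFT stage: many completions are sampled per prompt, and only those that pass a correctness check are kept as training data. A toy sketch, where `generate` and `is_correct` are hypothetical stand-ins for model sampling and answer verification:

```python
import random

def rejection_sample(prompt, generate, is_correct, n_samples=8):
    """Sample n completions for a prompt and keep only those that
    pass the correctness check; the surviving (prompt, completion)
    pairs become SFT training data."""
    kept = []
    for _ in range(n_samples):
        completion = generate(prompt)
        if is_correct(prompt, completion):
            kept.append((prompt, completion))
    return kept

# Toy usage: a fake "model" that answers 2+2 correctly only sometimes.
random.seed(0)
fake_generate = lambda p: random.choice(["4", "5"])
fake_verify = lambda p, c: c == "4"
data = rejection_sample("2+2=?", fake_generate, fake_verify)
```

The filtered dataset contains only verified-correct traces, which is why data distilled from a strong reasoning model transfers so well to smaller ones.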
Implementation details reveal that the reward is rule-based and outcome-oriented (ORM-style): it scores final-answer correctness and enforces proper chain-of-thought formatting, with an additional language-consistency reward to discourage mixing languages within the reasoning trace.
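A minimal sketch of such a rule-based outcome reward, combining a format check, an answer check, and a crude language-consistency bonus (the tag convention, weights, and ASCII heuristic are illustrative assumptions, not the paper's exact values):

```python
import re

def reward(output, gold_answer):
    """Rule-based outcome reward: a bonus for properly delimited
    <think>...</think> reasoning, +1.0 for a correct final answer,
    and a language-consistency bonus when the chain of thought stays
    in one script (here: mostly ASCII as a crude English proxy)."""
    r = 0.0
    m = re.search(r"<think>(.*?)</think>\s*(.*)", output, re.S)
    if m:
        r += 0.1  # format reward: reasoning is properly delimited
        thought, answer = m.group(1), m.group(2)
        if answer.strip() == gold_answer:
            r += 1.0  # accuracy reward on the final answer
        ascii_frac = sum(ch.isascii() for ch in thought) / max(len(thought), 1)
        if ascii_frac > 0.95:
            r += 0.1  # language-consistency reward: no script mixing
    return r

r = reward("<think>2 plus 2 is 4.</think> 4", "4")  # -> 1.2
```

Because every term is computed by rules rather than a neural reward model, there is no reward model to hack or drift, which is part of why this setup scales cleanly in RL training.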
Experimental results show DeepSeek‑R1 matching or surpassing OpenAI‑o1‑1217 on reasoning benchmarks such as AIME and MATH‑500, and the authors describe an "aha moment" during RL training in which the model spontaneously learns to allocate more thinking time by re‑evaluating its initial approach.
Unsuccessful attempts include a process‑reward model (PRM), which proved hard to scale and prone to reward hacking, and Monte Carlo Tree Search (MCTS), which was impractical for base‑model training because the token‑level search space grows explosively.
Overall, the study demonstrates that large‑scale RL can dramatically improve reasoning in LLMs, but scaling to smaller models still benefits more from distilled data than from direct RL.
Arxiv paper URL: https://arxiv.org/abs/2501.12948
ModelScope paper URL: https://modelscope.cn/papers/109508
GitHub repository: https://github.com/deepseek-ai/DeepSeek-R1/tree/main