DeepSeek‑R1: Training Pipeline, Reinforcement‑Learning Techniques, and Experimental Results
The article reviews DeepSeek‑R1’s training methodology—including cold‑start data collection, multi‑stage RL fine‑tuning, SFT data generation, and model distillation—highlights its performance comparable to OpenAI‑o1‑1217, and discusses key contributions, reward design, successful experiments, and failed attempts.
DeepSeek‑R1 presents a practical approach for achieving long‑chain and complex reasoning in large language models (LLMs) through a largely unsupervised reinforcement‑learning (RL) pipeline, accompanied by a detailed technical implementation and several experimental insights.
Goal: Explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on self‑evolution via a pure RL process.
Key Resources:
Arxiv paper: https://arxiv.org/abs/2501.12948
ModelScope paper: https://modelscope.cn/papers/109508
GitHub repository: https://github.com/deepseek-ai/DeepSeek-R1/tree/main
Training Pipeline (summarized from the paper):
Collect a few thousand high‑quality cold‑start examples and fine‑tune the DeepSeek‑V3‑Base model (model A).
Apply GRPO (a variant of PPO) on model A to induce reasoning ability, yielding model B.
Generate high‑quality SFT data with model B, mix it with other domain data from DeepSeek‑V3, and form a large curated dataset.
Fine‑tune the original DeepSeek‑V3‑Base on this dataset to obtain model C.
Repeat step 2 using model C and the full‑domain dataset, producing the final DeepSeek‑R1 (model D).
Distill knowledge from model C into smaller models, achieving strong performance without additional RL.
The authors note that an initial attempt without cold‑start data (direct GRPO on DeepSeek‑V3‑Base) improved chain‑of‑thought (CoT) ability but produced noisy, multilingual outputs, motivating the refined pipeline above.
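The RL stages above use GRPO, whose key departure from PPO is that it drops the learned value critic: for each prompt it samples a group of completions and normalizes each completion's reward against the group's own mean and standard deviation. The following is a minimal sketch of that group-relative advantage computation (function name and group scores are illustrative, not from the paper):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages in the GRPO style: for a group of G sampled
    completions to the same prompt, normalize each reward by the group's
    mean and (population) standard deviation instead of using a critic."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers to one prompt, scored by a rule-based reward.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline comes from the sampled group itself, no separate value network has to be trained, which is part of what makes the pipeline comparatively cheap.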
Major Contributions:
Demonstrated that skipping supervised fine‑tuning (SFT) and using GRPO‑based RL alone can match or exceed SFT performance, suggesting a larger role for RL in LLM training.
Introduced a pipeline of RL → SFT → RL → distillation that can guide future model training.
Showed that high‑quality distilled data dramatically benefits smaller models, emphasizing data quality over sheer quantity.
Reward Design (ORM):
Correctness reward: evaluates final answer correctness, including code execution results.
Format reward: requires the model to place the CoT process within a designated format.
The authors discuss challenges such as sparse, non‑continuous rewards potentially hindering policy convergence.
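The two reward components can be sketched as simple rule-based checks. The tag names, weighting, and exact-match comparison below are illustrative assumptions, not the paper's precise implementation:

```python
import re

# Assumed output format: CoT inside <think>…</think>, answer inside <answer>…</answer>.
THINK_RE = re.compile(r"^<think>.+</think>\s*<answer>(.+)</answer>\s*$", re.DOTALL)

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Hypothetical sketch of the two rule-based reward components:
    a format reward for keeping the CoT inside designated tags, and a
    correctness reward for matching the reference answer."""
    m = THINK_RE.match(completion.strip())
    format_reward = 1.0 if m else 0.0
    answer = m.group(1).strip() if m else ""
    correctness_reward = 1.0 if answer == reference_answer.strip() else 0.0
    return format_reward + correctness_reward

r = rule_based_reward("<think>2+2 is 4</think><answer>4</answer>", "4")
```

Note how coarse this signal is: a completion earns 0, 1, or 2, with nothing in between, which is exactly the sparse, non‑continuous reward the authors flag as a convergence risk.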
Experimental Findings:
DeepSeek‑R1‑Zero (the early version without cold‑start SFT) achieved dramatic gains on benchmarks such as AIME 2024 (pass@1: 15.6% → 71.0%) without any supervised data, highlighting the power of RL‑driven training.
An “aha moment” was observed where the model learned to allocate more thinking time by re‑evaluating its initial approach.
Distilling large‑model data into smaller models outperformed direct RL training of small models, confirming the efficiency of knowledge distillation.
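Distillation here is simply supervised fine-tuning of a small student on teacher-generated reasoning traces, typically filtered for correctness (rejection sampling). A minimal sketch of building such a dataset, where `teacher_generate` and `is_correct` are hypothetical stand-ins for the actual model and verifier:

```python
def build_distillation_set(teacher_generate, prompts, is_correct):
    """Sketch of distillation-as-SFT: sample a reasoning trace from a strong
    teacher for each prompt, keep only traces whose final answer passes a
    correctness check, and use the survivors as SFT targets for a student."""
    dataset = []
    for prompt in prompts:
        trace = teacher_generate(prompt)
        if is_correct(prompt, trace):
            dataset.append({"prompt": prompt, "completion": trace})
    return dataset

# Toy usage with stand-in callables (assumptions, not the real models):
demo = build_distillation_set(
    lambda p: f"reasoning about {p} ... answer: 4",
    ["2+2"],
    lambda p, t: t.endswith("4"),
)
```

The paper's finding is that running this cheap SFT recipe with high-quality teacher data beats running the full RL loop directly on the small model.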
Unsuccessful Attempts:
PRM (a process‑reward model) proved ineffective or even detrimental, likely due to non‑differentiable components and reward hacking.
Monte‑Carlo Tree Search (MCTS) failed because the token‑level action space in language generation is vastly larger than in board games, so naive MCTS suffers a combinatorial explosion of next‑token branches and unstable value estimation during training.
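Back-of-the-envelope arithmetic shows why token-level search is intractable: the branching factor is the vocabulary size, so the tree grows as vocab_size ** depth. The numbers below are illustrative assumptions, not figures from the paper:

```python
# Branching factor of token-level search is the vocabulary size (assumed ~100k),
# so even shallow lookahead produces an astronomically large tree.
vocab_size = 100_000   # typical LLM vocabulary size (assumption)
depth = 10             # only ten tokens of lookahead
search_space = vocab_size ** depth  # 10**50 candidate sequences

# For comparison, a board game like Go offers roughly 250 legal moves per turn.
go_space = 250 ** depth
```

Even Go-scale branching, where MCTS famously works, is dozens of orders of magnitude smaller than ten tokens of unconstrained text.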
Overall, the study suggests that while RL can unlock strong reasoning abilities, large‑scale model distillation remains a cost‑effective and reliable path for improving smaller models, and future breakthroughs may still require more powerful base models and extensive RL computation.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.