DeepSeek‑R1: Training Pipeline, Reinforcement‑Learning Techniques, and Experimental Results
The article reviews DeepSeek‑R1’s training methodology—including cold‑start data collection, multi‑stage RL fine‑tuning, SFT data generation, and model distillation—highlights its performance comparable to OpenAI‑o1‑1217, and discusses key contributions, reward design, successful experiments, and failed attempts.
DeepSeek‑R1 presents a practical approach for achieving long‑chain and complex reasoning in large language models (LLMs) through a largely unsupervised reinforcement‑learning (RL) pipeline, accompanied by a detailed technical implementation and several experimental insights.
Goal: Explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on self‑evolution via a pure RL process.
Key Resources:
Arxiv paper: https://arxiv.org/abs/2501.12948
ModelScope paper: https://modelscope.cn/papers/109508
GitHub repository: https://github.com/deepseek-ai/DeepSeek-R1/tree/main
Training Pipeline (summarized from the paper):
Collect a few thousand high‑quality cold‑start examples and fine‑tune the DeepSeek‑V3‑Base model (model A).
Apply GRPO (a variant of PPO) on model A to induce reasoning ability, yielding model B.
Generate high‑quality SFT data with model B, mix it with other domain data from DeepSeek‑V3, and form a large curated dataset.
Fine‑tune the original DeepSeek‑V3‑Base on this dataset to obtain model C.
Repeat step 2 using model C and the full‑domain dataset, producing the final DeepSeek‑R1 (model D).
Distill knowledge from model C into smaller models, achieving strong performance without additional RL.
The authors note that an initial attempt without cold‑start data (direct GRPO on DeepSeek‑V3‑Base) improved chain‑of‑thought (CoT) ability but produced noisy, multilingual outputs, motivating the refined pipeline above.
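The RL stages above use GRPO, whose key departure from PPO is that it drops the learned value critic: for each prompt it samples a group of completions and normalizes each completion's reward against the group's own mean and standard deviation. The following is a minimal sketch of that group-relative advantage computation (function name and group scores are illustrative, not from the paper):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages in the GRPO style: for a group of G sampled
    completions to the same prompt, normalize each reward by the group's
    mean and (population) standard deviation instead of using a critic."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers to one prompt, scored by a rule-based reward.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline comes from the sampled group itself, no separate value network has to be trained, which is part of what makes the pipeline comparatively cheap.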
Major Contributions:
Demonstrated that skipping supervised fine‑tuning (SFT) and using GRPO‑based RL alone can match or exceed SFT performance, suggesting a larger role for RL in LLM training.
Introduced a pipeline of RL → SFT → RL → distillation that can guide future model training.
Showed that high‑quality distilled data dramatically benefits smaller models, emphasizing data quality over sheer quantity.
Reward Design (ORM):
Correctness reward: evaluates final answer correctness, including code execution results.
Format reward: requires the model to place the CoT process within a designated format.
The authors discuss challenges such as sparse, non‑continuous rewards potentially hindering policy convergence.
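The two reward components can be sketched as simple rule-based checks. The tag names, weighting, and exact-match comparison below are illustrative assumptions, not the paper's precise implementation:

```python
import re

# Assumed output format: CoT inside <think>…</think>, answer inside <answer>…</answer>.
THINK_RE = re.compile(r"^<think>.+</think>\s*<answer>(.+)</answer>\s*$", re.DOTALL)

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Hypothetical sketch of the two rule-based reward components:
    a format reward for keeping the CoT inside designated tags, and a
    correctness reward for matching the reference answer."""
    m = THINK_RE.match(completion.strip())
    format_reward = 1.0 if m else 0.0
    answer = m.group(1).strip() if m else ""
    correctness_reward = 1.0 if answer == reference_answer.strip() else 0.0
    return format_reward + correctness_reward

r = rule_based_reward("<think>2+2 is 4</think><answer>4</answer>", "4")
```

Note how coarse this signal is: a completion earns 0, 1, or 2, with nothing in between, which is exactly the sparse, non‑continuous reward the authors flag as a convergence risk.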
Experimental Findings:
DeepSeek‑R1‑Zero (the early version without cold‑start SFT) achieved dramatic gains on benchmarks such as AIME 2024 (pass@1: 15.6% → 71.0%) without any supervised data, highlighting the power of RL‑driven training.
An “aha moment” was observed where the model learned to allocate more thinking time by re‑evaluating its initial approach.
Distilling large‑model data into smaller models outperformed direct RL training of small models, confirming the efficiency of knowledge distillation.
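Distillation here is simply supervised fine-tuning of a small student on teacher-generated reasoning traces, typically filtered for correctness (rejection sampling). A minimal sketch of building such a dataset, where `teacher_generate` and `is_correct` are hypothetical stand-ins for the actual model and verifier:

```python
def build_distillation_set(teacher_generate, prompts, is_correct):
    """Sketch of distillation-as-SFT: sample a reasoning trace from a strong
    teacher for each prompt, keep only traces whose final answer passes a
    correctness check, and use the survivors as SFT targets for a student."""
    dataset = []
    for prompt in prompts:
        trace = teacher_generate(prompt)
        if is_correct(prompt, trace):
            dataset.append({"prompt": prompt, "completion": trace})
    return dataset

# Toy usage with stand-in callables (assumptions, not the real models):
demo = build_distillation_set(
    lambda p: f"reasoning about {p} ... answer: 4",
    ["2+2"],
    lambda p, t: t.endswith("4"),
)
```

The paper's finding is that running this cheap SFT recipe with high-quality teacher data beats running the full RL loop directly on the small model.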
Unsuccessful Attempts:
PRM (a process‑reward model) proved ineffective or even detrimental, likely due to non‑differentiable components and reward hacking.
Monte‑Carlo Tree Search (MCTS) failed because the token‑level action space in language generation is vastly larger than in board games, so naive MCTS suffers a combinatorial explosion of next‑token branches and unstable value estimation during training.
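Back-of-the-envelope arithmetic shows why token-level search is intractable: the branching factor is the vocabulary size, so the tree grows as vocab_size ** depth. The numbers below are illustrative assumptions, not figures from the paper:

```python
# Branching factor of token-level search is the vocabulary size (assumed ~100k),
# so even shallow lookahead produces an astronomically large tree.
vocab_size = 100_000   # typical LLM vocabulary size (assumption)
depth = 10             # only ten tokens of lookahead
search_space = vocab_size ** depth  # 10**50 candidate sequences

# For comparison, a board game like Go offers roughly 250 legal moves per turn.
go_space = 250 ** depth
```

Even Go-scale branching, where MCTS famously works, is dozens of orders of magnitude smaller than ten tokens of unconstrained text.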
Overall, the study suggests that while RL can unlock strong reasoning abilities, large‑scale model distillation remains a cost‑effective and reliable path for improving smaller models, and future breakthroughs may still require more powerful base models and extensive RL computation.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.