
DeepSeek R1: Multi‑Stage Reinforcement Learning, Reward Modeling, and Distillation for a High‑Performance LLM

DeepSeek R1 builds on the DeepSeek V3 base model using a multi‑stage reinforcement learning pipeline—including GRPO optimization, rule‑based reward modeling, supervised fine‑tuning, language‑consistency rewards, rejection sampling, and distillation—to produce a high‑performing, aligned LLM capable of accurate reasoning.


DeepSeek R1 is not trained from scratch; it starts from the DeepSeek V3 mixture‑of‑experts (MoE) large language model and applies a series of reinforcement‑learning (RL) stages to turn it into a reasoning‑focused assistant.

DeepSeek V3 is a mixture‑of‑experts (MoE) transformer: instead of passing every token through one monolithic feed‑forward block, a learned gating (router) network scores a pool of expert sub‑networks for each token and activates only the top‑scoring few. This routing lets lightweight experts handle simple queries while more specialized experts handle complex reasoning, keeping most parameters inactive for any given token.
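
The routing idea can be sketched in a few lines. This is a toy top‑k gate, not DeepSeek V3's actual router (the real gate is a learned layer with load‑balancing terms); it only illustrates scoring experts and renormalizing the weights of the chosen few.

```python
import math

def top_k_route(logits, k=2):
    """Toy top-k gating: keep the k highest-scoring experts and
    renormalize their softmax weights over just those experts."""
    top = sorted(range(len(logits)), key=lambda i: logits[i])[-k:]
    m = max(logits[i] for i in top)               # for numerical stability
    exp = [math.exp(logits[i] - m) for i in top]
    z = sum(exp)
    return top, [e / z for e in exp]

# One token's (made-up) affinity score for each of 4 experts:
experts, weights = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
print(experts, weights)   # two expert ids, mixing weights summing to 1
```

Only the selected experts run a forward pass for that token, which is why an MoE model can have far more parameters than it spends compute on per token.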

In the first RL stage, DeepSeek V3 acts as the actor (policy) and is optimized with GRPO (Group Relative Policy Optimization): for each prompt, a group of outputs is sampled from the previous policy snapshot (the "old policy"), and each sampled output is scored by a rule‑based reward model that checks answer correctness and proper use of <think> and <answer> tags.

The reward model combines several components: an accuracy reward for correct answers, a format reward for correctly structured tags, a language‑consistency reward that encourages the output language to match the query language, and stability terms (a KL‑divergence penalty against a reference policy plus PPO‑style ratio clipping). Advantages are computed by comparing each output's reward with the mean reward of its sampled group, and the new‑to‑old policy probability ratios, weighted by these advantages, drive the update of the new policy.

An example of the expected structured output for the arithmetic problem "2 + 3 × 4":

<think>Order of operations: multiply before add. 3 * 4 = 12. 2 + 12 = 14</think>
<answer>14</answer>
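
The rule‑based format and accuracy rewards for this tag structure are simple to express. The checks below are illustrative stand‑ins (the paper describes the rewards only at this level of detail), using a regex over the tags:

```python
import re

# Full format reward only when the output is exactly one <think> block
# followed by one <answer> block.
FORMAT = re.compile(r"^<think>.+</think>\s*<answer>.+</answer>$", re.DOTALL)

def format_reward(output: str) -> float:
    return 1.0 if FORMAT.match(output.strip()) else 0.0

def accuracy_reward(output: str, gold: str) -> float:
    """Extract the <answer> contents and compare against the reference."""
    m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold else 0.0

good = "<think>3*4=12, 2+12=14</think>\n<answer>14</answer>"
print(format_reward(good), accuracy_reward(good, "14"))   # 1.0 1.0
print(format_reward("14"))                                # 0.0
```

Rule‑based checks like these avoid training a neural reward model for math and code tasks, which removes one source of reward hacking.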

Supervised fine‑tuning (SFT) follows a cold‑start data‑collection phase in which high‑quality chain‑of‑thought (CoT) examples are written in a delimited format using special tokens, for example:

Problem: What is 2 + 3 * 4?
Solution: | special_token | Following order of operations (PEMDAS), multiply first: 3*4=12. Then add: 2+12=14. | special_token | Summary: The answer is 14.

These examples teach the model to generate step‑by‑step reasoning and final answers in a consistent format.
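
Assembling one such cold‑start sample is a matter of string templating. A minimal sketch, assuming the delimiter shown above (the literal token name is illustrative; the real pipeline uses tokens from the model's vocabulary):

```python
SPECIAL = "| special_token |"

def make_cold_start_example(problem: str, reasoning: str, summary: str) -> str:
    """Build one cold-start SFT sample in the delimited CoT format:
    reasoning between special tokens, followed by a short summary."""
    return (f"Problem: {problem}\n"
            f"Solution: {SPECIAL} {reasoning} {SPECIAL} Summary: {summary}")

sample = make_cold_start_example(
    "What is 2 + 3 * 4?",
    "Following order of operations (PEMDAS), multiply first: 3*4=12. Then add: 2+12=14.",
    "The answer is 14.")
print(sample)
```

Fine‑tuning on a few thousand such samples gives the model a readable reasoning format before large‑scale RL, avoiding the unreadable mixed‑language outputs seen when RL is run directly on the base model.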

Before the second SFT stage, DeepSeek employs rejection sampling: many candidate outputs are generated, then filtered by the reward model for correctness, completeness of reasoning, and language consistency. Only the top‑quality samples (≈600 k) are kept for further training.
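
The generate‑then‑filter loop can be sketched as follows. The reward function here is a placeholder (candidate length), standing in for the real scoring of correctness, reasoning completeness, and language consistency:

```python
def rejection_sample(candidates, reward_fn, keep_ratio=0.25):
    """Score every candidate with a reward function and keep only the
    top fraction for further training."""
    scored = sorted(candidates, key=reward_fn, reverse=True)
    keep = max(1, int(len(scored) * keep_ratio))
    return scored[:keep]

# Toy stand-in reward: longer "reasoning" scores higher (illustrative only).
candidates = ["14",
              "3*4=12, 2+12=14",
              "step by step: 3*4=12, so 2+12=14"]
best = rejection_sample(candidates, reward_fn=len, keep_ratio=0.34)
print(best)
```

The same pattern scales up: generate many completions per prompt from the RL checkpoint, score them, and keep only high‑quality samples (the ≈600 k mentioned above) as SFT data.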

The final alignment RL stage adds usefulness and harmlessness rewards, combines diverse data (reasoning, QA, writing), and uses human‑preferred comparisons to fine‑tune the model with GRPO, achieving a balance between strong reasoning ability and safe behavior.
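
Combining several reward signals typically reduces to a weighted sum fed to GRPO. The weights below are purely illustrative, not DeepSeek's published values:

```python
def combined_reward(accuracy, format_ok, helpfulness, harmlessness,
                    weights=(1.0, 0.5, 0.5, 1.0)):
    """Weighted sum of reward components; each component is assumed
    to already be normalized to a comparable scale."""
    parts = (accuracy, format_ok, helpfulness, harmlessness)
    return sum(w * p for w, p in zip(weights, parts))

r = combined_reward(accuracy=1.0, format_ok=1.0,
                    helpfulness=0.8, harmlessness=1.0)
print(r)
```

Tuning these weights is how the pipeline trades off raw reasoning accuracy against the usefulness and harmlessness preferences learned from human comparisons.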

After the RL and SFT phases, the trained DeepSeek‑R1 model serves as a teacher for distillation. About 800 k reasoning samples are generated, and smaller student models (e.g., Qwen‑1.5B, Llama‑8B) are fine‑tuned to mimic the teacher's outputs, resulting in compact models that retain much of DeepSeek‑R1's reasoning capability.
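
Notably, this distillation is plain SFT on teacher generations rather than logit matching: the teacher answers each prompt, and the (prompt, answer) pairs become the student's fine‑tuning data. A sketch with a toy stand‑in teacher (the real one is the trained R1 model):

```python
def build_distillation_set(prompts, teacher_generate):
    """Pair each prompt with the teacher's generated answer; the pairs
    are then used as ordinary SFT data for the student model."""
    return [(p, teacher_generate(p)) for p in prompts]

def toy_teacher(prompt: str) -> str:
    # Placeholder generation in the teacher's tagged output format.
    return f"<think>reasoning for: {prompt}</think><answer>...</answer>"

pairs = build_distillation_set(["What is 2 + 3 * 4?"], toy_teacher)
print(len(pairs), pairs[0][0])
```

Because the students only need to imitate text, any open model with an SFT pipeline can serve as a student, which is why both Qwen and Llama families were distilled.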

Tags: DeepSeek · Reinforcement Learning · LLM Training · Model Distillation · Reward Modeling
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
