Do Reinforcement Learning Techniques Really Boost LLM Reasoning? A Deep Dive into Recent Models
This article examines whether reinforcement learning genuinely enhances large language model reasoning, comparing findings from DeepSeek‑Math and a Tsinghua & Shanghai Jiao‑Tong paper, and then walks through practical training pipelines (Seed‑Thinking‑v1.5, DeepSeek‑R1, Kimi‑K1.5, and Qwen3) that aim to endow LLMs with robust reasoning capabilities.
Can RL Improve LLM Reasoning?
DeepSeek‑Math Findings
DeepSeek‑Math evaluates Instruct‑tuned and RL‑tuned models on two benchmarks using Pass@K (probability that at least one of the top‑K generated answers is correct) and Maj@K (majority vote among the top‑K). RL improves Maj@K but leaves Pass@K unchanged, indicating that RL stabilises the output distribution and pushes correct answers higher in the ranking without fundamentally enhancing the model’s intrinsic reasoning capability.
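The distinction between the two metrics is easy to miss, so here is a minimal sketch of both (a toy single-prompt version, not the unbiased Pass@K estimator used in formal evaluations):

```python
from collections import Counter

def pass_at_k(samples, is_correct):
    """Pass@K: 1 if at least one of the K sampled answers is correct."""
    return float(any(is_correct(s) for s in samples))

def maj_at_k(samples, is_correct):
    """Maj@K: 1 if the most frequent answer among the K samples is correct."""
    top_answer, _ = Counter(samples).most_common(1)[0]
    return float(is_correct(top_answer))

# A case that separates the two: the correct answer "42" appears once,
# so Pass@K succeeds while the majority vote picks the wrong "41".
samples = ["42", "41", "41"]
correct = lambda a: a == "42"
print(pass_at_k(samples, correct))  # 1.0
print(maj_at_k(samples, correct))   # 0.0
```

The DeepSeek‑Math result amounts to RL converting cases like this one (Pass@K hit, Maj@K miss) into majority-vote hits, without creating new Pass@K hits.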
Tsinghua & Shanghai Jiao‑Tong Analysis
The paper shows that most reasoning paths produced after RL already exist in the base model’s sampling distribution. RL therefore acts as a biasing mechanism toward reward‑rich trajectories, improving sampling efficiency but narrowing the reasoning frontier compared to the untouched base model. Three empirical points are validated:
Reasoning patterns are present in the base model.
Distillation (e.g., DeepSeek‑R1‑Distill‑Qwen‑7B), unlike RL, can genuinely expand the reasoning boundary beyond the base model.
Different RL algorithms (PPO, GRPO, DAPO) exhibit varied effects on reasoning performance.
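One concrete way these algorithms differ is how they estimate the advantage. GRPO, for instance, drops PPO's learned value network and instead normalizes each sampled completion's reward within its sampling group; a minimal sketch of that group-relative step (advantage computation only, not the full clipped objective):

```python
import statistics

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: normalize each completion's
    reward by the mean and std of its sampling group, replacing the
    learned value baseline that PPO would use."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# Four samples for one prompt, two of which earned the reward.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Note how the advantage is purely relative to sibling samples: if every completion in the group succeeds (or fails), all advantages collapse to zero and the prompt contributes no gradient, which is one reason mixed-difficulty prompt pools matter.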
Methods for Endowing LLMs with Reasoning Ability
Seed‑Thinking‑v1.5 (Mixture‑of‑Experts)
Seed‑Thinking‑v1.5 is a MoE model with 200 B total parameters, of which roughly 20 B are activated per token. Development focuses on three pillars: high‑quality RL data, a robust reward model, and scalable RL infrastructure.
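The 20 B-of-200 B ratio comes from sparse expert routing: each token is dispatched to only the top-k experts, so most parameters sit idle on any given forward pass. A toy sketch of that gating step (sizes and scores are illustrative, not Seed‑Thinking's actual configuration):

```python
def route(gate_scores, top_k=1):
    """Return the indices of the top-k experts for one token,
    ranked by gate score. Only these experts' parameters run."""
    return sorted(range(len(gate_scores)), key=lambda i: -gate_scores[i])[:top_k]

# 10 experts, top-1 routing: a tenth of the experts fire per token,
# mirroring the 20B-activated-of-200B-total proportion above.
scores = [0.1, 0.9, 0.2, 0.05, 0.3, 0.15, 0.25, 0.4, 0.08, 0.12]
print(route(scores))  # [1] — only expert 1 runs for this token
```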
RL Data Preparation
Reward Model
SFT Phase Data and Training
RL Phase Training
DeepSeek‑R1 Training Pipeline
Step 1 – Obtain DeepSeek‑R1‑Zero
Pure RL (no supervised data) is applied to the base model with a fixed prompt template (shown in the paper) to produce a model called R1‑Zero. R1‑Zero then generates thousands of chain‑of‑thought (CoT) cold‑start examples that serve as training data for the final model.
Step 2 – Build DeepSeek‑R1
The full pipeline consists of six stages (including two data‑preparation stages):
Stage 0: Create DeepSeek‑R1‑Zero.
Stage 1: First supervised fine‑tuning (SFT) on V3‑Base using the CoT data from R1‑Zero, establishing basic format compliance and reflective verification.
Stage 2: First RL round to strengthen reasoning in mathematics, code, and logic.
Stage 2.5: Expand the dataset to roughly 600 K reasoning examples across domains plus 200 K non‑reasoning examples.
Stage 3: Second SFT on V3‑Base with 800 K examples to improve generality.
Stage 4: Second RL round to align with human preferences, enhance safety, and refine reasoning.
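As a rough mental model, the stages above form an alternating SFT/RL chain that always restarts SFT from the base model rather than stacking fine-tunes. A placeholder sketch (the `sft` and `rl` functions and their string outputs are purely illustrative, not actual training code):

```python
def sft(base_model, dataset):
    """Supervised fine-tuning step (placeholder)."""
    return f"{base_model} + SFT({dataset})"

def rl(model, objective):
    """Reinforcement-learning step (placeholder)."""
    return f"{model} + RL({objective})"

r1_zero = rl("V3-Base", "pure RL with rule-based rewards")  # Stage 0
m = sft("V3-Base", "CoT cold-start data from R1-Zero")      # Stage 1
m = rl(m, "math / code / logic reasoning")                  # Stage 2
# Stage 2.5: sample ~600K reasoning + ~200K non-reasoning examples from m
m = sft("V3-Base", "800K mixed examples")                   # Stage 3 (restart from base)
r1 = rl(m, "human preference and safety alignment")         # Stage 4
```

The restart at Stage 3 is the part most often misread: the 800 K examples are applied to a fresh V3‑Base, not to the Stage 2 checkpoint.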
Kimi‑K1.5 (Multimodal)
Kimi‑K1.5 follows a pipeline similar to Seed‑Thinking‑v1.5 but adds multimodal considerations.
RL Data Collection Directions
Diverse coverage: prompts span STEM, coding, and general reasoning.
Balanced difficulty: mix of easy, medium, and hard problems.
Accurate evaluability: prompts allow objective verification of reasoning correctness.
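The three criteria translate naturally into a prompt-pool filter. A minimal sketch (the field names, domain tags, and record shape are assumptions for illustration, not from the Kimi‑K1.5 paper):

```python
def keep_prompt(prompt):
    """Keep a candidate RL prompt only if it satisfies all three
    collection criteria: in-scope domain, rated difficulty, and a
    reference answer enabling objective verification."""
    diverse = prompt["domain"] in {"stem", "coding", "general"}
    balanced = prompt["difficulty"] in {"easy", "medium", "hard"}
    verifiable = prompt.get("reference_answer") is not None
    return diverse and balanced and verifiable

pool = [
    {"domain": "stem", "difficulty": "hard", "reference_answer": "42"},
    {"domain": "poetry", "difficulty": "easy", "reference_answer": None},
]
filtered = [p for p in pool if keep_prompt(p)]
print(len(filtered))  # 1 — the unverifiable out-of-scope prompt is dropped
```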
Sampling Strategy
RL Algorithm – Online Policy Mirror Descent
The authors remove the value model for two reasons: (1) to encourage exploration of longer reasoning chains, and (2) to simplify the RL loop.
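Without a value model, the baseline has to come from somewhere else; a common value-free choice (and roughly what mirror-descent-style methods like this use) is the mean reward over the K responses sampled for the same prompt. A minimal sketch, with the exact baseline choice stated as an assumption:

```python
def mean_baseline_advantages(rewards):
    """Value-free advantage estimate: subtract the mean reward of the
    K responses sampled for one prompt, instead of querying a learned
    value network."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Four samples, two correct: correct ones get positive advantage.
print(mean_baseline_advantages([1.0, 0.0, 0.0, 1.0]))  # [0.5, -0.5, -0.5, 0.5]
```

Because long chains are not penalised by a per-step value estimate, the policy is freer to explore extended reasoning, matching the authors' first stated motivation.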
Qwen3
Pre‑training Phase
Post‑training Phase – Four‑Stage Process
Stage 1 – Long CoT Cold‑Start: Fine‑tune on extensive CoT data covering math, programming, logic, and STEM to give the model a solid reasoning foundation.
Stage 2 – Reasoning‑Based RL: Scale up RL compute and apply rule‑based rewards to balance exploration of new reasoning paths with exploitation of known good ones.
Stage 3 – Thinking‑Mode Fusion: Merge the Long CoT data with standard instruction‑following data. The resulting “Thinking” model generates data that seamlessly combines deep reasoning with fast response.
Stage 4 – General RL: Apply RL across 20+ general tasks (instruction compliance, format adherence, agent abilities) to improve overall capability and mitigate undesirable behaviours.
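The rule-based rewards in Stage 2 are typically just deterministic answer checks rather than learned models. A minimal sketch, assuming answers are wrapped in a `\boxed{...}` marker (the tag convention and 0/1 scoring are illustrative, not Qwen3's actual rules):

```python
import re

def rule_based_reward(completion, reference):
    """Score 1.0 if the completion's final boxed answer matches the
    reference exactly, 0.0 if it is wrong or unparseable."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # no parseable final answer
    return 1.0 if match.group(1).strip() == reference else 0.0

print(rule_based_reward(r"... so the answer is \boxed{12}", "12"))  # 1.0
print(rule_based_reward("I think it's 12", "12"))                   # 0.0
```

Because the check is mechanical, it scales to millions of rollouts and cannot be flattered the way a learned reward model can, though it only works on the objectively verifiable prompts that the data-collection criteria above select for.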
Key Takeaways for Building Strong Reasoning Models
The base model must be powerful (e.g., DeepSeek‑V3).
Training data should be comprehensive, high‑quality, and cover diverse reasoning domains.
Incorporate reasoning data in the final pre‑training stage to give the model exposure to CoT patterns.
Reward models that directly score CoT outputs yield better alignment.
Reasoning data should be present throughout both SFT and RL phases.
References
DeepSeek‑Math: https://arxiv.org/pdf/2402.03300
Tsinghua & Shanghai Jiao‑Tong paper: https://arxiv.org/pdf/2504.13837
Seed‑Thinking‑v1.5: https://arxiv.org/pdf/2504.13914v1
Kimi‑K1.5: https://arxiv.org/pdf/2501.12599
Qwen3 (Transformers docs): https://huggingface.co/docs/transformers/model_doc/qwen3
Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.