Do Reinforcement Learning Techniques Really Boost LLM Reasoning? A Deep Dive into Recent Models

This article examines whether reinforcement learning genuinely enhances large language model reasoning, drawing on findings from DeepSeek‑Math and a Tsinghua & Shanghai Jiao‑Tong paper, and then outlines the practical training pipelines of Seed‑Thinking‑v1.5, DeepSeek‑R1, Kimi‑K1.5, and Qwen3, which aim to endow LLMs with robust reasoning capabilities.

Can RL Improve LLM Reasoning?

DeepSeek‑Math Findings

DeepSeek‑Math evaluates Instruct‑tuned and RL‑tuned models on two benchmarks using Pass@K (the probability that at least one of K sampled answers is correct) and Maj@K (whether the majority vote over K sampled answers is correct). RL improves Maj@K but leaves Pass@K essentially unchanged, indicating that RL stabilises the output distribution and pushes correct answers toward the top without fundamentally enhancing the model's intrinsic reasoning capability.
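To make the two metrics concrete, here is a minimal Python sketch of the standard unbiased Pass@K estimator and a simple Maj@K vote. The combinatorial estimator is the commonly used form; the answer-matching logic is a simplification for illustration.

```python
from math import comb
from collections import Counter

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate: probability that at least one of K samples,
    drawn from n generations of which c are correct, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def maj_at_k(answers: list[str], reference: str, k: int) -> float:
    """Maj@K: 1.0 if the most frequent answer among the first K samples
    matches the reference, else 0.0 (ties broken by insertion order)."""
    top_answer, _ = Counter(answers[:k]).most_common(1)[0]
    return float(top_answer == reference)

# Example: 16 generations with 4 correct -> Pass@8 is roughly 0.96.
print(pass_at_k(n=16, c=4, k=8))
```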

Tsinghua & Shanghai Jiao‑Tong Analysis

The paper shows that most reasoning paths produced after RL already exist in the base model’s sampling distribution. RL therefore acts as a biasing mechanism toward reward‑rich trajectories, improving sampling efficiency but narrowing the reasoning frontier compared to the untouched base model. Three empirical points are validated:

Reasoning patterns are present in the base model.

Distillation (e.g., DeepSeek‑R1‑Distill‑Qwen‑7B) can expand the reasoning boundary.

Different RL algorithms (PPO, GRPO, DAPO) exhibit varied effects on reasoning performance.
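To illustrate the last point, here is a minimal sketch of how PPO and GRPO differ in estimating per-response advantages: PPO subtracts a learned critic's value estimate, while GRPO normalises rewards within the group of responses sampled for the same prompt. This is a simplified illustration, not the exact implementations used in the papers.

```python
import torch

def ppo_advantages(rewards: torch.Tensor, values: torch.Tensor) -> torch.Tensor:
    """PPO-style: advantage = reward minus a learned critic's value estimate
    (shown for terminal-reward-only episodes, without GAE)."""
    return rewards - values

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style: no critic; normalise each reward against the mean and
    std of the G responses sampled for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Example: one prompt, G = 4 sampled responses scored 0/1 by a verifier.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
values = torch.tensor([0.6, 0.4, 0.5, 0.7])   # hypothetical critic predictions
print(ppo_advantages(rewards, values))
print(grpo_advantages(rewards))
```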

Methods for Endowing LLMs with Reasoning Ability

Seed‑Thinking‑v1.5 (Mixture‑of‑Experts)

Seed‑Thinking‑v1.5 is a MoE model with 200 B total parameters and 20 B activated parameters per token. Its development focuses on three pillars: high‑quality RL data, a robust reward model, and scalable RL infrastructure.

RL Data Preparation

Reward Model

SFT Phase Data and Training

RL Phase Training

DeepSeek‑R1 Training Pipeline

Step 1 – Obtain DeepSeek‑R1‑Zero

Applying pure RL (no supervised data) to the base model with a fixed prompt template produces a model called R1‑Zero. R1‑Zero then generates thousands of chain‑of‑thought (CoT) cold‑start examples that serve as training data for the final model.
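For reference, a paraphrased sketch of an R1‑Zero‑style template is shown below. The exact wording differs in the paper; the key idea is forcing the chain of thought into tags that a rule‑based reward can parse.

```python
# Paraphrased sketch of an R1-Zero-style prompt template (not the verbatim
# wording from the DeepSeek-R1 paper): the chain of thought goes in <think>
# tags and the final result in <answer> tags, making outputs easy to verify.
R1_ZERO_STYLE_TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first thinks "
    "about the reasoning process in its mind and then provides the answer. "
    "The reasoning is enclosed in <think> </think> tags and the answer in "
    "<answer> </answer> tags.\n"
    "User: {question}\n"
    "Assistant:"
)

prompt = R1_ZERO_STYLE_TEMPLATE.format(question="What is 17 * 24?")
print(prompt)
```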

Step 2 – Build DeepSeek‑R1

The full pipeline consists of six stages (including two data‑preparation stages):

Stage 0: Create DeepSeek‑R1‑Zero.

Stage 1: First supervised fine‑tuning (SFT) on V3‑Base using the CoT data from R1‑Zero, establishing basic format compliance and reflective verification.

Stage 2: First RL round to strengthen reasoning in mathematics, code, and logic.

Stage 2.5: Expand the dataset to roughly 600 K reasoning examples plus 200 K non‑reasoning examples (800 K in total).

Stage 3: Second SFT on V3‑Base with 800 K examples to improve generality.

Stage 4: Second RL round to align with human preferences, enhance safety, and refine reasoning.
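Laid out as a single schematic, the pipeline alternates SFT and RL on top of V3‑Base. The stage names and data volumes below come from the list above; the code is only an illustrative summary, not a real training API.

```python
# Schematic of the DeepSeek-R1 pipeline stages described above.
PIPELINE = [
    ("stage_0",   "pure RL on the base model", "produces R1-Zero and CoT cold-start data"),
    ("stage_1",   "SFT on V3-Base",            "cold-start CoT data from R1-Zero"),
    ("stage_2",   "RL round 1",                "math / code / logic reasoning"),
    ("stage_2.5", "data expansion",            "~600K reasoning + ~200K non-reasoning examples"),
    ("stage_3",   "SFT on V3-Base",            "800K mixed examples for generality"),
    ("stage_4",   "RL round 2",                "preference alignment, safety, refined reasoning"),
]

for name, step, detail in PIPELINE:
    print(f"{name}: {step} ({detail})")
```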

Kimi‑K1.5 (Multimodal)

Kimi‑K1.5 follows a pipeline similar to Seed‑Thinking‑v1.5 but adds multimodal considerations.

RL Data Collection Directions

Diverse coverage: prompts span STEM, coding, and general reasoning.

Balanced difficulty: mix of easy, medium, and hard problems.

Accurate evaluability: prompts allow objective verification of reasoning correctness.
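A minimal sketch of how prompt filtering along these three directions might look in practice, assuming each prompt carries a verifiability flag and an empirical pass rate obtained by sampling the current model. The field names and thresholds are illustrative assumptions, not taken from the paper.

```python
# Keep prompts that are objectively verifiable and neither trivially easy nor
# hopelessly hard under the current model. pass_rate is assumed to be the
# fraction of sampled answers a verifier marks correct.
def keep_prompt(prompt: dict,
                min_pass_rate: float = 0.05,
                max_pass_rate: float = 0.95) -> bool:
    if not prompt["verifiable"]:          # must allow objective checking
        return False
    return min_pass_rate <= prompt["pass_rate"] <= max_pass_rate

dataset = [
    {"id": "math-001", "verifiable": True,  "pass_rate": 1.00},  # too easy, dropped
    {"id": "code-042", "verifiable": True,  "pass_rate": 0.30},  # kept
    {"id": "essay-07", "verifiable": False, "pass_rate": 0.50},  # not checkable, dropped
]
print([p["id"] for p in dataset if keep_prompt(p)])  # ['code-042']
```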

Sampling Strategy

RL Algorithm – Online Policy Mirror Descent

The authors remove the value model for two reasons: (1) to encourage exploration of longer reasoning chains, and (2) to simplify the RL loop.
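For orientation, the relative‑entropy‑regularised objective that policy mirror descent methods typically optimise has the following generic form (a textbook formulation, not necessarily the exact loss used in the K1.5 paper):

$$
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\bigl[r(x, y)\bigr]
\;-\;
\tau\,\mathbb{E}_{x \sim \mathcal{D}}\Bigl[\mathrm{KL}\bigl(\pi_{\theta}(\cdot \mid x)\,\|\,\pi_{\theta_{\text{old}}}(\cdot \mid x)\bigr)\Bigr]
$$

Here $r(x, y)$ is the verifier reward and $\tau$ controls how far the updated policy may drift from the previous iterate. Without a value model, advantages come directly from sampled rewards (e.g., relative to the mean reward of responses for the same prompt) rather than from a learned critic.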

Qwen3

Pre‑training Phase

Post‑training Phase – Four‑Stage Process

Stage 1 – Long CoT Cold‑Start: Fine‑tune on extensive CoT data covering math, programming, logic, and STEM to give the model a solid reasoning foundation.

Stage 2 – Reasoning‑Based RL: Scale up RL resources and apply rule‑based rewards to boost both exploration of new reasoning paths and exploitation of known good ones (a sketch of such a reward follows this list).

Stage 3 – Thinking‑Mode Fusion: Merge the Long CoT data with standard instruction‑following data, so the resulting "Thinking" model can seamlessly combine deep reasoning with fast response.

Stage 4 – General RL: Apply RL across 20+ general tasks (instruction compliance, format adherence, agent abilities) to improve overall capability and mitigate undesirable behaviours.
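Below is a minimal sketch of a rule‑based reward in the spirit of Stage 2, assuming a <think>/<answer> output convention. The tag format and the scores are illustrative assumptions, not Qwen3's documented reward specification.

```python
import re

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Score a completion: penalise broken format, reward an exact answer match."""
    format_ok = re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                          completion, flags=re.DOTALL)
    if not format_ok:
        return -1.0                      # malformed output
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    extracted = match.group(1).strip() if match else ""
    return 1.0 if extracted == gold_answer.strip() else 0.0

print(rule_based_reward("<think>2*3=6</think><answer>6</answer>", "6"))  # 1.0
print(rule_based_reward("The answer is 6.", "6"))                        # -1.0
```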

Key Takeaways for Building Strong Reasoning Models

The base model must be powerful (e.g., DeepSeek‑V3).

Training data should be comprehensive, high‑quality, and cover diverse reasoning domains.

Incorporate reasoning data in the final pre‑training stage to give the model exposure to CoT patterns.

Reward models that directly score CoT outputs yield better alignment.

Reasoning data should be present throughout both SFT and RL phases.

References

DeepSeek‑Math: https://arxiv.org/pdf/2402.03300
Tsinghua & Shanghai Jiao‑Tong paper: https://arxiv.org/pdf/2504.13837
Seed‑Thinking‑v1.5: https://arxiv.org/pdf/2504.13914v1
Kimi‑K1.5: https://arxiv.org/pdf/2501.12599
Qwen3 (Transformers docs): https://huggingface.co/docs/transformers/model_doc/qwen3