Does Reinforcement Learning Really Expand Reasoning Capacity in Large Language Models? Insights from Recent Empirical Study
Recent empirical research by Tsinghua’s LeapLab and Shanghai Jiao Tong University reveals that reinforcement learning with verifiable rewards (RLVR) improves sampling efficiency but does not extend the fundamental reasoning abilities of large language models beyond those of their base models, as demonstrated across mathematics, code, and visual reasoning benchmarks.
Recent work by the LeapLab team at Tsinghua University, together with Shanghai Jiao Tong University, investigates whether reinforcement learning with verifiable rewards (RLVR) can give large language models (LLMs) reasoning abilities that surpass those of their underlying base models.
The study conducts systematic experiments across three domains—mathematical reasoning, code generation, and visual reasoning—using multiple LLM families (e.g., Qwen‑2.5, LLaMA‑3.1) and their RL‑trained variants. The core finding is that all correct reasoning paths produced by RLVR models already exist in the base model’s output distribution; RL merely makes sampling of high‑reward paths more efficient.
When evaluating with the pass@k metric (the probability of obtaining at least one correct answer within k samples), RL models outperform the base model at very small k (e.g., pass@1), but as k grows to dozens or hundreds, the base model consistently catches up and eventually surpasses the RL‑trained models. This pattern holds across all three tasks, indicating that RL improves sampling efficiency while narrowing overall reasoning coverage relative to the base model.
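The pass@k metric described above can be estimated without bias from n generated samples of which c are correct, using the standard combinatorial estimator from the code-evaluation literature. A minimal sketch (the numbers in the usage lines are hypothetical, not figures from the paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one
    of k samples drawn without replacement from n generations, c of
    which are correct, is correct. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Too few incorrect samples to fill a size-k subset, so every
        # subset of size k must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem counts: a base model with a few correct
# samples scattered across n generations still reaches high pass@k
# as k grows, even if its pass@1 is low.
n = 16
print(pass_at_k(n, c=2, k=1))   # low chance with a single sample
print(pass_at_k(n, c=2, k=16))  # certainty once all samples are taken
```

This estimator is why the crossover at large k is meaningful: even a small number of correct paths in the base model's distribution drives pass@k toward 1 as k increases.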
Further analysis shows that RL‑trained models concentrate probability mass on a narrower set of reasoning trajectories, reducing their “reasoning breadth.” Perplexity measurements confirm that the high‑reward paths favored by RL already lie within the base model’s distribution; the RL process does not generate novel reasoning strategies.
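The perplexity check above amounts to scoring an RL-favored reasoning path under the base model: perplexity is the exponentiated average negative log-likelihood of the path's tokens, and a low value means the path was already likely under the base model. A self-contained sketch with hypothetical per-token log-probabilities (actual model scoring is omitted):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity of a token sequence from per-token natural-log
    probabilities: exp(-mean(log p))."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical log-probs the base model might assign to a reasoning
# path produced by the RL-trained model. Each token at p = 0.5 yields
# a perplexity of exactly 2, i.e. the path is far from improbable.
logps = [math.log(0.5)] * 8
print(perplexity(logps))  # 2.0
```

A path that the base model could not produce would instead show very low token probabilities and thus high perplexity; the paper's finding is that this does not happen for RLVR outputs.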
The authors compare several RL algorithms (PPO, GRPO, Reinforce++, RLOO, DAPO, ReMax) and find only minor differences in sampling efficiency, with none achieving a clear superiority. They also contrast RLVR with knowledge distillation, noting that distillation can genuinely expand a model’s reasoning frontier, whereas RLVR cannot.
In summary, while RLVR can make LLMs solve certain problems faster, it does not enable them to solve problems that the base model cannot; the apparent gains are limited to sampling efficiency rather than true capability expansion.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.