Can RL Really Boost LLM Reasoning? A Critical Review of Recent Findings

This article critically examines recent RL‑for‑LLM studies, revealing that reinforcement learning improves search efficiency but does not extend the intrinsic reasoning capabilities of base models, and explores the underlying model‑conditioned optimization bias, comparisons with SFT distillation, and the trade‑off with catastrophic forgetting.


Introduction

The author reviews several recent papers on reinforcement learning (RL) for large language models (LLMs) to understand whether RL truly enhances LLM reasoning beyond the capabilities of the underlying pretrained model.

Key Paper and Main Claim

The first highlighted work (a NeurIPS best paper) poses the question: can RL enable LLMs to surpass the reasoning ability of their base model? The authors conclude that it cannot. Their experiments show that reinforcement learning with verifiable rewards (RLVR) only makes the search for solutions more efficient; the model's ultimate capability remains bounded by the base model.

https://arxiv.org/abs/2504.13837

Evaluation via pass@k

The study uses the pass@k metric: sample k candidate solutions per problem and count the problem as solved if any of them passes. RL-trained models outperform the base model at k=1, but as k grows the gap narrows and the base model eventually overtakes the RL model. This indicates that RL does not expand the solution space; it only raises the probability of surfacing solutions the base model could already find.
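For concreteness, pass@k is usually computed with the unbiased estimator from Chen et al. (2021) rather than by literally drawing k samples. The sketch below uses that estimator; the reviewed paper's exact evaluation code is not reproduced here, and the numbers are illustrative only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: draw n samples per problem, of which c pass,
    and estimate the probability that a random subset of k samples contains
    at least one correct solution."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy numbers: a model that solves a problem in 8 of 64 samples looks weak at
# k=1 but is nearly certain to solve it once the budget reaches k=32. The
# crossover the paper reports appears because the base model has nonzero c on
# problems where the RL model has c = 0.
print(round(pass_at_k(n=64, c=8, k=1), 3))   # 0.125
print(round(pass_at_k(n=64, c=8, k=32), 3))  # ≈ 0.998
```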

Generalization Across Methods and Tasks

The conclusion holds across RL algorithms (PPO, GRPO, etc.), evaluation suites (math, code, visual reasoning), and model scales. In each case, RL narrows the coverage of reasoning paths, which is why performance degrades at larger k.

Why RL Fails to Extend Capability

The authors attribute the limitation to a model-conditioned optimization bias: RL updates stay within regions preferred by the pretrained model, effectively acting as a “compass” that steers optimization along a narrow corridor.

Three‑Gate Theory

Gate I – On‑Policy KL Leash

RL constrains each update’s KL divergence to stay close to the original model distribution, akin to a leash that prevents large exploratory jumps.
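Concretely, the leash is the standard KL-regularized objective used in RLHF-style training. This is the generic form; the exact coefficient and estimator vary by implementation and are not taken from the paper:

$$
\max_{\theta}\ \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}\big[r(x,y)\big]\;-\;\beta\,\mathrm{KL}\big[\pi_\theta(\cdot\mid x)\ \big\|\ \pi_{\mathrm{ref}}(\cdot\mid x)\big]
$$

Here $\pi_{\mathrm{ref}}$ is the frozen base (or SFT) model and $\beta$ sets how short the leash is: larger $\beta$ keeps every update closer to the pretrained distribution.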

Gate II – Model Geometry Determines KL‑Bounded Steps

The pretrained model's parameter space has structured geometry, with some directions far more curved than others. RL updates tend to avoid the high-curvature directions and stay within the subspaces the base weights already emphasize, so the model's spectral geometry (e.g., the principal angles between weight subspaces) is largely preserved.
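A minimal, hypothetical probe of this idea on a single weight matrix might look like the sketch below. The function name, the choice of r, and the use of the base model's top right-singular subspace are illustrative assumptions, not the paper's actual measurement.

```python
import numpy as np

def top_subspace_overlap(W_base: np.ndarray, W_rl: np.ndarray, r: int = 16) -> float:
    """Fraction of the update's energy lying in the base matrix's top-r
    right singular subspace. Values near 1 would mean RL moved mostly along
    directions the pretrained geometry already emphasizes."""
    delta = W_rl - W_base
    _, _, Vt = np.linalg.svd(W_base, full_matrices=False)
    V_r = Vt[:r].T                      # (d, r) top-r right singular vectors
    proj = delta @ V_r @ V_r.T          # component of the update inside that subspace
    return float(np.linalg.norm(proj) ** 2 / np.linalg.norm(delta) ** 2)

# Dummy matrices standing in for one weight before and after RL fine-tuning.
rng = np.random.default_rng(0)
W_base = rng.normal(size=(256, 256))
W_rl = W_base + 1e-3 * rng.normal(size=(256, 256))
print(top_subspace_overlap(W_base, W_rl, r=16))  # ~r/d for a random update
```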

Gate III – Precision Acts as a Lens

Low‑precision formats (e.g., bfloat16) filter out tiny updates, making the bias appear sparse. In reality, many small updates occur but are zeroed out by limited precision.
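A quick way to see the rounding effect is the standalone PyTorch demonstration below; it is not the paper's experimental setup, just an illustration of how bfloat16 swallows small steps.

```python
import torch

# bfloat16 keeps only 7 stored mantissa bits, so an update much smaller than
# roughly 2^-8 of a weight's magnitude rounds away entirely.
w = torch.tensor(1.0, dtype=torch.bfloat16)
tiny_update = torch.tensor(1e-4, dtype=torch.bfloat16)
large_update = torch.tensor(1e-2, dtype=torch.bfloat16)

print(w + tiny_update == w)    # tensor(True): the small step is lost to rounding
print(w + large_update == w)   # tensor(False): only sufficiently large steps survive
```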

https://arxiv.org/abs/2511.08567

Comparison with SFT Distillation

Unlike RL, supervised fine‑tuning (SFT) or distillation can expand the model’s capability, allowing it to solve problems the base model cannot. The article presents visual evidence that RL models become “specialists”—highly accurate on a subset of tasks but performing worse on the rest—whereas SFT broadens competence.

Catastrophic Forgetting Trade‑off

RL training, by contrast, largely avoids catastrophic forgetting, while SFT often suffers from it. A separate paper (RL's Razor) shows that SFT can cause severe forgetting whereas on-policy RL tends to stay close to the base model, prompting a discussion of the inherent trade-off between acquiring new abilities and retaining old ones.
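A minimal sketch of the kind of diagnostic this suggests, assuming access to logits from both the fine-tuned and frozen base models on the same held-out prompts; the direction of the KL and the exact estimator used in RL's Razor may differ.

```python
import torch
import torch.nn.functional as F

def forgetting_proxy_kl(logits_ft: torch.Tensor, logits_base: torch.Tensor) -> torch.Tensor:
    """Mean per-token KL(pi_ft || pi_base) over a batch of sequences.

    logits_*: (batch, seq_len, vocab) logits from the fine-tuned and frozen
    base models on the same prompts. The rough claim is that this quantity
    tracks how much old-task behaviour is lost during fine-tuning."""
    logp_ft = F.log_softmax(logits_ft, dim=-1)
    logp_base = F.log_softmax(logits_base, dim=-1)
    kl = (logp_ft.exp() * (logp_ft - logp_base)).sum(dim=-1)  # per-token KL
    return kl.mean()

# Dummy logits standing in for real model outputs on held-out prompts.
b, t, v = 2, 16, 32
print(float(forgetting_proxy_kl(torch.randn(b, t, v), torch.randn(b, t, v))))
```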

https://arxiv.org/abs/2509.04259

On‑Policy Distillation as a Hybrid Approach

Recent work from Thinking Machines proposes “On‑Policy Distillation,” a hybrid that retains RL’s training dynamics while incorporating SFT‑style distillation. The hoped‑for benefits include expanding capability boundaries, efficient inference path search, and mitigating catastrophic forgetting.
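A minimal sketch of such a hybrid loss, assuming the sequences are sampled by the student and scored token-by-token by a frozen teacher (a per-token reverse KL; the exact objective in the Thinking Machines write-up may differ in details):

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor) -> torch.Tensor:
    """Per-token reverse KL, KL(student || teacher), averaged over positions.

    Both logit tensors have shape (batch, seq_len, vocab) and are computed on
    sequences sampled by the student, which is what makes the procedure
    on-policy; only the student receives gradients."""
    logp_s = F.log_softmax(student_logits, dim=-1)
    logp_t = F.log_softmax(teacher_logits.detach(), dim=-1)  # teacher is frozen
    kl = (logp_s.exp() * (logp_s - logp_t)).sum(dim=-1)       # (batch, seq_len)
    return kl.mean()

# Dummy tensors in place of real student/teacher forward passes.
b, t, v = 2, 16, 32
loss = on_policy_distill_loss(torch.randn(b, t, v, requires_grad=True),
                              torch.randn(b, t, v))
loss.backward()
```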

Conclusion

The collection of papers suggests that RL alone does not enlarge LLM reasoning ability; it merely optimizes within the existing capability envelope. The model-conditioned optimization bias explains why RL improves sampling efficiency without raising the capability ceiling, and points toward hybrid methods or geometry-aware algorithms as promising future directions.

Model Optimization · LLM · Reinforcement Learning · SFT · Catastrophic Forgetting
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.
