How Reinforcement Learning Can Boost LLM Reasoning by Shaping Token Distributions

Recent research shows that applying reinforcement learning to large language models can substantially improve reasoning performance, but that its effectiveness depends on the token distribution produced during pre‑training. This motivates recasting cross‑entropy as a single‑step policy gradient with controllable entropy parameters.

PaperAgent

Pain Points in Current LLM Training

Traditional pipelines follow a three‑stage process: pre‑training, supervised fine‑tuning (SFT), and reinforcement learning (RL). The pre‑training objective is fixed to cross‑entropy, which is not designed to shape the token distribution for downstream RL exploration. High‑entropy distributions are intuitively assumed to aid RL exploration, but this assumption has not been systematically validated. Moreover, reward shaping is usually applied only during the RL stage, so the pre‑training phase has no mechanism for influencing the token distribution.

Core Method: Rewriting Cross‑Entropy as a Policy Gradient

The authors formalize next‑token prediction as a single‑step Markov decision process (MDP) with the following correspondences:

State sₜ : the prefix x<t (all tokens before position t)

Action aₜ : the next token

Policy πθ : the language model

Reward r(sₜ,aₜ) : a designable signal (standard cross‑entropy corresponds to a sparse reward of 1/πθ(xₜ | sₜ) on the ground‑truth token xₜ and 0 on every other token)
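The correspondence above can be checked numerically. The gradient of log πθ(xₜ | sₜ) equals the expectation over actions a ~ πθ of r(a) ∇ log πθ(a | sₜ) when r is 1/πθ(xₜ | sₜ) on the ground‑truth token and 0 elsewhere, with the reward treated as a constant. The following is a minimal sketch of that identity for a single position, taking gradients with respect to the logits; the function names and the toy vocabulary size are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ce_grad_wrt_logits(logits, target):
    """Gradient of log pi(target) w.r.t. the logits (the negative CE gradient)."""
    pi = softmax(logits)
    onehot = np.zeros_like(pi)
    onehot[target] = 1.0
    return onehot - pi

def policy_grad_wrt_logits(logits, target):
    """Exact single-step policy gradient: sum_a pi(a) * r(a) * grad log pi(a).

    The reward is sparse: 1 / pi(target) on the ground-truth token, 0 elsewhere.
    It is treated as a constant (no gradient flows through it).
    """
    pi = softmax(logits)
    vocab = pi.shape[0]
    reward = np.zeros(vocab)
    reward[target] = 1.0 / pi[target]
    grad = np.zeros(vocab)
    for a in range(vocab):
        onehot = np.zeros(vocab)
        onehot[a] = 1.0
        grad_log_pi_a = onehot - pi  # d log pi(a) / d logits
        grad += pi[a] * reward[a] * grad_log_pi_a
    return grad

# Toy example: 8-token vocabulary, ground-truth token index 3 (both arbitrary).
rng = np.random.default_rng(0)
logits = rng.normal(size=8)
target = 3
g_ce = ce_grad_wrt_logits(logits, target)
g_pg = policy_grad_wrt_logits(logits, target)
print(np.allclose(g_ce, g_pg))  # prints True
```

Only the ground‑truth term survives the sum, and its weight πθ(xₜ) · 1/πθ(xₜ) is 1, which is why the two gradients coincide. Changing the reward is then the natural lever for changing what the objective optimizes.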

From this formulation they derive a generalized reward function that introduces two key hyper‑parameters:

β : scales the positive‑sample reward. β < 0 yields globally lower‑entropy distributions, β > 0 yields higher‑entropy ones, and β = 0 recovers standard cross‑entropy.

λ̃ / λ̂ : apply differentiated penalties or bonuses to the top‑k negative samples, enabling local fine‑tuning of entropy.
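"Entropy" here refers to the Shannon entropy of the model's next‑token distribution. As a minimal sketch of what low‑ and high‑entropy distributions look like, the snippet below computes that entropy for a sharpened and a flattened version of the same logits. The temperature used to sharpen or flatten is only an illustrative stand‑in and is not the β parameter from the paper; the logit values are arbitrary.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

logits = np.array([2.0, 1.0, 0.5, 0.0])  # arbitrary toy logits

base = softmax(logits)
sharp = softmax(logits / 0.5)   # temperature < 1: mass concentrates on the top token
flat = softmax(logits / 2.0)    # temperature > 1: mass spreads across tokens
uniform = np.full(4, 0.25)      # maximum-entropy reference, log(4) nats

for name, dist in [("sharp", sharp), ("base", base), ("flat", flat), ("uniform", uniform)]:
    print(f"{name:8s} entropy = {entropy(dist):.4f} nats")
```

A low‑entropy model commits most of its probability to a few tokens at each step, while a high‑entropy model keeps many candidates alive. The experiments summarized below ask which of the two is the better starting point for RL.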

Experimental Design

Three training stages were evaluated:

Pre‑Train : 500 B tokens, models ranging from 1 B to 10 B parameters (dense and MoE variants). Metrics: perplexity, entropy, Pass@64.

Mid‑Train : 100 B tokens, same model families. Metric: average knowledge and reasoning scores.

RLVR : 1 k steps of GRPO on 4 B/10 B models. Metrics: Avg@128, Cons@128, Pass@64.
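The summary does not spell out how these metrics are computed. As a rough sketch under commonly used definitions (an assumption, not confirmed by the source): Pass@k is the probability that at least one of k samples is correct, estimated without bias from n ≥ k samples as in Chen et al. (2021); Avg@n is the mean accuracy over n samples; and Cons@n scores the majority‑vote answer over n samples. The function names and toy answers below are hypothetical.

```python
from collections import Counter
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def avg_at_n(correct_flags):
    """Mean per-sample accuracy over n samples."""
    return sum(correct_flags) / len(correct_flags)

def cons_at_n(answers, reference):
    """1.0 if the most frequent answer matches the reference, else 0.0.

    Ties are broken by first occurrence, following Counter.most_common.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    return 1.0 if majority == reference else 0.0

# Toy example: 8 sampled answers to one problem whose reference answer is "42".
answers = ["42", "42", "17", "42", "9", "17", "42", "3"]
reference = "42"
flags = [a == reference for a in answers]

print(avg_at_n(flags))                         # 4 of 8 correct
print(cons_at_n(answers, reference))           # "42" is the majority answer
print(pass_at_k(len(answers), sum(flags), 2))  # 1 - C(4,2)/C(8,2)
```

Under these definitions Avg@n reflects typical single‑sample quality, Cons@n reflects how reliably the model converges on one answer, and Pass@k reflects whether a correct answer is reachable at all within k tries.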

Key Results

Pre‑Training – Low‑Entropy Models Perform Better Later

After 500 B tokens of pre‑training, setting β = ‑0.25 (low entropy) improves Pass@64 on mathematical tasks by 1.3–2.0 points. High‑entropy settings (β = 0.5) show modest early gains but scale more slowly.

Mid‑Training – Low‑Entropy Advantage Extends to Knowledge Tasks

For a 4 B dense model, β = ‑0.25 yields Knowledge Avg = 41.37 and Reasoning Avg = 60.35, slightly ahead of the standard CE baseline (β = 0), which scores 41.30 and 59.73 respectively (gains of 0.07 and 0.62 points).

RL Stage – Low‑Entropy Prior Leads to Higher Upper Bounds and Smoother Entropy Curves

On the AIME‑24 benchmark, the 4 B dense model with β = ‑0.25 improves Avg@128 by 0.95 points. Entropy curves show that high‑entropy configurations collapse early, reducing response length, whereas low‑entropy settings maintain stable performance.

Key Insights

High entropy does not guarantee better exploration; a low‑entropy prior provides sharper initial signals, reducing wasted exploration.

Suppressing negative samples can hurt diversity; preserving top‑k negative rewards (λ̃ = 0.1) actually raises Pass@k.

The pre‑training objective can be reshaped to influence RL exploration, establishing a causal chain from pre‑training shaping → RL search space → final reasoning performance.

Takeaway

Cross‑entropy should be viewed as the zeroth‑step policy gradient rather than a fixed loss that cannot be adjusted. By deliberately lowering entropy during pre‑training, models find reasoning paths faster, more accurately, and more stably during subsequent RL fine‑tuning, which points to a potential new paradigm of “RL‑ready” pre‑training beyond 2025.

https://arxiv.org/pdf/2512.22955
Diversity or Precision? A Deep Dive into Next Token Prediction