How Reinforcement Learning Can Boost LLM Reasoning by Shaping Token Distributions
Recent research shows that applying reinforcement learning (RL) to large language models can substantially improve reasoning performance, but that its effectiveness depends on the token distribution produced during pre‑training. This motivates rewriting cross‑entropy as a single‑step policy gradient with controllable entropy parameters.
Pain Points in Current LLM Training
Traditional pipelines follow a three‑stage process: pre‑training, supervised fine‑tuning (SFT), and reinforcement learning (RL). The pre‑training objective is fixed to cross‑entropy, which does nothing to tailor the token distribution for downstream RL exploration. High‑entropy distributions are commonly assumed to aid RL exploration, but this assumption has not been systematically validated. Moreover, reward shaping is usually applied only during the RL stage, so the pre‑training phase offers no lever for shaping the token distribution.
Core Method: Rewriting Cross‑Entropy as a Policy Gradient
The authors formalize next‑token prediction as a single‑step Markov decision process (MDP) with the following correspondences:
State sₜ : the prefix X<t
Action aₜ : the next token
Policy πθ : the language model
Reward r(sₜ,aₜ) : a designable signal (standard cross‑entropy corresponds to a sparse reward of 1/π(xₜ) on the ground‑truth token xₜ and 0 on every other token)
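The correspondence in the list above can be checked numerically. The sketch below is not the authors' code; it assumes a toy softmax policy over four tokens and treats the reward as a constant (no gradient flows through it). Under those assumptions, the expected policy gradient with a reward of 1/π(xₜ) on the ground‑truth token and 0 elsewhere should match the gradient of log π(xₜ), which is the direction cross‑entropy training follows. All function names and the example logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(probs, action):
    """Gradient of log softmax(z)[action] w.r.t. the logits: onehot - probs."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def cross_entropy_direction(logits, target):
    """Negative gradient of the CE loss -log pi(target), i.e. grad of log pi(target)."""
    return grad_log_prob(softmax(logits), target)

def sparse_ce_reward(logits, target):
    """Reward 1/pi(target) on the ground-truth token, 0 on all others."""
    probs = softmax(logits)
    return [1.0 / probs[target] if a == target else 0.0
            for a in range(len(logits))]

def expected_policy_gradient(logits, reward):
    """Exact expectation over actions of r(a) * grad log pi(a), reward held constant."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, p_a in enumerate(probs):
        g = grad_log_prob(probs, a)
        for i in range(len(grad)):
            grad[i] += p_a * reward[a] * g[i]
    return grad

logits = [2.0, 0.5, -1.0, 0.1]   # toy logits, chosen arbitrarily
target = 1                       # index of the ground-truth next token

ce_dir = cross_entropy_direction(logits, target)
pg_dir = expected_policy_gradient(logits, sparse_ce_reward(logits, target))
max_gap = max(abs(a - b) for a, b in zip(ce_dir, pg_dir))
print("largest difference between the two gradients:", max_gap)
```

The two gradients agree up to floating‑point error because the factor π(xₜ) from the expectation cancels the 1/π(xₜ) in the reward.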
From this formulation they derive a generalized reward function that introduces two key hyper‑parameters:
β : scales the positive‑sample reward; β<0 yields globally low‑entropy distributions, β>0 yields high‑entropy ones.
λ̃ / λ̂ : applies differentiated penalties or bonuses to top‑k negative samples, enabling local entropy fine‑tuning.
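To make "low‑entropy" and "high‑entropy" concrete, the sketch below computes the Shannon entropy of two toy next‑token distributions. It does not implement the generalized reward or the β and λ parameters, whose exact form is not given here; it only illustrates the quantity those parameters are said to control. The example distributions are made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical next-token distributions over a 4-token vocabulary.
peaked = [0.90, 0.05, 0.03, 0.02]   # most mass on one token: low entropy
flat = [0.25, 0.25, 0.25, 0.25]     # mass spread evenly: maximal entropy

print("peaked:", entropy(peaked))
print("flat:  ", entropy(flat))
```

A uniform distribution over V tokens has entropy ln V, the maximum possible; the peaked distribution sits well below that.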
Experimental Design
Three training stages were evaluated:
Pre‑Train : 500 B tokens, models ranging from 1 B to 10 B parameters (dense and MoE variants). Metrics: perplexity, entropy, Pass@64.
Mid‑Train : 100 B tokens, same model families. Metric: average knowledge and reasoning scores.
RLVR : 1 k steps of GRPO on 4 B/10 B models. Metrics: Avg@128, Cons@128, Pass@64.
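The sampling metrics named above can be sketched as follows. The article does not spell out their definitions, so this assumes the common ones: Avg@n is the mean accuracy over n sampled answers, Cons@n scores the majority‑vote answer, and Pass@k uses the standard unbiased estimator 1 − C(n−c, k)/C(n, k) for n samples of which c are correct. The function names and the example answers are hypothetical.

```python
from collections import Counter
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def avg_at_n(answers, reference):
    """Fraction of sampled answers that match the reference."""
    return sum(a == reference for a in answers) / len(answers)

def cons_at_n(answers, reference):
    """1.0 if the most frequent sampled answer matches the reference, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return 1.0 if majority == reference else 0.0

# Toy example: 8 samples for one problem, 3 of them correct.
answers = ["42"] * 3 + ["7"] * 5
reference = "42"
n_correct = sum(a == reference for a in answers)

print("Avg@8 :", avg_at_n(answers, reference))
print("Cons@8:", cons_at_n(answers, reference))
print("Pass@4:", pass_at_k(len(answers), n_correct, 4))
```

In this toy case the majority answer is wrong, so Cons@8 is 0 even though Pass@4 is high. That gap is why the three metrics are reported side by side: Pass@k rewards any correct sample, while Avg@n and Cons@n reward reliability.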
Key Results
5.1 Pre‑Training – Low‑Entropy Models Perform Better Later
Setting β = ‑0.25 (low entropy) improves Pass@64 on math tasks by 1.3–2.0 points after 500 B tokens of pre‑training. High‑entropy settings (β = 0.5) show modest early gains but scale more slowly.
5.2 Mid‑Training – Low‑Entropy Advantage Extends to Knowledge Tasks
For a 4 B dense model, β = ‑0.25 yields Knowledge Avg = 41.37 and Reasoning Avg = 60.35, slightly ahead of the standard CE baseline (β = 0), which scores 41.30 and 59.73 respectively (gains of 0.07 and 0.62 points).
5.3 RL Stage – Low‑Entropy Prior Leads to Higher Upper Bounds and Smoother Entropy Curves
On the AIME‑24 benchmark, the 4 B dense model with β = ‑0.25 improves Avg@128 by 0.95 points. Entropy curves show that high‑entropy configurations collapse early, reducing response length, whereas low‑entropy settings maintain stable performance.
Key Insights
High entropy does not guarantee better exploration; a low‑entropy prior provides sharper initial signals, reducing wasted exploration.
Suppressing negative samples can hurt diversity; preserving top‑k negative rewards (λ̃ = 0.1) actually raises Pass@k.
The pre‑training objective can be reshaped to influence RL exploration, establishing a causal chain from pre‑training shaping → RL search space → final reasoning performance.
Takeaway
Cross‑entropy should be viewed as the zeroth‑step policy gradient rather than as a fixed, untouchable loss. By deliberately lowering entropy during pre‑training, models find reasoning paths faster, more accurately, and more stably during subsequent RL fine‑tuning, which points to a potential new paradigm of “RL‑ready” pre‑training beyond 2025.
https://arxiv.org/pdf/2512.22955
Diversity or Precision? A Deep Dive into Next Token Prediction
