Breaking Homogeneous Reasoning: I²B‑LPO Guides RLVR from Repeated Sampling to Effective Exploration
I²B‑LPO is an exploration‑enhancement framework for RLVR that branches rollouts at high‑entropy nodes, injects latent variables via pseudo self‑attention, and filters paths with an information‑bottleneck self‑reward, yielding accuracy gains of up to 5.3% and diversity gains of up to 7.4% across several math reasoning benchmarks.
Overview
I²B‑LPO is a post‑training exploration‑enhancement framework for Reinforcement Learning with Verifiable Reward (RLVR). It improves rollout strategies to generate more diverse reasoning trajectories, moving exploration from "repeated sampling" to "high‑distinguishability paths" and boosting both accuracy (up to 5.3%) and semantic diversity (up to 7.4%) on several mathematical benchmarks.
Theoretical and Phenomenon Analysis
1. High‑entropy nodes are true branching points. Token‑level entropy‑grouping experiments show that performance differences between decoding strategies widen when the model is in a high‑entropy region, indicating that high‑entropy positions correspond to critical decision points well suited for branching.
2. Reasoning length does not equal effective reasoning. Under standard GRPO training, accuracy plateaus early while trajectory length and 4‑gram repetition keep rising, suggesting the model generates longer yet increasingly redundant content rather than genuinely useful reasoning.
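For reference, here is a minimal sketch of one way to measure 4‑gram repetition, defined as the fraction of 4‑grams that duplicate an earlier one; the paper's exact metric is not specified in this summary, so this definition is an assumption.

```python
from collections import Counter

def ngram_repetition_rate(tokens: list[str], n: int = 4) -> float:
    """Fraction of n-grams that repeat an earlier n-gram.

    Assumed definition: 1 - (unique n-grams / total n-grams);
    the paper's exact repetition metric may differ.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(Counter(ngrams)) / len(ngrams)

# A looping trajectory repeats most of its 4-grams.
print(ngram_repetition_rate("a b c d a b c d a b c d".split()))  # ~0.56
```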
Core Innovations
I²B‑LPO combines two mechanisms:
Entropy‑driven latent variable branching at high‑entropy ("hesitation") nodes.
Information‑bottleneck (IB) self‑reward that ranks and filters generated paths.
Method Details
Entropy‑driven latent branching. For each initial rollout, the framework identifies high‑entropy "hesitation" nodes (entropy H_t) and samples latent variables at those points. A pseudo self‑attention (PSA) module injects the latent variables into RMSNorm scaling, then maps them to additional Key and Value vectors, influencing subsequent generation.
The entropy at step t, H_t, measures uncertainty of the next token; higher H_t indicates a "hesitant" position likely to admit multiple reasoning directions. Positions with entropy above a threshold τ are selected as branching points.
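A minimal sketch of how such branch points could be selected from per‑step logits is given below; the threshold value and the use of nats are assumptions, and the paper's exact selection rule may differ.

```python
import torch

def select_branch_points(logits: torch.Tensor, tau: float = 2.0) -> torch.Tensor:
    """Return indices of high-entropy ("hesitation") steps.

    logits: (seq_len, vocab_size) next-token logits for one rollout.
    tau: entropy threshold in nats (hypothetical value).
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Shannon entropy H_t = -sum_v p_t(v) * log p_t(v), per step.
    entropy = -(probs * log_probs).sum(dim=-1)
    return torch.nonzero(entropy > tau).squeeze(-1)
```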
Latent variables are injected via PSA: they modulate the RMSNorm scaling through a gain γ(t) that decays over the course of generation, are projected to extra Key and Value vectors, and are concatenated with the original Keys and Values before attention is computed; a plausible form of the resulting attention is sketched below.
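The source's exact equation is not reproduced in this summary; under the description above, a plausible form, treating $K_z, V_z$ as the latent‑derived Key/Value projections and $\gamma(t)$ as the decaying gain (all assumptions of this sketch), is:

$$
\mathrm{PSA}(Q, K, V; z) = \mathrm{softmax}\!\left(\frac{Q\,[K;\, K_z]^{\top}}{\sqrt{d_k}}\right)[V;\, V_z],
\qquad K_z = \gamma(t)\, W_K z, \quad V_z = \gamma(t)\, W_V z,
$$

where $[\,\cdot\,;\,\cdot\,]$ denotes concatenation along the sequence axis and $z$ is the sampled latent variable. Because $\gamma(t)$ decays, the latent hint is strongest right after the branch point and fades as generation proceeds.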
PSA thus provides a "latent reasoning hint" that steers the rollout into distinct, high‑information paths.
Information‑bottleneck self‑reward. After generating candidate trajectories, I²B‑LPO scores each one using an IB metric that balances brevity against answer relevance; one plausible instantiation is sketched below.
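The exact scoring formula is likewise not reproduced here; a standard IB trade‑off consistent with the description, with trajectory $\tau$, prompt $x$, verified answer $y$, and trade‑off weight $\beta$, together with a common tractable surrogate that uses a length penalty for the compression term, would be:

$$
R_{\mathrm{IB}}(\tau) \;=\; I(\tau;\, y) \;-\; \beta\, I(\tau;\, x)
\;\;\approx\;\; \log p_\theta(y \mid \tau) \;-\; \beta\,|\tau|,
$$

so trajectories that are predictive of the verified answer while carrying little excess content score highest.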
Higher scores indicate concise, effective paths; low‑IB trajectories (empty fluff, repetitive loops, logical drift) are filtered out, and the top‑N trajectories are kept for the GRPO policy update.
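Combining the score with the filter, a minimal end‑to‑end sketch follows; the function names, the surrogate score, and the field layout are hypothetical, not the paper's implementation.

```python
def ib_score(log_p_answer: float, length: int, beta: float = 0.01) -> float:
    """Surrogate IB self-reward: answer relevance minus a brevity penalty.

    log_p_answer: log-likelihood of the verified answer given the trajectory
    (hypothetical relevance term); length: trajectory length in tokens.
    """
    return log_p_answer - beta * length

def filter_top_n(trajectories: list[dict], n: int) -> list[dict]:
    """Keep the n highest-scoring trajectories for the GRPO update."""
    ranked = sorted(
        trajectories,
        key=lambda t: ib_score(t["log_p_answer"], t["length"]),
        reverse=True,
    )
    return ranked[:n]

# Low-IB paths (fluff, loops, drift) fall to the bottom and are dropped.
rollouts = [
    {"id": "direct", "log_p_answer": -1.2, "length": 180},
    {"id": "loopy",  "log_p_answer": -1.5, "length": 900},
]
print([t["id"] for t in filter_top_n(rollouts, n=1)])  # ['direct']
```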
Experiments
Training data. Samples are drawn from DAPO and MATH, filtered for difficulty and length, leaving 6,486 MATH and 13,583 DAPO examples.
Benchmarks. Evaluations are performed on AIME2025/2024, MATH‑500, OlympiadBench, and GSM8K.
AIME2025/2024 – high‑school competition problems.
MATH‑500 – diverse algebra, geometry, number theory, probability tasks.
OlympiadBench – olympiad‑level, long‑chain reasoning.
GSM8K – middle‑school arithmetic.
Tables show that I²B‑LPO consistently improves both accuracy and diversity across model scales (Qwen2.5‑7B, Qwen3‑14B) and benchmarks, supporting the headline gains of up to 5.3% in accuracy and 7.4% in diversity.
Entropy distribution analysis (Fig. 3) reveals that standard GRPO collapses to low‑entropy regions, while I²B‑LPO maintains a balanced entropy profile, preventing premature convergence to a single reasoning template.
Attention‑head activation visualizations (Fig. 4) show that PSA‑injected latent variables activate deeper heads relevant to difficult problems, unlike shallow or softmax‑only perturbations.
Failure‑mode analysis of low‑IB trajectories identifies three typical issues: empty fluff, repetitive loops, and logical drift. High‑IB trajectories are shorter and more direct, and each step contributes to the final answer.
Conclusion
The study demonstrates that standard random rollouts lead to homogeneous reasoning templates, weakening reward signals. I²B‑LPO addresses this by branching at high‑entropy nodes and applying an information‑bottleneck self‑reward, thereby achieving more efficient and reliable exploration in RLVR and improving both accuracy and diversity on a range of mathematical reasoning benchmarks.