Unveiling Bottom‑up Policy Optimization: Boosting LLM Reasoning with Internal Strategies

This article introduces Bottom‑up Policy Optimization (BuPO), a reinforcement‑learning framework that treats a large language model as a collection of internal layer and modular policies. The work reveals distinct internal entropy patterns in Llama and Qwen‑3 models and reports stronger results than standard RL baselines on complex mathematical reasoning benchmarks.

PaperAgent

Large language models (LLMs) combined with reinforcement learning (RL) are typically treated as a single monolithic policy, and most RL work focuses on surface‑level reward design. This overlooks the hierarchical internal mechanisms of LLMs.

Internal Policies

An LLM generates each token by sampling from a probability distribution obtained from the final hidden state multiplied by the unembedding matrix. By applying the logit‑lens insight and the additive decomposition of transformer residual streams, any intermediate hidden state, or the output of a specific module such as self‑attention or the feed‑forward network (FFN), can be combined with the same unembedding matrix to produce a samplable distribution. These distributions are called Internal Layer Policies and Internal Modular Policies, respectively. This view enables analysis of how reasoning emerges across layers and suggests that optimizing these internal processes could improve overall performance.
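To make the construction concrete, here is a minimal pure‑Python sketch with toy numbers. The vectors, the 2×3 unembedding matrix `W_U`, and the helper names are all illustrative assumptions, not values or code from the paper. Real transformers also usually apply a final normalization before unembedding, which this sketch leaves out.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def unembed(hidden, W_U):
    """Project a hidden vector (length d) through W_U (d x vocab) to logits."""
    vocab = len(W_U[0])
    return [sum(hidden[i] * W_U[i][v] for i in range(len(hidden)))
            for v in range(vocab)]

def internal_policy(hidden, W_U):
    """Turn any hidden vector into a samplable distribution over the vocabulary."""
    return softmax(unembed(hidden, W_U))

def add(a, b):
    """Element-wise sum, mimicking a residual connection."""
    return [x + y for x, y in zip(a, b)]

# Toy residual stream: token embedding plus additive module outputs (made-up values).
embedding = [0.5, -0.2]
attn_out = [0.1, 0.3]
ffn_out = [-0.4, 0.6]
W_U = [[1.0, 0.0, -1.0],
       [0.0, 1.0, 0.5]]

h_after_attn = add(embedding, attn_out)
h_after_ffn = add(h_after_attn, ffn_out)

layer_policy = internal_policy(h_after_ffn, W_U)  # Internal Layer Policy
ffn_policy = internal_policy(ffn_out, W_U)        # Internal Modular Policy
```

Both distributions share the same unembedding matrix; they differ only in which vector is projected, which is why the layer policy and the FFN's own policy can favor different tokens.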

Internal Policy Entropy

The authors introduce Internal Policy Entropy to measure the entropy of an internal policy. They also define Internal Policy Entropy Change as the difference between the entropy of a module’s output and its input, indicating whether the module adds uncertainty (exploration) or drives convergence.
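As a rough illustration, the two quantities can be computed with ordinary Shannon entropy. The sketch below assumes entropy in nats and takes the "input" and "output" policies as already‑computed probability lists; exactly which hidden states the paper uses for each is not spelled out here, so treat that pairing as an assumption. The function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_change(policy_in, policy_out):
    """Entropy of a module's output policy minus entropy of its input policy."""
    return entropy(policy_out) - entropy(policy_in)

# Toy distributions (made-up values): a uniform input and a sharply peaked output.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

delta = entropy_change(uniform, peaked)  # negative: the module drives convergence
```

A positive change means the module spreads probability mass (exploration), and a negative change means it concentrates mass on fewer tokens (convergence).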

Key Observations

Universal Entropy Flow: Across model families, lower layers maintain high entropy (exploration) while top layers quickly collapse to near‑zero entropy for final prediction.

Model‑specific patterns:

Llama series: Later FFN layers show a slight positive entropy change, suggesting shallow, dispersed exploration without strong intermediate integration.

Qwen‑3 series: Exhibits a three‑stage pattern: exploration (positive entropy change), integration (entropy change near zero), and convergence (negative entropy change). This structured reasoning aligns with human‑like progressive problem solving and may explain Qwen‑3’s superior knowledge absorption.
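The three stages can be read off a per‑layer entropy‑change profile by its sign. The sketch below is only an illustration: the tolerance `tol` and the toy profile are invented for this example and are not measurements from Llama or Qwen‑3.

```python
def stage(delta, tol=0.05):
    """Label one layer by the sign of its entropy change (tol is hypothetical)."""
    if delta > tol:
        return "exploration"
    if delta < -tol:
        return "convergence"
    return "integration"

# Invented per-layer entropy changes, ordered from lower to upper layers.
toy_profile = [0.30, 0.12, 0.01, -0.02, -0.25, -0.40]
stages = [stage(d) for d in toy_profile]
```

A profile that moves from exploration through integration to convergence matches the Qwen‑3 description above; one that stays slightly positive in later layers would look more like the Llama description.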

Figure 1: Transformer residual streams decompose into additive contributions from lower layers, enabling extraction of intermediate hidden states; the language‑model policy consists of multiple internal policies.

Optimizing Samplable Internal Policies

Treating each internal layer policy as an optimizable strategy yields several phenomena:

Internal policies capture higher‑level reasoning information early, aligning and refining features for downstream layers.

They compress internal reasoning uncertainty more effectively than optimizing only the final policy.

Excessive optimization of internal policies can cause performance collapse.

Figure 2: Entropy trajectories of internal policies for different model families; all retain high entropy early and converge later.

Bottom‑up Policy Optimization (BuPO)

Inspired by the bottom‑up emergence of reasoning, the authors propose a two‑phase training paradigm:

Bottom Alignment (early stage): Optimize fine‑grained internal layer policies, specifically FFN layers that exhibit positive exploration signals, to align low‑level features with reasoning goals.

Global Optimization (later stage): Switch to standard language‑model policy optimization to fine‑tune the overall output.

Algorithm 1: Bottom‑up Policy Optimization workflow.
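The two‑phase schedule can be sketched as a simple switch on the training step. This is not the authors' implementation: the step‑count switch criterion, the `bottom_steps` parameter, and the layer index are hypothetical, since the summary only states that internal policies are optimized early and the full policy later.

```python
def bupo_phase(step, bottom_steps):
    """Return the training phase for a step: bottom alignment first, then global."""
    return "bottom_alignment" if step < bottom_steps else "global_optimization"

def policy_to_optimize(step, bottom_steps, internal_layer):
    """Pick the policy whose objective is optimized at this step.

    internal_layer is a hypothetical index of the internal policy chosen for
    bottom alignment; None stands for the final language-model policy.
    """
    if bupo_phase(step, bottom_steps) == "bottom_alignment":
        return internal_layer
    return None

# Toy run: 3 bottom-alignment steps on hypothetical layer 6, then the final policy.
targets = [policy_to_optimize(s, bottom_steps=3, internal_layer=6) for s in range(5)]
```

Keeping the bottom phase short is consistent with the observation that excessive optimization of internal policies can cause performance collapse.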

Experimental Results

Experiments on complex mathematical reasoning benchmarks (MATH, AMC23, AIME24/25) demonstrate the effectiveness of BuPO:

Across Qwen‑3‑4B/8B and Llama‑OctoThinker models, BuPO consistently outperforms GRPO, PPO, Reinforce++, and RLOO.

On Qwen‑3‑4B, BuPO improves Avg@32 on AIME24 by 4.69% relative to GRPO.

On Llama‑OctoThinker‑8B, MATH‑500 scores increase by 5.16%.

BuPO achieves the best or second‑best Pass@K performance across a range of sampling settings, indicating robust generation quality.
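For readers unfamiliar with these metrics, the sketch below shows one common way to compute them. It assumes Avg@K is the mean correctness over K sampled answers and that Pass@K uses the widely used unbiased estimator 1 - C(n-c, k)/C(n, k); the summary does not state which estimator the authors use, so this is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)

def avg_at_k(correct_flags):
    """Mean correctness over K sampled answers (1 = correct, 0 = incorrect)."""
    return sum(correct_flags) / len(correct_flags)

# Toy example: 32 samples for one problem, 8 of them correct.
flags = [1] * 8 + [0] * 24
avg32 = avg_at_k(flags)
p1 = pass_at_k(n=32, c=8, k=1)
```

With a single draw (k = 1), Pass@K equals the fraction of correct samples, so it coincides with Avg@K on the same samples; larger k rewards models that produce at least one correct answer among several attempts.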

Figure 3: Pass@K performance comparison.

Figure 4: Dynamic entropy curves during BuPO training show expanded early‑stage exploration.

Conclusion

Bottom‑up Policy Optimization provides both an algorithmic advance and a new interpretability lens for LLMs. It reveals that an LLM’s policy is a composition of many intertwined internal policies rather than a single black‑box function. By optimizing these components from the bottom up, foundational reasoning abilities can be reconstructed, bridging interpretability research with reinforcement‑learning algorithm design.

Paper: https://arxiv.org/abs/2512.19673

Code: https://github.com/Trae1ounG/BuPO
