Balanced Thinking: Boost LLM Accuracy by 10% While Cutting Inference Length 35%

The paper introduces ReBalance, a training‑free two‑stage inference control framework that uses model confidence signals to dynamically balance reasoning depth, achieving up to a 10‑point accuracy gain and a 35.4% reduction in token length across multiple LLM sizes and benchmarks.

Machine Heart

Recent LLM research has highlighted the "overthinking" problem, where models continue redundant reasoning after reaching the correct answer, and "underthinking", where they halt prematurely. Both extremes harm efficiency and accuracy.

Balanced Thinking: Redefining Efficient Reasoning

The authors propose the concept of Balanced Thinking, asserting that efficient inference is not about blindly shortening reasoning chains but about maintaining a dynamic equilibrium between excessive and insufficient thought.

ReBalance Framework

ReBalance implements a training‑free, two‑stage control process:

Offline data collection: A single forward pass on a small labeled set records step‑level confidence and its variance. Steps showing over‑ or under‑thinking are identified, and hidden‑state prototypes for each condition are extracted. The difference between prototypes forms a steering vector that encodes the direction between the two imbalance states.
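The steering-vector construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the use of mean hidden states as prototypes, and the unit normalization are assumptions; the paper only specifies that the vector is the difference between the two condition prototypes.

```python
import numpy as np

def build_steering_vector(over_states, under_states):
    """Build a steering vector from labeled reasoning steps.

    over_states / under_states: arrays of shape (n_steps, hidden_dim)
    holding hidden states of steps labeled as over- or under-thinking.
    """
    over_proto = np.mean(over_states, axis=0)    # prototype for over-thinking
    under_proto = np.mean(under_states, axis=0)  # prototype for under-thinking
    v = over_proto - under_proto                 # direction between the two imbalance states
    return v / np.linalg.norm(v)                 # unit-normalize for stable scaling later

# Toy example: 4-dim hidden states from a handful of labeled steps.
over = np.array([[1.0, 0.0, 0.0, 0.0], [0.8, 0.2, 0.0, 0.0]])
under = np.array([[0.0, 1.0, 0.0, 0.0], [0.2, 0.8, 0.0, 0.0]])
vec = build_steering_vector(over, under)
```

Because only mean vectors and one subtraction are needed, this offline stage is a single pass over a small labeled set, which is what makes the method training-free.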

Online dynamic steering: During inference, the current step’s confidence and variance are continuously monitored. A control function, fitted to model behavior, decides the steering direction and magnitude. Low confidence with high variance triggers reinforcement to converge faster, while high confidence with low variance triggers reverse steering to encourage deeper exploration.
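The online control rule can be sketched as a signed coefficient applied to the steering vector. All thresholds and scaling constants below are illustrative placeholders, not values from the paper, and the function names are hypothetical:

```python
import numpy as np

def steering_signal(confidence, variance,
                    conf_threshold=0.7, var_threshold=0.05, max_scale=1.0):
    """Map the current step's confidence and local variance to a signed
    steering coefficient. Positive = reinforce toward convergence;
    negative = reverse-steer to encourage deeper exploration."""
    if confidence < conf_threshold and variance > var_threshold:
        # Low confidence + high variance (over-thinking pattern):
        # push toward convergence, scaled by the confidence gap.
        return min(max_scale, (conf_threshold - confidence) * 2.0)
    if confidence > conf_threshold and variance < var_threshold:
        # High confidence + low variance (under-thinking pattern):
        # reverse-steer to keep the model exploring.
        return -min(max_scale, (confidence - conf_threshold) * 2.0)
    return 0.0  # balanced regime: no intervention

def apply_steering(hidden_state, steering_vec, coeff):
    """Shift the current hidden state along the steering direction."""
    return np.asarray(hidden_state) + coeff * np.asarray(steering_vec)
```

The key property is that intervention strength varies continuously with the monitored signals, rather than being a fixed truncation rule.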

The method requires no additional training, auxiliary models, or extra inference stages.

Key Finding: Confidence as a Reliable Continuous Signal

Analysis of step‑level confidence and local confidence variance reveals distinct patterns: overthinking exhibits fluctuating confidence across steps, whereas underthinking shows consistently high confidence with little variance. This demonstrates that confidence can serve as an online, fine‑grained indicator for dynamic inference control.
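One simple way to operationalize these two signals (a sketch under assumptions — the paper's exact definitions of step confidence and local variance may differ) is mean token probability per step and a sliding-window variance over recent steps:

```python
import math

def step_confidence(token_logprobs):
    """Mean token probability within one reasoning step,
    used here as a proxy for step-level confidence."""
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(probs) / len(probs)

def local_confidence_variance(step_confidences, window=3):
    """Variance of confidence over a sliding window of recent steps."""
    recent = step_confidences[-window:]
    mean = sum(recent) / len(recent)
    return sum((c - mean) ** 2 for c in recent) / len(recent)

# Fluctuating confidence (over-thinking pattern) vs. flat high
# confidence (under-thinking pattern).
fluctuating = [0.9, 0.4, 0.8, 0.3]
flat_high = [0.95, 0.96, 0.94, 0.95]
```

On these toy traces the fluctuating sequence yields a much larger local variance than the flat one, matching the qualitative patterns the paper reports.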

Experimental Validation

Experiments were conducted on four slow‑thinking LLM scales (0.5B–32B) across nine benchmarks covering mathematical reasoning, general QA, and code generation. Results show that ReBalance improves Pass@1 accuracy by up to 10.0 points while reducing generated token length by up to 35.4%.

Specific findings include:

Mathematical reasoning tasks: +10.0 points accuracy, -35.4% token count.

GPQA‑D: +6.6 points accuracy, -29.9% token count.

StrategyQA and LiveCodeBench: consistent cross‑domain gains.

Unlike prior length‑penalty methods that truncate both correct and incorrect samples, ReBalance adaptively trims redundant steps for correct paths while preserving necessary reasoning for uncertain paths.

Additional validation on Ascend 910B NPU (openPangu slow‑thinking mode) on the AIME‑2025 benchmark shows a 3.4% accuracy increase with a 35.3% token reduction, confirming deployment feasibility.

Conclusion

ReBalance demonstrates that treating efficient inference as a balanced control problem, guided by native confidence signals, yields simultaneous gains in speed and performance. The project is open‑source, with a public demo and pre‑computed steering vectors to lower reproducibility barriers.

Tags: ReBalance, Efficient Inference, ICLR 2026, Balanced Thinking, Confidence Steering
Written by Machine Heart, a professional AI media and industry service platform.