SABER: Switchable and Balanced Training for Efficient LLM Reasoning

SABER introduces a reinforcement‑learning framework that lets large language models dynamically switch among four token‑budgeted reasoning modes, dramatically cutting inference length while preserving or improving accuracy across math, code, and logic tasks.


Overview

Chain‑of‑thought prompting improves large language model (LLM) reasoning, but applying full reasoning to every query wastes tokens and increases latency. SABER (Switchable and Balanced Training for Efficient LLM Reasoning) is a reinforcement‑learning framework that gives LLMs a controllable, token‑budget‑aware inference capability, letting a single model switch among NoThink, FastThink, CoreThink, and DeepThink modes.

Background

Chain‑of‑thought and test‑time compute scaling break problems into intermediate steps, but they often produce overly long reasoning traces (overthinking), even for trivial questions. Static rules or heuristics cannot adapt reasoning depth to problem difficulty or user preference.

Method

Thinking‑budget design and allocation

SABER first runs the base model on each training sample, counts the tokens between <think> and </think>, and assigns the sample to one of three budget tiers: 128 tokens (easy), 4096 tokens (medium), or 16384 tokens (hard). Samples whose traces exceed 16384 tokens receive no upper limit. The budget is communicated via system prompts, so the model learns to switch modes during fine‑tuning.
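A minimal sketch of this tier assignment, assuming a tokenizer object with an encode method; assign_budget and BUDGET_TIERS are illustrative names, not from the paper:

```python
import re

# Tier thresholds from the paper: 128 / 4096 / 16384 thinking tokens.
BUDGET_TIERS = [128, 4096, 16384]

def assign_budget(trace: str, tokenizer) -> int | None:
    """Map a base-model trace to a thinking-budget tier by counting
    the tokens between <think> and </think>."""
    match = re.search(r"<think>(.*?)</think>", trace, flags=re.DOTALL)
    n_tokens = len(tokenizer.encode(match.group(1))) if match else 0
    for tier in BUDGET_TIERS:
        if n_tokens <= tier:
            return tier
    return None  # traces beyond 16384 tokens receive no upper limit
```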

(Figure: system prompts for the different thinking modes)
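The paper's figure shows the exact prompt wording; the templates below are illustrative stand‑ins that only convey the idea of budget‑tagged system prompts:

```python
# Illustrative system-prompt templates, one per mode; the exact wording
# SABER uses appears in the paper's figure and is not reproduced here.
MODE_PROMPTS = {
    "NoThink":   "Answer directly, without any reasoning.",
    "FastThink": "Reason briefly, using at most 128 thinking tokens.",
    "CoreThink": "Reason within a budget of 4096 thinking tokens.",
    "DeepThink": "Reason in depth, within 16384 thinking tokens.",
}
```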

Sample grouping and stability control

Accuracy‑based grouping: the roughly 40% of samples that the base model cannot answer correctly are split in half: one half keeps the original budget, the other half receives no budget limit. This prevents instability early in training.

Length‑proportion constraint: the number of generated thinking tokens must stay within a range proportional to the base model's reasoning length, which blocks reward hacking via overly short traces. A sketch of both controls follows.
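A sketch of the two stability controls, assuming per‑sample fields base_model_correct and budget; the 0.5–1.0 proportion band is an illustrative assumption:

```python
import random

def accuracy_based_grouping(samples: list[dict]) -> list[dict]:
    """Split the samples the base model answers incorrectly (about 40%
    of the data): half keep their assigned budget, half get no limit."""
    missed = [s for s in samples if not s["base_model_correct"]]
    random.shuffle(missed)
    for s in missed[len(missed) // 2:]:
        s["budget"] = None  # lift the budget on this half
    return samples

def satisfies_proportion(think_len: int, base_len: int,
                         low: float = 0.5, high: float = 1.0) -> bool:
    """Check the length-proportion constraint; the bounds here are
    illustrative assumptions, not the paper's exact range."""
    return low * base_len <= think_len <= high * base_len
```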

No‑think mode construction

A small set of “no‑think” examples, each containing only an ultra‑short placeholder thinking block, is added to teach the model to skip explicit reasoning when instructed.
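A hypothetical construction of one such example; the system‑prompt wording and the empty <think></think> placeholder are assumptions about the exact format:

```python
def make_nothink_example(question: str, answer: str) -> dict:
    """Build a no-think training example whose thinking block is an
    ultra-short placeholder, so the model learns to answer directly
    when the no-think system prompt is active."""
    return {
        "system": "Answer directly, without any reasoning.",  # assumed wording
        "user": question,
        "assistant": "<think>\n</think>\n" + answer,
    }
```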

Direct RL optimization without SFT pre‑training

SABER trains the model directly with GRPO (Group Relative Policy Optimization) reinforcement learning, with no supervised fine‑tuning stage, using a four‑part reward:

Format reward: outputs must wrap reasoning in <think>…</think> tags.

Answer reward: math answers are checked against the \boxed{} expression; code answers are validated by running tests.

Length penalty: exceeding the assigned budget incurs a penalty.

Proportion penalty: deviating too far from the base model’s length incurs a penalty, preventing reward‑hacking.

This enables precise control of reasoning depth while maintaining high‑quality answers across all modes.
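Putting the four terms together, a minimal sketch of the reward under assumed weights and penalty shapes; answer checking (\boxed{} matching for math, test execution for code) is abstracted into a boolean:

```python
import re

def saber_reward(output: str, budget: int | None, base_len: int,
                 answer_correct: bool,
                 len_pen: float = 0.5, prop_pen: float = 0.5) -> float:
    """Compose SABER's four reward terms (weights are illustrative)."""
    # 1. Format reward: reasoning must sit inside <think>...</think>.
    m = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    r_format = 1.0 if m else 0.0
    # 2. Answer reward: correctness is verified externally.
    r_answer = 1.0 if answer_correct else 0.0
    # Whitespace tokens as a cheap proxy for the tokenizer count.
    think_len = len(m.group(1).split()) if m else 0
    # 3. Length penalty: exceeding the assigned budget is penalized.
    p_len = len_pen if budget is not None and think_len > budget else 0.0
    # 4. Proportion penalty: drifting too far from the base model's
    #    reasoning length is penalized to block reward hacking.
    ratio = think_len / max(base_len, 1)
    p_prop = prop_pen if not 0.5 <= ratio <= 1.0 else 0.0
    return r_format + r_answer - p_len - p_prop
```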

(Figure: the SABER framework)

Experiments

The authors evaluate SABER on four research questions (RQ1–RQ4) at the 1.5B and 7B model scales, covering math (MATH, GSM8K), code (MBPP), and logical reasoning (LiveBench‑Reasoning).

RQ1: Comparison with baselines

All SABER modes outperform the base model. FastThink reduces inference length by >70% while improving accuracy; CoreThink further raises overall accuracy; DeepThink retains most reasoning steps, achieves the highest accuracy, and still compresses length.

RQ2: Cross‑scale and cross‑domain generalization

On a 7B model, FastThink cuts length by >80% with minimal accuracy loss, and DeepThink adds modest accuracy gains. Despite training only on math and code data, the mode‑switching mechanism transfers to unseen logical reasoning tasks, demonstrating strong generalization.

RQ3: Ablation study

Removing the budget‑downgrade mechanism harms short‑reasoning learning; dropping the NoThink data degrades the no‑think mode without benefiting the other modes; and eliminating accuracy‑based filtering introduces noisy supervision and destabilizes training.

RQ4: Behavior analysis of reasoning modes

Examples from MATH‑500 show that FastThink provides only essential steps, CoreThink adds reflective explanations, and DeepThink includes post‑answer self‑verification and summarization, illustrating progressively deeper reasoning styles.

Conclusion

SABER demonstrates that a switchable, budget‑aware training paradigm can endow LLMs with efficient, controllable reasoning without extra supervised fine‑tuning. Experiments confirm high accuracy, smooth degradation under tighter budgets, and cross‑task generalization, making SABER a promising direction for cost‑effective large‑model inference.

https://arxiv.org/abs/2508.10026

Tags: LLM · Chain of Thought · Reinforcement Learning · Efficient Reasoning · Budgeted Computation · Switchable Inference
Written by Bilibili Tech

Provides introductions and tutorials on Bilibili‑related technologies.