SABER: Switchable and Balanced Training for Efficient LLM Reasoning
SABER introduces a reinforcement‑learning framework that lets large language models dynamically switch among four token‑budgeted reasoning modes, dramatically cutting inference length while preserving or improving accuracy across math, code, and logic tasks.
Overview
Chain‑of‑thought prompting improves large language model (LLM) reasoning, but applying a full reasoning trace to every query wastes tokens and increases latency. SABER (Switchable and Balanced Training for Efficient LLM Reasoning) is a reinforcement‑learning framework that gives LLMs a controllable, token‑budget‑aware inference capability.
Background
Chain‑of‑thought and test‑time compute scaling break problems into intermediate steps, but they often produce overly long reasoning traces (overthinking), even for trivial questions. Static rules or heuristics cannot adapt reasoning depth to problem difficulty or user preference.
Method
Thinking‑budget design and allocation
SABER first runs the base model on each training sample, counts the tokens between <think> and </think>, and assigns the sample to one of three budget tiers: 128 tokens (easy), 4096 tokens (medium), and 16384 tokens (hard); samples whose traces exceed 16384 tokens receive no upper limit. The assigned budget is communicated via the system prompt, so the model learns to switch modes during fine‑tuning.
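The tiering step described above can be sketched as follows. The whitespace tokenization, the prompt wording, and the function names are illustrative assumptions, not the paper's implementation:

```python
import re

# Budget tiers from the paper: 128 (easy), 4096 (medium), 16384 (hard);
# traces longer than 16384 tokens receive no upper limit (None).
BUDGETS = [128, 4096, 16384]

def count_think_tokens(output: str) -> int:
    """Count tokens between <think> and </think> in a model output.
    Whitespace splitting stands in for the model's real tokenizer."""
    m = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    return len(m.group(1).split()) if m else 0

def assign_budget(output: str):
    """Return the smallest tier that covers the base model's trace,
    or None (unlimited) if the trace exceeds the largest tier."""
    n = count_think_tokens(output)
    for budget in BUDGETS:
        if n <= budget:
            return budget
    return None  # exceeds 16384 tokens: no upper limit

def budget_prompt(budget) -> str:
    """Communicate the budget via the system prompt (wording is illustrative)."""
    if budget is None:
        return "You may reason for as long as needed."
    return f"Limit your reasoning inside <think>...</think> to {budget} tokens."
```

At training time each sample's prompt is built with `budget_prompt(assign_budget(base_output))`, so the budget the model sees matches the difficulty the base model exhibited.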
Sample grouping and stability control
Accuracy‑based grouping: the roughly 40% of samples the base model cannot answer are split, with half keeping their original budget and half receiving no budget limit, preventing instability early in training.
Length‑proportion constraint: generated thinking tokens must stay within a proportional range of the base model’s length, avoiding reward‑hacking by producing overly short traces.
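Both stability controls can be sketched as below. The 0.5–1.5× proportional band and the dict‑based sample representation are assumptions; the paper's exact ratios may differ:

```python
import random

def regroup_hard_samples(samples, solved_by_base):
    """Accuracy-based grouping: samples the base model answers incorrectly
    are split, half keeping their assigned budget and half getting no
    limit (budget=None), which stabilizes early training."""
    hard = [s for s in samples if not solved_by_base[s["id"]]]
    random.shuffle(hard)
    for s in hard[: len(hard) // 2]:
        s["budget"] = None  # lift the cap for half of the hard samples
    return samples

def within_proportion(gen_len: int, base_len: int,
                      lo: float = 0.5, hi: float = 1.5) -> bool:
    """Length-proportion constraint: the generated thinking trace must stay
    within a proportional band of the base model's trace length, so the
    policy cannot reward-hack by emitting trivially short traces."""
    return lo * base_len <= gen_len <= hi * base_len
```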
No‑think mode construction
A small set of “no‑think” examples, each containing only an ultra‑short placeholder thinking block, is added, teaching the model to skip explicit reasoning when instructed.
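Constructing such an example might look like the sketch below; the system‑prompt wording and the exact placeholder contents are assumptions:

```python
def make_nothink_example(question: str, answer: str) -> dict:
    """Build a "no-think" training example: the thinking block holds only a
    short placeholder, so the model learns to emit the answer directly
    when the instruction asks for no explicit reasoning."""
    return {
        "system": "Answer directly without step-by-step reasoning.",
        "user": question,
        # Empty placeholder block keeps the output format consistent
        # with the other modes while skipping actual reasoning tokens.
        "target": f"<think>\n</think>\n{answer}",
    }
```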
Direct RL optimization without SFT pre‑training
SABER trains the model with GRPO reinforcement learning, using a four‑part reward:
Format reward: outputs must wrap reasoning in <think>…</think> tags.
Answer reward: math answers are checked against the final \boxed{} expression; code answers are validated by running unit tests.
Length penalty: exceeding the assigned budget incurs a penalty.
Proportion penalty: deviating too far from the base model’s length incurs a penalty, preventing reward‑hacking.
This enables precise control of reasoning depth while maintaining high‑quality answers across all modes.
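The four reward components above can be combined in a single scoring function. The weights and the proportional band are assumptions, and the correctness check (\boxed{} matching or test execution) is abstracted into a boolean flag:

```python
import re

def saber_reward(output: str, correct: bool, budget, base_len: int,
                 w_fmt=0.1, w_ans=1.0, w_len=0.5, w_prop=0.5) -> float:
    """Sketch of the four-part reward used for GRPO training.
    Weights (w_*) and the 0.5-1.5x proportion band are illustrative."""
    m = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    fmt_ok = m is not None                       # format reward: tags present
    think_len = len(m.group(1).split()) if m else 0

    reward = (w_fmt if fmt_ok else 0.0) + (w_ans if correct else 0.0)

    # Length penalty: exceeding the assigned budget (None = unlimited).
    if budget is not None and think_len > budget:
        reward -= w_len

    # Proportion penalty: deviating too far from the base model's length,
    # which blocks reward-hacking via trivially short traces.
    if base_len > 0 and not (0.5 * base_len <= think_len <= 1.5 * base_len):
        reward -= w_prop

    return reward
```

In GRPO the policy samples a group of completions per prompt, and this scalar reward (normalized within the group) drives the policy update; no SFT warm‑up phase is needed.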
Experiments
The authors evaluate SABER on four research questions (RQ1‑RQ4) using 1.5B and 7B model scales, covering math (MATH, GSM8K), code (MBPP), and logical reasoning (LiveBench‑Reasoning).
RQ1: Comparison with baselines
All SABER modes outperform the base model. FastThink reduces inference length by >70% while improving accuracy; CoreThink further raises overall accuracy; DeepThink retains most reasoning steps, achieves the highest accuracy, and still compresses length.
RQ2: Cross‑scale and cross‑domain generalization
On a 7B model, FastThink cuts length by >80% with minimal accuracy loss, and DeepThink adds modest accuracy gains. Despite training only on math and code data, the mode‑switching mechanism transfers to unseen logical reasoning tasks, demonstrating strong generalization.
RQ3: Ablation study
Removing budget downgrade harms short‑reasoning learning; dropping NoThink data degrades the no‑think mode without benefiting other modes; eliminating accuracy‑based filtering introduces noisy supervision and destabilizes training.
RQ4: Behavior analysis of reasoning modes
Examples from MATH‑500 show that FastThink provides only essential steps, CoreThink adds reflective explanations, and DeepThink includes post‑answer self‑verification and summarization, illustrating progressively deeper reasoning styles.
Conclusion
SABER demonstrates that a switchable, budget‑aware training paradigm can endow LLMs with efficient, controllable reasoning without extra supervised fine‑tuning. Experiments confirm high accuracy, smooth degradation under tighter budgets, and cross‑task generalization, making SABER a promising direction for cost‑effective large‑model inference.
https://arxiv.org/abs/2508.10026