When Should Large Language Models Think? 10 Cutting‑Edge Strategies to Boost Reasoning Efficiency

This article reviews ten recent papers that tackle the over‑thinking problem in large language models by shortening chain‑of‑thought reasoning, introducing dynamic early‑exit, adaptive thinking triggers, and reinforcement‑learning‑based training, showing how models can maintain or improve accuracy while dramatically reducing token usage and latency.

Large language models have achieved impressive results on complex reasoning tasks, but overly long chain‑of‑thought (CoT) reasoning often leads to unnecessary computation, higher latency, and even reduced accuracy—a phenomenon termed “over‑thinking.” Recent research focuses on enabling models to think only when necessary, shortening reasoning chains, or dynamically deciding whether to invoke CoT.

TL;DR

Systematically surveys ten papers on reducing redundant reasoning and adaptive CoT triggering.

All works report that accuracy can be preserved or improved while token overhead and inference latency drop significantly.

Technical approaches fall into three categories: (1) directly shortening reasoning chains, (2) dynamic early‑exit, and (3) adaptive decision‑making for when to think.

1. Concise Reasoning via Reinforcement Learning

https://arxiv.org/abs/2504.05185

Method

The authors propose Concise Reasoning, a two‑stage RL fine‑tuning approach that rewards models for generating shorter yet correct reasoning. They show that standard RLHF tends to inflate answer length, while a dedicated reward can encourage brevity without harming correctness.

Training & Inference

The two‑stage fine‑tuning uses a small dataset; the reward function penalizes excessive tokens and rewards correct answers. The authors find that the GRPO algorithm can become unstable under such length‑sensitive rewards, so careful reward design is required.
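
As a concrete illustration, here is a minimal sketch of what a brevity‑aware reward might look like. The function name, constants, and penalty shape are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of a brevity-aware RL reward in the spirit of Concise
# Reasoning. All constants and the penalty shape are illustrative
# assumptions, not the paper's exact design.

def concise_reward(is_correct: bool, num_tokens: int,
                   soft_budget: int = 512, weight: float = 0.5) -> float:
    """Score an answer: correctness dominates; extra tokens beyond a
    soft budget erode (but never flip) a correct answer's reward."""
    if not is_correct:
        return -1.0  # brevity never rescues a wrong answer
    overflow = max(0, num_tokens - soft_budget)
    penalty = min(weight * overflow / soft_budget, 0.9)
    return 1.0 - penalty

if __name__ == "__main__":
    print(concise_reward(True, 300))    # 1.0  (short and correct)
    print(concise_reward(True, 1536))   # ~0.1 (correct but verbose)
    print(concise_reward(False, 100))   # -1.0 (wrong)
```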

Experiments

On several math and logic benchmarks, the fine‑tuned model reduces chain length dramatically while maintaining or slightly improving accuracy, challenging the assumption that longer chains always yield better results.

Innovation & Limitations

Key contribution is exposing the length bias in RLHF and offering a practical two‑stage RL solution. Limitation: additional RL fine‑tuning requires extra compute and a small validation set.

2. Dynamic Early Exit in Reasoning Models (DEER)

https://arxiv.org/abs/2504.15895

Method

DEER monitors confidence signals (e.g., special "Wait" tokens) during generation and aborts the chain when the model is sufficiently confident, eliminating unnecessary steps.

Training & Inference

No extra training is needed; DEER is a pure inference‑time policy that evaluates intermediate confidence and triggers early stopping.
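
A rough sketch of the inference‑time logic, assuming the chain arrives as a token stream and that a `trial_confidence` callback stands in for drafting a trial answer and reading off the model's confidence (both names are hypothetical stand‑ins for model internals):

```python
from typing import Callable, Iterable, List

# Tokens DEER would treat as "reasoning transition" points; the exact
# set is model-specific and these are examples only.
TRANSITION_TOKENS = {"Wait", "Alternatively", "Hmm"}

def deer_early_exit(tokens: Iterable[str],
                    trial_confidence: Callable[[List[str]], float],
                    threshold: float = 0.95) -> List[str]:
    """Truncate a chain-of-thought at the first transition point where
    a trial answer drafted from the partial chain is already confident."""
    chain: List[str] = []
    for tok in tokens:
        if tok in TRANSITION_TOKENS and trial_confidence(chain) >= threshold:
            break  # confident enough: skip the remaining reasoning
        chain.append(tok)
    return chain

if __name__ == "__main__":
    stream = ["step-1", "Wait", "step-2", "Wait", "step-3"]
    # Toy stand-in: confidence grows with how much has been reasoned.
    conf = lambda chain: len(chain) / 3
    print(deer_early_exit(stream, conf))  # stops at the second "Wait"
```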

Experiments

Evaluated on ten reasoning benchmarks (GSM8K, MATH‑500, AMC, AIME, LiveCodeBench, etc.) across 11 state‑of‑the‑art models, DEER shortens reasoning chains by 19%–80% depending on the model and benchmark, while modestly improving accuracy (0.3%–5%).

Innovation & Limitations

Strength lies in being training‑free and broadly applicable. However, it relies on clear confidence cues; tasks lacking such signals may see limited benefit, and premature stopping could miss corrective steps.

3. Reasoning Models Can Be Effective Without Thinking (NoThinking)

https://arxiv.org/abs/2504.09858

Method

The paper introduces a simple prompting trick called NoThinking that skips the CoT entirely and directly outputs the answer. Parallel sampling (multiple direct answers) plus a result‑aggregation step further boosts reliability.

Training & Inference

No additional training; the approach relies on prompt engineering and optional parallel generation.
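
In code, the whole recipe is a prompt prefix plus an aggregator over parallel samples. The sketch below is a hedged illustration: `sample_answer` stands in for one model call whose thinking block has been prefilled (the paper reportedly prefills a dummy "finished thinking" phrase), and majority voting is the simplest possible aggregation step.

```python
import random
from collections import Counter
from typing import Callable

def nothinking_vote(sample_answer: Callable[[], str], n: int = 8) -> str:
    """Draw n direct (no-CoT) answers in parallel and majority-vote.
    `sample_answer` is a hypothetical stand-in for one model call
    under a NoThinking prompt; a verifier-based aggregator could
    replace the vote on tasks with checkable answers."""
    answers = [sample_answer() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    # Toy sampler: the right answer 60% of the time, noise otherwise.
    toy = lambda: "42" if random.random() < 0.6 else random.choice(["41", "43"])
    print(nothinking_vote(toy, n=16))
```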

Experiments

On seven challenging reasoning datasets, NoThinking under a fixed token budget outperforms traditional CoT on several tasks (e.g., 51.3 vs. 28.9 on AMC 2023) and scales well with parallel sampling.

Innovation & Limitations

Shows that for many problems, explicit reasoning is unnecessary. Limitation: lacks verification for open‑ended or creative tasks where correctness cannot be automatically checked.

4. ShorterBetter: Guiding Models to Find Optimal Inference Length

https://arxiv.org/abs/2504.21370

Method

Introduces the Sample Optimal Length (SOL) metric: for each question, the shortest correct answer among multiple samples defines the target length. RL rewards models for matching SOL, encouraging self‑regulated pruning of redundant steps.

Training & Inference

Unsupervised RL in which the model samples multiple solutions per problem, computes the SOL, and receives length‑matching rewards. Applied to DeepSeek‑R1‑Distill‑Qwen models (1.5B and 7B) without architectural changes.
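
The SOL computation itself is simple; the sketch below shows one plausible reading of it. The fallback for all‑incorrect groups and the reward weight are illustrative choices, not taken from the paper.

```python
from typing import List, Tuple

Sample = Tuple[bool, int]  # (is_correct, num_tokens)

def sample_optimal_length(group: List[Sample]) -> int:
    """SOL for one question: shortest length among correct samples.
    The fallback for an all-incorrect group (mean length) is a guess;
    the paper's handling of that case may differ."""
    correct_lengths = [n for ok, n in group if ok]
    if correct_lengths:
        return min(correct_lengths)
    return sum(n for _, n in group) // len(group)

def sol_reward(is_correct: bool, num_tokens: int, sol: int,
               alpha: float = 1.0) -> float:
    """Reward correctness minus deviation from SOL (alpha is illustrative)."""
    base = 1.0 if is_correct else 0.0
    return base - alpha * abs(num_tokens - sol) / max(sol, 1)

if __name__ == "__main__":
    group = [(True, 220), (True, 150), (False, 90), (True, 400)]
    sol = sample_optimal_length(group)      # -> 150
    print(sol, sol_reward(True, 160, sol))  # small deviation, small penalty
```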

Experiments

On math reasoning tasks, output length drops 50%–80% while accuracy stays stable, even on out‑of‑domain data.

Innovation & Limitations

Key novelty is the SOL reward, which is smooth and task‑adaptive. Requires multiple sampling per training example and works best on tasks with clear correctness criteria.

5. Think Only When You Need with Large Hybrid‑Reasoning Models (LHRM)

https://arxiv.org/abs/2505.14631

Method

LHRM combines a direct‑answer mode and a deep‑thinking mode. A two‑stage training pipeline—Hybrid Fine‑Tuning (HFT) followed by Hybrid Group Policy Optimization (HGPO)—teaches the model to select the appropriate mode based on input difficulty.

Training & Inference

HFT uses supervised data labeled simple vs. complex. HGPO is an RL algorithm that jointly optimizes the “mode‑selection” policy and the answer‑generation policy.
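
A toy flavor of the HGPO idea: when the direct mode also answers correctly, thinking should cost something; when only the thinking mode succeeds, it earns its keep. The cost constant and the group‑comparison shortcut below are assumptions, not the paper's exact objective.

```python
def hybrid_reward(mode: str, is_correct: bool,
                  direct_mode_suffices: bool,
                  think_cost: float = 0.3) -> float:
    """Toy HGPO-flavored reward. `direct_mode_suffices` summarizes a
    within-group comparison: did the direct-answer samples for this
    question also get it right? Constants are illustrative."""
    if not is_correct:
        return 0.0
    if mode == "think" and direct_mode_suffices:
        return 1.0 - think_cost  # correct, but thinking was unnecessary
    return 1.0

if __name__ == "__main__":
    print(hybrid_reward("think", True, direct_mode_suffices=True))   # 0.7
    print(hybrid_reward("think", True, direct_mode_suffices=False))  # 1.0
    print(hybrid_reward("direct", True, direct_mode_suffices=True))  # 1.0
```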

Experiments

Across diverse benchmarks, LHRM matches or exceeds the accuracy of always‑thinking models while cutting inference cost dramatically.

Innovation & Limitations

Introduces the “Hybrid Accuracy” metric and demonstrates effective mode‑switching. Requires curated simple/complex datasets and may misclassify difficulty in safety‑critical scenarios.

6. Thinkless: Learning When to Think

https://arxiv.org/abs/2505.13379

Method

Thinkless adds two control tokens, <short> and <think>, to the vocabulary. During RL fine‑tuning with the Decoupled GRPO (DeGRPO) algorithm, the model learns when to emit each token, selecting either a concise answer or a full CoT.

Training & Inference

Two‑stage training: a warm‑up phase to expose the tokens, followed by RL where rewards balance token‑efficiency and answer correctness.
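
The decoupling in DeGRPO can be pictured as two separately weighted policy‑gradient terms: one for the single control token and one averaged over the many response tokens, so the mode decision is not drowned out by response length. The weights below are illustrative assumptions.

```python
from typing import List

def degrpo_loss(control_logprob: float, response_logprobs: List[float],
                advantage: float, w_ctrl: float = 1.0,
                w_resp: float = 0.1) -> float:
    """Illustrative decoupled policy-gradient loss: the <short>/<think>
    control token and the response tokens receive separate weights so
    one decision token is not swamped by hundreds of response tokens.
    Weights are assumptions, not the paper's values."""
    mean_resp = sum(response_logprobs) / len(response_logprobs)
    return -advantage * (w_ctrl * control_logprob + w_resp * mean_resp)

if __name__ == "__main__":
    # A positive-advantage sample: raising both the control token's and
    # the response's log-probabilities lowers the loss.
    print(degrpo_loss(-0.7, [-0.2, -0.1, -0.3], advantage=1.0))
```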

Experiments

On math benchmarks, Thinkless reduces long‑chain usage by 50%–90% and often improves accuracy because it avoids over‑thinking errors.

Innovation & Limitations

Explicit control tokens make mode selection transparent. However, the approach requires training data spanning both simple and complex examples, plus a vocabulary extension (new token embeddings) so the model can recognize the control tokens.

7. ThinkPrune: Pruning Long Chains via RL

https://arxiv.org/abs/2504.01296

Method

ThinkPrune imposes a token‑budget during RL training; any output exceeding the budget receives zero reward, forcing the model to compress its reasoning.

Training & Inference

Iterative length‑constraint tightening across multiple RL stages allows the model to gradually adapt without catastrophic performance loss.
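
The budget‑clipped reward and the tightening schedule are straightforward to sketch; the budget values below are placeholders rather than the paper's actual schedule.

```python
def thinkprune_reward(is_correct: bool, num_tokens: int, budget: int) -> float:
    """Budget-clipped reward as described: anything over the current
    token budget earns zero, otherwise reward equals correctness."""
    if num_tokens > budget:
        return 0.0
    return 1.0 if is_correct else 0.0

# Iterative tightening (placeholder values): each RL round shrinks the
# budget so the model compresses its reasoning gradually.
for budget in (4096, 3072, 2048):
    pass  # run one RL round scoring rollouts with thinkprune_reward(..., budget)
```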

Experiments

On AIME 2024, reasoning length halves while accuracy drops only ~2%. Similar trade‑offs are observed on other math benchmarks.

Innovation & Limitations

Integrates length constraints directly into the RL objective, yielding an end‑to‑end pruning mechanism. Requires clear correctness signals and multiple training rounds, increasing compute cost.

8. AdaCoT: Adaptive CoT Triggering via RL

https://arxiv.org/abs/2505.11896

Method

AdaCoT treats the decision to invoke CoT as a Pareto‑optimal multi‑objective problem (accuracy vs. cost). PPO‑based RL adjusts a penalty term for CoT usage, while Selective Loss Masking (SLM) prevents the policy from collapsing to always‑on or always‑off.

Training & Inference

Multi‑stage training gradually tightens the CoT penalty. At inference, the model computes an internal complexity score and triggers CoT only when the score exceeds a learned threshold.
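
One way to read the multi‑objective setup is a correctness reward minus a tunable price on invoking CoT; sweeping the price traces points on the accuracy/cost Pareto front. The sketch below uses assumed constants and omits SLM, which the paper applies to the decision token's loss in certain stages.

```python
def adacot_reward(is_correct: bool, used_cot: bool,
                  cot_price: float) -> float:
    """Illustrative AdaCoT-style reward: correctness minus a tunable
    price on triggering CoT. Sweeping `cot_price` across training
    stages traces different points on the accuracy/cost frontier."""
    reward = 1.0 if is_correct else 0.0
    return reward - (cot_price if used_cot else 0.0)

# Multi-stage schedule (placeholder values): later stages make CoT
# more expensive, pushing the trigger rate down on easy traffic.
for cot_price in (0.0, 0.1, 0.3):
    pass  # one PPO stage with adacot_reward(..., cot_price)
```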

Experiments

On a production traffic test set, the CoT trigger rate falls to 3.18% and average response length shrinks by 69%, yet performance on difficult queries remains on par with always‑on CoT baselines.

Innovation & Limitations

Provides a theoretically grounded RL formulation for adaptive CoT. Lack of open‑source code and dependence on large‑scale interaction data limit reproducibility.

9. Learning When to Think (AutoThink)

https://arxiv.org/abs/2505.10832

Method

AutoThink discovers that inserting an ellipsis "..." into prompts randomly toggles the model between thinking and short‑answer modes. It then uses multi‑stage RL with stage‑wise reward shaping to systematically teach the model to make this decision based on problem difficulty.

Training & Inference

Two‑phase RL: initial warm‑up with the ellipsis cue, followed by progressive reward tightening that favors concise answers on easy tasks and full CoT on hard ones.
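
A toy rendering of the stage‑wise shaping: an early stage keeps both modes alive by penalizing mode collapse within a batch, and later stages add a small bonus for solving without the chain. All constants and the collapse penalty are assumptions, not the paper's formulation.

```python
def autothink_reward(stage: int, is_correct: bool, used_thinking: bool,
                     batch_think_frac: float) -> float:
    """Toy stage-wise shaped reward. Early stage: keep the batch's
    think/no-think ratio away from 0 or 1 so neither mode dies.
    Later stages: small bonus for correct answers without the chain.
    Constants are illustrative."""
    reward = 1.0 if is_correct else 0.0
    if stage == 1:
        reward -= abs(batch_think_frac - 0.5)  # anti-collapse shaping
    elif is_correct and not used_thinking:
        reward += 0.2  # brevity bonus once both modes are stable
    return reward

if __name__ == "__main__":
    print(autothink_reward(1, True, True, batch_think_frac=0.95))  # 0.55
    print(autothink_reward(2, True, False, batch_think_frac=0.5))  # ~1.2
```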

Experiments

On five math benchmarks, AutoThink improves accuracy by up to 6.4% while cutting token usage by 52%.

Innovation & Limitations

Leverages an emergent “ellipsis trigger” without adding new tokens. Evaluation is limited to math; effectiveness on other domains remains open.

10. AdaptThink: When to Think

https://arxiv.org/abs/2505.13417

Method

AdaptThink formulates a constrained RL objective that rewards NoThinking (direct answer) while penalizing accuracy loss. Importance sampling ensures balanced exposure to both thinking and non‑thinking trajectories during on‑policy training.

Training & Inference

No extra control tokens are introduced; the learned policy is embedded in the model weights and implicitly decides the mode at inference time.
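
A rough surrogate for the constrained objective: score correctness relative to a reference policy's accuracy on the same problem, plus a small bonus for choosing NoThinking, so the direct mode wins whenever it does not hurt. The margin `delta` and this exact form are assumptions, not the paper's loss.

```python
def adaptthink_advantage(is_correct: bool, chose_nothinking: bool,
                         ref_accuracy: float, delta: float = 0.05) -> float:
    """Illustrative surrogate for AdaptThink's constrained objective:
    correctness measured against the reference policy's accuracy on
    the same problem, plus a small NoThinking bonus. During training,
    importance sampling keeps both modes represented in each batch."""
    correctness_gap = (1.0 if is_correct else 0.0) - ref_accuracy
    bonus = delta if chose_nothinking else 0.0
    return correctness_gap + bonus

if __name__ == "__main__":
    # On an easy problem (reference already at 0.9), a correct direct
    # answer narrowly beats a correct thinking answer.
    print(adaptthink_advantage(True, True, ref_accuracy=0.9))   # ~0.15
    print(adaptthink_advantage(True, False, ref_accuracy=0.9))  # ~0.10
```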

Experiments

On three math datasets, response length drops 53% and accuracy rises 2.4%, confirming that eliminating unnecessary reasoning can even boost performance.

Innovation & Limitations

Explicitly encodes the trade‑off between speed and accuracy in the RL loss and uses importance sampling to avoid mode collapse. Still focused on tasks with clear correctness criteria and may need task‑specific difficulty calibration.

Conclusion

All ten works demonstrate that longer reasoning chains are not inherently superior. By applying reinforcement learning, dynamic early‑exit, adaptive prompting, or simple inference‑time heuristics, models can learn to “think fast, think slow” as needed, achieving substantial token‑ and latency‑savings without sacrificing—and sometimes even improving—accuracy.

Tags: AI research, model pruning, adaptive inference, chain-of-thought, reasoning efficiency
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.