Can Adaptive Chain‑of‑Thought Learning Halve LLM Thinking Time?
The article introduces the Think When You Need (TWYN) method, a reinforcement‑learning approach that dynamically adapts chain‑of‑thought length, dramatically cuts redundant token generation in large language models, and maintains or improves accuracy across diverse reasoning benchmarks.
Deep‑thinking models improve reasoning ability through test‑time scaling, but they often generate large amounts of redundant, ineffective reasoning.
Paper title: Think When You Need: Self‑Adaptive Chain‑of‑Thought Learning
Paper link: https://arxiv.org/abs/2504.03234
Code link: https://github.com/lefttt/TWYN
Large models such as o3‑high can spend minutes and millions of tokens on a single problem, inflating inference cost without improving results. Existing solutions rely on a fixed length penalty, which requires careful tuning and does not work well for open‑ended tasks.
Think When You Need (TWYN) Method
TWYN trains models with a pairwise reward mechanism built on a simple premise: for the same task, longer thinking is only justified if it yields a better result. The reward combines answer quality and token length, encouraging the model to produce concise yet correct responses without a manually tuned length penalty.
Pairwise Reward Mechanism
The core idea is to compare every pair of generated answers for the same question. For each pair, a reward is assigned based on correctness and the difference in thinking length; shorter correct answers receive an extra bonus. The final reward for an answer is the sum of its pairwise rewards.
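The pairwise scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual implementation: the specific reward values (`+1`/`-1` for correctness comparisons) and the `length_bonus` magnitude are assumptions made for the example; the paper defines its own weighting.

```python
def pairwise_rewards(samples, length_bonus=0.5):
    """Assign each sample a reward by comparing it against every other
    sample for the same question.

    samples: list of (is_correct, num_thinking_tokens) tuples, one per
             generated answer to a single question.
    Returns a list of scalar rewards, one per sample.
    """
    rewards = [0.0] * len(samples)
    for i, (ci, li) in enumerate(samples):
        for j, (cj, lj) in enumerate(samples):
            if i == j:
                continue
            # Correctness dominates: a correct answer beats an incorrect one.
            if ci and not cj:
                rewards[i] += 1.0
            elif not ci and cj:
                rewards[i] -= 1.0
            elif ci and cj:
                # Both correct: the shorter chain of thought earns a bonus,
                # the longer one is penalized by the same amount.
                if li < lj:
                    rewards[i] += length_bonus
                elif li > lj:
                    rewards[i] -= length_bonus
    return rewards
```

With three sampled answers to one question, a short correct answer outranks a long correct one, and both outrank an incorrect answer, which is the ordering the method needs to push the policy toward concise, correct reasoning:

```python
samples = [(True, 100), (True, 500), (False, 50)]
pairwise_rewards(samples)  # → [1.5, 0.5, -2.0]
```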
Broad applicability: Reduces thinking length by 47.3% on math tasks and up to 99% on open‑ended tasks while keeping accuracy stable or slightly improved.
Adaptive thinking: Simpler questions see larger reductions; a 1.5B model reduces length by 2.6% on AIME2024 but by 33% on MATH‑500.
Length correlates with model capacity: After training, larger models (7B) cut token usage by 47.3% versus 22.3% for a 1.5B model, mirroring human intuition that smarter students solve problems faster.
Experimental Results
Across multiple reasoning benchmarks (DeepScaleR, AIME 2024, MATH‑500), TWYN shortens answer length dramatically (e.g., from >6000 tokens to <4000) while maintaining or slightly improving accuracy. In open‑ended evaluation on AlpacaFarm, TWYN outperforms standard CoT‑RL in preference scores and reduces chain length to near zero, delivering faster and smoother responses.
The method requires no manual length‑penalty tuning, integrates easily with existing reward structures, and supports both verified and fuzzy tasks, offering a scalable solution for next‑generation efficient AI models.
Xiaohongshu Tech REDtech
Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.