DECS Cuts Overthinking in Models: Halve Inference Tokens and Raise Accuracy

DECS, a novel training framework introduced by researchers from Fudan, Shanghai Jiao Tong, and the Shanghai AI Lab, theoretically exposes the flaws of length‑penalty rewards and, through token‑level reward decoupling and dynamic batch scheduling, reduces inference token counts by over 50% while improving accuracy across multiple benchmarks.

Machine Heart

May 12, 2026

Overthinking in Large Language Models

Recent chain‑of‑thought models such as DeepSeek‑R1 and OpenAI GPT‑Thinking achieve strong performance on complex reasoning tasks by generating thousands of tokens, but they often exhibit overthinking: after arriving at the correct answer, the model keeps emitting self‑correction phrases such as "wait...", "let me check...", and "alternatively...", which wastes a large amount of computation.

Why Length‑Penalty Rewards Fail

The authors analyze the common practice of adding a sequence‑length penalty to reinforcement‑learning objectives (e.g., GRPO). Two critical defects emerge:

Undifferentiated attack on high‑entropy exploration tokens. Tokens such as "wait" or "however" are high‑entropy and essential for exploring the reasoning space, yet the length penalty applies a uniform negative gradient to every token in a long reasoning chain, eventually suppressing exploration and causing premature convergence to sub‑optimal strategies.

Implicit reward for redundant tokens. The paper introduces the concept of the Necessary Reasoning Prefix (NRP), the shortest token sequence that yields the correct answer. Tokens beyond the NRP are redundant, but when the overall sequence is still relatively short, existing sequence‑level rewards may assign positive credit to these extra tokens as well, distorting optimization and preventing the model from learning to stop.
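Both defects can be seen in a toy calculation. The following is a minimal sketch, not the paper's formulation: the reward shape (correctness minus `alpha` times length), the value of `alpha`, and the three sampled responses are all assumptions made for illustration. It uses GRPO‑style normalization within a group and then broadcasts each response's advantage to all of its tokens, which is what a sequence‑level reward implies.

```python
from statistics import mean, pstdev

def sequence_reward(correct: bool, length: int, alpha: float = 0.001) -> float:
    """Hypothetical sequence-level reward: correctness minus a length penalty."""
    return (1.0 if correct else 0.0) - alpha * length

def group_advantages(rewards):
    """GRPO-style normalization: subtract the group mean, divide by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Illustrative group of sampled responses to one question (made-up numbers).
# nrp_len is the length of the shortest prefix that already yields the answer.
group = [
    {"correct": True, "length": 400, "nrp_len": 300},    # short, 100 redundant tokens
    {"correct": True, "length": 1200, "nrp_len": 300},   # long, 900 redundant tokens
    {"correct": False, "length": 900, "nrp_len": None},  # wrong answer, no NRP
]

rewards = [sequence_reward(g["correct"], g["length"]) for g in group]
advantages = group_advantages(rewards)

# A sequence-level signal gives every token of a response the same advantage.
per_token = [[a] * g["length"] for a, g in zip(advantages, group)]

# Redundant tokens of the short response still receive positive credit.
redundant_credit = per_token[0][300:]
# Necessary tokens of the long response are penalized with the rest.
necessary_credit = per_token[1][:300]

print("advantages:", [round(a, 3) for a in advantages])
print("redundant tokens rewarded:", all(x > 0 for x in redundant_credit))
print("necessary tokens penalized:", all(x < 0 for x in necessary_credit))
```

In this toy group, the 100 redundant tokens of the short correct response are reinforced, while the 300 necessary tokens of the long correct response are pushed down together with its redundant tail.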

DECS: Decoupling Rewards to Eliminate Redundancy

To address the two defects, DECS reconstructs training in two dimensions:

Step 1 – Token‑level reward decoupling. A lightweight NRP detector (judge model) identifies the boundary between necessary and redundant tokens. Tokens inside the NRP receive no penalty, while every token after the NRP is assigned a constant negative reward, ensuring that only truly redundant reasoning is penalized.
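The decoupling idea in Step 1 can be sketched as follows. This is a simplified illustration under stated assumptions: it works on reasoning steps rather than individual tokens, the judge is a toy string check standing in for the learned NRP detector, and the function names and reward values (`1.0` and `-0.5`) are hypothetical rather than taken from the paper.

```python
from typing import Callable, List, Optional

def find_nrp_end(steps: List[str],
                 reaches_answer: Callable[[str], bool]) -> Optional[int]:
    """Return how many leading steps form the shortest prefix that reaches
    the answer, or None if no prefix does (hypothetical helper)."""
    for i in range(1, len(steps) + 1):
        if reaches_answer(" ".join(steps[:i])):
            return i
    return None

def decoupled_rewards(num_steps: int,
                      nrp_end: Optional[int],
                      necessary_reward: float = 1.0,
                      redundant_reward: float = -0.5) -> List[float]:
    """Assign one reward per step: no penalty inside the NRP, a constant
    negative reward after it. Reward values are illustrative assumptions."""
    if nrp_end is None:
        # Assumption: with no NRP found, nothing is marked as redundant.
        return [0.0] * num_steps
    if not 0 <= nrp_end <= num_steps:
        raise ValueError("nrp_end must lie within the response")
    return [necessary_reward] * nrp_end + [redundant_reward] * (num_steps - nrp_end)

steps = [
    "2 + 3 = 5.",
    "So the answer is 5.",
    "Wait, let me check.",
    "Yes, 5.",
]

def toy_judge(prefix: str) -> bool:
    """Stand-in for the NRP detector: a plain substring check."""
    return "answer is 5" in prefix

nrp_end = find_nrp_end(steps, toy_judge)
rewards = decoupled_rewards(len(steps), nrp_end)
print(nrp_end, rewards)
```

Here the first two steps form the NRP and keep their full reward, and only the two trailing self‑check steps receive the negative signal, in contrast to a sequence‑level penalty that would touch all four.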

Step 2 – Curriculum‑style batch scheduling. Early in training, the penalty signal may inadvertently suppress high‑entropy exploratory tokens. DECS dynamically adjusts the proportion of easy (simple) questions in each batch: when the average NRP ratio is low (many redundant tokens), fewer simple questions are presented; as redundancy diminishes, the share of simple questions is gradually increased. This buffering mechanism preserves exploration while compressing reasoning.
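The scheduling rule in Step 2 could look roughly like the sketch below. The article only states the direction of the adjustment (a lower average NRP ratio means fewer easy questions), so the linear mapping, the bounds `min_frac` and `max_frac`, and the function names are assumptions chosen for illustration, not the authors' schedule.

```python
import random
from typing import List

def easy_fraction(avg_nrp_ratio: float,
                  min_frac: float = 0.1,
                  max_frac: float = 0.5) -> float:
    """Map the average NRP ratio (NRP length / response length) to the share
    of easy questions in a batch. Linear interpolation is an assumption."""
    r = min(max(avg_nrp_ratio, 0.0), 1.0)  # clamp to [0, 1]
    return min_frac + (max_frac - min_frac) * r

def build_batch(easy: List[str], hard: List[str], batch_size: int,
                avg_nrp_ratio: float, rng: random.Random) -> List[str]:
    """Sample a batch whose easy/hard mix follows the current NRP ratio."""
    n_easy = min(round(batch_size * easy_fraction(avg_nrp_ratio)), len(easy))
    n_hard = min(batch_size - n_easy, len(hard))
    return rng.sample(easy, n_easy) + rng.sample(hard, n_hard)

easy_pool = [f"easy-{i}" for i in range(20)]
hard_pool = [f"hard-{i}" for i in range(20)]
rng = random.Random(0)

# Early training: responses are mostly redundant, so the NRP ratio is low.
early_batch = build_batch(easy_pool, hard_pool, 10, avg_nrp_ratio=0.0, rng=rng)
# Later training: redundancy has shrunk, so more easy questions are admitted.
late_batch = build_batch(easy_pool, hard_pool, 10, avg_nrp_ratio=1.0, rng=rng)

count_easy = lambda batch: sum(q.startswith("easy") for q in batch)
print(count_easy(early_batch), count_easy(late_batch))
```

In practice `avg_nrp_ratio` would be recomputed from recent rollouts at each training step, so the batch mix tracks how much redundancy the model still produces.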

Experimental Validation

The authors evaluate DECS on three base models—DeepSeek‑R1‑Distill‑1.5B, 7B, and Qwen3‑4B—across seven benchmarks (AIME2024/2025, MATH500, GPQA‑Diamond, LiveCodeBench‑v6, etc.). Results show:

1.5B model: average inference token count reduced by 57.17% and Pass@1 accuracy increased by 2.48 percentage points.

7B model: token reduction of 49.50% with an accuracy gain of 0.8 percentage points.

Compared with baselines ThinkPrune, TLMRE, and LC‑R1, DECS leads the efficiency‑performance AES score by 0.12 and 0.14 respectively.

Cross‑Domain Generalization

Although the NRP detector is trained only on mathematical data, DECS achieves substantial token reductions on scientific reasoning (GPQA‑Diamond, 56.33%) and programming tasks (LiveCodeBench‑v6, 33.52%), suggesting that overthinking is a systematic, cross‑domain phenomenon.

Ablation Study

Removing curriculum scheduling causes notable performance degradation, confirming that the dynamic batch scheme mitigates the exploration‑suppression issue. Omitting reward decoupling leaves about 25% redundant tokens, showing that sequence‑level penalties alone cannot eliminate all redundancy.

Implications

The DECS results suggest that the primary bottleneck for efficient inference is not model capacity but the design of the training objective. By returning to the fundamentals of reward‑function engineering, DECS offers an open‑source solution that roughly halves inference token counts while maintaining, and in the reported experiments improving, accuracy.
