Efficient Reasoning with Reward Shaping: Compressing Qwen 30B‑Series Chains by 20‑40%

The article analyzes how reward-shaping techniques can shorten the chain-of-thought outputs of Qwen 30B-series models by 20-40% while preserving or slightly improving performance on AIME'25 and out-of-distribution benchmarks. It also details the experimental design, strategic considerations, and practical insights behind this efficient reasoning approach.


Background

Scaling laws enable large language models (LLMs) to achieve strong performance on complex tasks, but the resulting models also produce excessively long chain-of-thought (CoT) outputs, which increase inference latency.

Reward Shaping for Efficient Reasoning

Reward shaping incorporates output length as a variable in the reward function, granting a positive reward to correct trajectories whose token count falls below a target length and a negative reward otherwise. Its advantages include low computational overhead, model-agnostic applicability, and rapid iteration on model variants simply by changing the reward function.
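
As a minimal sketch (the ±1 values and the single threshold are illustrative assumptions, not the published hyperparameters), such a length-shaped reward can be written as:

```python
def shaped_reward(is_correct: bool, num_tokens: int, target_length: int) -> float:
    """Length-shaped reward: correctness only pays off when the trajectory
    also fits within the token budget; everything else is penalized."""
    if is_correct and num_tokens <= target_length:
        return 1.0
    return -1.0
```

Because the reward depends only on correctness and token count, swapping in a different target length or penalty scheme requires no change to the training loop, which is what makes iteration cheap.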

Strategic Evaluation

Evaluating compressed models only under short token budgets is risky: a model may overfit to short outputs, limiting its gains when the token budget is increased. Original (uncompressed) models still benefit substantially from larger budgets, as observed in prior work (e.g., @ybq's "Make PostTrain Solid Again").
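
In practice this suggests sweeping both the compressed and the original model over several token budgets rather than a single short one. A hedged sketch of such a sweep, where the budget values and the `evaluate` callable are placeholders:

```python
from typing import Callable, Dict, Sequence, Tuple

def accuracy_vs_budget(
    models: Dict[str, object],
    prompts: Sequence[str],
    evaluate: Callable[[object, Sequence[str], int], float],
    budgets: Tuple[int, ...] = (4096, 8192, 16384, 32768),
) -> Dict[Tuple[str, int], float]:
    """Score every model under several max-generation budgets so that a
    compressed model's advantage is not an artifact of a tight budget."""
    scores = {}
    for name, model in models.items():
        for budget in budgets:
            # `evaluate` is assumed to sample with max_new_tokens=budget
            # and return accuracy on `prompts`.
            scores[(name, budget)] = evaluate(model, prompts, budget)
    return scores
```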

Length‑biased reward shaping shows strong cross‑difficulty and cross‑domain generalization; the bias encourages concise yet correct reasoning across varied tasks.

Training Methodology

The training uses a simple truncation strategy: a target length (TargetLength) is set, and correct trajectories shorter than the target receive a positive reward, while longer correct trajectories and all incorrect trajectories receive a negative reward. Experiments start with DeepSeek-Distill-Qwen-1.5B and extend to larger Qwen-3 models.
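
As a minimal illustration of how this truncation reward could be turned into per-trajectory advantages, the sketch below reuses `shaped_reward` from the earlier snippet and applies a group-relative (GRPO-style) normalization; the normalization and the group size are assumptions, since the post does not name the underlying RL algorithm:

```python
import numpy as np

def truncation_advantages(group, target_length):
    """Score one prompt's group of rollouts with the truncation reward,
    then normalize within the group (a group-relative baseline)."""
    rewards = np.array([
        shaped_reward(t["is_correct"], t["num_tokens"], target_length)
        for t in group
    ], dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Example: four rollouts for one prompt with TargetLength = 4096 tokens.
group = [
    {"is_correct": True,  "num_tokens": 2100},   # short and correct    -> +1
    {"is_correct": True,  "num_tokens": 6800},   # correct but too long -> -1
    {"is_correct": False, "num_tokens": 1500},   # short but wrong      -> -1
    {"is_correct": False, "num_tokens": 7200},   # long and wrong       -> -1
]
print(truncation_advantages(group, target_length=4096))
```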

Two‑Stage Paradigm

Stage 1 – Length Adaptation: Early training quickly reduces output length, but downstream performance may dip.

Stage 2 – Reasoning Refinement: Length stabilizes (or slightly rises) while task performance recovers.

The model learns the length constraint first because satisfying the length target is easier than improving accuracy.

Factor 1: Prompt Difficulty

Prompts are split into easy (pass rate ≥5/8) and hard (pass rate <5/8) subsets based on eight‑sample evaluations of the base model. Training on the hard subset causes rapid length reduction but a catastrophic drop in downstream performance, whereas training on the easy subset yields stable training and sometimes higher upper bounds. Policy entropy mirrors these trends: hard prompts produce large entropy fluctuations, easy prompts produce smooth entropy curves.
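
A small sketch of this split, assuming per-prompt correctness counts collected from eight samples of the base model (the 5/8 threshold matches the text; the data structures are illustrative):

```python
from typing import Dict, List, Tuple

def split_by_pass_rate(
    pass_counts: Dict[str, int], threshold: int = 5, num_samples: int = 8
) -> Tuple[List[str], List[str]]:
    """Split prompt IDs into easy (>= threshold correct out of num_samples)
    and hard (< threshold) subsets."""
    assert all(0 <= k <= num_samples for k in pass_counts.values())
    easy = [pid for pid, k in pass_counts.items() if k >= threshold]
    hard = [pid for pid, k in pass_counts.items() if k < threshold]
    return easy, hard
```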

Factor 2: Negative‑Sample Reward Design

Negative samples include overly long correct trajectories, short incorrect trajectories, and long incorrect trajectories. The compared reward designs either penalize these samples (assigning −1) or mask them (excluding them from the loss). Masking overly long correct trajectories improves downstream performance but increases output length. Certain designs (e.g., penalizing short-incorrect but not long-correct samples) lead to a "length trap" in which the model over-optimizes for shortness and performance collapses.
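
The design space amounts to a per-type choice between penalizing (−1) and masking (dropping the sample from the loss). The sketch below only summarizes that choice; the use of `None` to signal masking is an assumption:

```python
def negative_sample_reward(
    kind: str,
    penalize_overlong_correct: bool = True,
    penalize_short_incorrect: bool = True,
    penalize_long_incorrect: bool = True,
):
    """Return -1.0 to penalize a negative sample, or None to mask it
    (i.e., exclude it from the policy-gradient loss)."""
    choices = {
        "overlong_correct": penalize_overlong_correct,
        "short_incorrect": penalize_short_incorrect,
        "long_incorrect": penalize_long_incorrect,
    }
    return -1.0 if choices[kind] else None
```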

Factor 3: Off‑Policy Staleness

Introducing greater off-policy staleness accelerates convergence of both stages and raises the performance ceiling on DeepSeek-Distill-Qwen-1.5B, but it also increases policy entropy and instability. On the Qwen-3 series, off-policy staleness speeds up length reduction but severely degrades performance, so strict on-policy training is recommended for larger models.
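
As a rough illustration of the knob being varied here, staleness can be read as the number of optimizer updates that reuse the same batch of rollouts before fresh samples are drawn from the current policy; the parameter name below is hypothetical:

```python
# Hypothetical trainer settings; only the degree of rollout reuse differs.
strict_on_policy = {"rollout_reuse_updates": 1}   # recommended for the larger Qwen-3 models
stale_off_policy = {"rollout_reuse_updates": 4}   # faster length adaptation on the 1.5B model, but noisier
```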

Experimental Results

On the Qwen-3 series (0.6B–30B), chain-of-thought length is compressed by 20%–40% while AIME'25 scores remain unchanged or improve slightly.

On an internal OOD dataset, the Qwen-3-30B-A3B-Instruct-2507 model reduces average CoT length from 5.6k to 3.7k tokens (a 34% reduction) with a mean score drop from 25.22 to 24.90. The same compression ratio and comparable scores are observed on Qwen-3-30B-A3B-Thinking-2507 and across ten completely OOD domain test sets.

Key Observations

Models adapt to length constraints rapidly; simpler prompts yield better outcomes.

Setting the target length to the desired output length and staying on-policy preserves training stability.

Avoid the “short = good, long = bad” trap by not over‑emphasizing short correct trajectories.

Negative samples are typically longer than positive samples, providing an implicit drive toward brevity.

Future Directions

Explore more diverse training data for potential gains.

Optimize sampling length dynamically rather than using a fixed low truncation rate.

Validate findings on larger models.

Investigate tool‑based cognition to further simplify reasoning.

References

[1] https://arxiv.org/pdf/2502.03373
[2] https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd
[3] https://api-docs.deepseek.com/news/news250929
[4] https://zhuanlan.zhihu.com/p/1910075005586346955
[5] https://openai.com/zh-Hans-CN/index/introducing-gpt-oss/
[6] https://zhuanlan.zhihu.com/p/1910447407369552022
[7] https://arxiv.org/abs/2505.15612
[8] https://lilianweng.github.io/posts/2025-05-01-thinking/
[9] https://zhuanlan.zhihu.com/p/1995265459285694156
[10] https://arxiv.org/abs/2510.01161
[11] https://arxiv.org/pdf/2510.03222
[12] https://www.techbeat.net/talk-info?id=1011

Project Links

Project homepage: https://wutaiqiang.github.io/project/Art
Paper (ArXiv): https://arxiv.org/pdf/2602.20945
Weights: https://huggingface.co/collections/taki555/the-art-of-efficient-reasoning
Tags: Qwen, Reinforcement Learning, Efficient Inference, Reward Shaping