How VSRM Cuts Redundant Reasoning Steps in Large Language Models
The paper introduces VSRM, a verifiable step‑reward mechanism that rewards effective reasoning steps and penalizes ineffective ones during reinforcement‑learning fine‑tuning, dramatically shortening output length at inference while preserving or even improving performance across multiple benchmarks and reinforcement‑learning algorithms.
1 Background
Large reasoning models such as DeepSeek‑R1 achieve strong reasoning ability through straightforward reinforcement‑learning post‑training, but they tend to generate redundant output, which increases latency and computational cost. To address this, the Meituan Search Platform algorithm team proposes the Verifiable Step‑Reward Mechanism (VSRM), which rewards effective reasoning steps and penalizes ineffective ones, enabling efficient inference without sacrificing performance.
2 The Essence of Overthinking
Prior work observed that models often produce multiple divergent answers to simple problems, a behavior known as overthinking. A case study on MATH‑500 shows a model repeatedly reconsidering a trivial sub‑question (e.g., counting the integers in [-500, 0], of which there are 501), hopping between correct and incorrect intermediate conclusions and ultimately arriving at a wrong final answer. Such ineffective steps are frequent and constitute the root cause of overthinking.
3 Designing Verifiable Stepwise Rewards
3.1 Step Segmentation
The first task is to locate the boundaries between reasoning steps. In chain‑of‑thought (CoT) generation, transition words such as "However", "Therefore", "So", "But", and "Wait" often signal the end of one step and the start of the next. Additional rules keep the segmentation sensible: skip split candidates inside the initial re‑statement of the problem, enforce a minimum distance between split points, and move a split point to the beginning of its sentence when the transition word appears mid‑sentence, as in the sketch below.
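As a concrete illustration, here is a minimal segmentation sketch. The marker list, `min_gap`, `skip_prefix`, and the sentence‑boundary heuristic are assumptions for illustration, not the paper's exact rules.

```python
import re

# Transition words that often signal the end of one step and the start of
# the next (illustrative list; the paper's exact marker set may differ).
STEP_MARKERS = ("However", "Therefore", "So", "But", "Wait")

def split_steps(cot: str, min_gap: int = 200, skip_prefix: int = 100) -> list[str]:
    """Split a chain-of-thought string into reasoning steps.

    Assumed rules, following the description above:
      - ignore candidates inside the initial re-statement of the problem
        (the first `skip_prefix` characters);
      - keep at least `min_gap` characters between split points;
      - if a marker appears mid-sentence, snap the split back to the
        start of that sentence.
    """
    pattern = re.compile(r"\b(?:" + "|".join(STEP_MARKERS) + r")\b")
    split_points, last = [], 0
    for m in pattern.finditer(cot):
        pos = m.start()
        if pos < skip_prefix or pos - last < min_gap:
            continue  # too early, or too close to the previous split
        idx = cot.rfind(". ", 0, pos)   # last sentence boundary before marker
        if idx != -1 and idx + 2 > last:
            pos = idx + 2               # place the split at sentence start
        split_points.append(pos)
        last = pos
    bounds = [0] + split_points + [len(cot)]
    return [cot[a:b] for a, b in zip(bounds, bounds[1:])]
```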
3.2 Reward Allocation
A step's effectiveness is measured as the change in answer accuracy it produces, which can be computed verifiably. Inserting a </think> token before each split point truncates the trace into a sub‑trajectory; the model then generates multiple candidate answers from each sub‑trajectory, and their average correctness reflects that prefix's quality. The difference in correctness between adjacent sub‑trajectories is the step‑level reward.
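A sketch of how this reward computation might look. `generate_answers` and `is_correct` are hypothetical helpers standing in for the model's sampler and the rule‑based answer verifier; the sample count is illustrative.

```python
def subtrajectory_accuracy(prefix: str, question: str, gold: str,
                           n_samples: int = 8) -> float:
    """Force an answer from a truncated CoT by appending </think>, then
    score the sampled answers against the verifiable gold answer."""
    prompt = question + prefix + "</think>"
    answers = generate_answers(prompt, n=n_samples)  # hypothetical sampler
    return sum(is_correct(a, gold) for a in answers) / n_samples

def step_rewards(steps: list[str], question: str, gold: str) -> list[float]:
    """Reward for step i = accuracy after step i minus accuracy before it."""
    accs = [subtrajectory_accuracy("", question, gold)]  # baseline: no reasoning
    prefix = ""
    for step in steps:
        prefix += step
        accs.append(subtrajectory_accuracy(prefix, question, gold))
    return [accs[i + 1] - accs[i] for i in range(len(steps))]
```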
To avoid sparse rewards when several consecutive steps yield no accuracy change, a forward‑looking window propagates future accuracy changes back to the current step via a discount factor, ensuring dense reward signals.
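A minimal sketch of such a forward‑looking window, assuming a zero‑change step borrows the next nonzero accuracy change ahead of it, discounted by distance; the window size and discount factor are illustrative.

```python
def densify(rewards: list[float], window: int = 4, gamma: float = 0.9) -> list[float]:
    """Propagate the next future accuracy change back to steps whose raw
    reward is zero, so plateaus still receive a learning signal."""
    dense = list(rewards)
    for i, r in enumerate(rewards):
        if r != 0.0:
            continue
        for j in range(i + 1, min(i + 1 + window, len(rewards))):
            if rewards[j] != 0.0:               # next nonzero change ahead
                dense[i] = gamma ** (j - i) * rewards[j]
                break
    return dense
```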
Unlike simple length penalties, VSRM directly supplies clear reward signals that guide the model toward steps that improve final accuracy, mitigating overthinking while preserving performance.
VSRM is decoupled from specific reinforcement‑learning algorithms and can be seamlessly integrated by adding step rewards to the final reward tensor alongside standard binary and format rewards.
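For instance, in a PPO‑style trainer that operates on a token‑level reward tensor, the combination might look like the following sketch; the function and argument names are assumptions, not the paper's implementation.

```python
import torch

def total_rewards(binary_reward: float, format_reward: float,
                  step_rewards: list[float], step_end_idx: list[int],
                  seq_len: int) -> torch.Tensor:
    """Token-level reward tensor: outcome rewards land on the final token
    (as in common RLHF setups); each VSRM step reward lands on the last
    token of its step."""
    r = torch.zeros(seq_len)
    r[-1] = binary_reward + format_reward
    for reward, idx in zip(step_rewards, step_end_idx):
        r[idx] += reward
    return r
```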
4 Experiments
VSRM was evaluated on common mathematical benchmarks across three base models and two RL algorithms, and compared against recent related methods. The results show that VSRM reduces output length while maintaining strong performance, striking a good balance between efficiency and accuracy.
Ablation studies confirm the effectiveness of the forward‑looking window and demonstrate that an additional explicit length penalty does not further benefit VSRM.
On harder benchmarks, Pass@k continues to improve as k grows, indicating that the models still explore diverse candidate solutions. VSRM‑PPO models retain this trend, showing that compressing output length does not sacrifice the model's ability to explore viable answers.
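For reference, Pass@k is typically computed with the standard unbiased estimator over n sampled generations, c of which are correct; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k samples drawn
    without replacement from n generations (c correct) is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```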
5 Conclusion
Extensive comparative experiments demonstrate that verifiable stepwise rewards consistently preserve performance across different RL algorithms and base models while substantially alleviating overthinking. Ablation and further analyses confirm that VSRM effectively suppresses ineffective steps and promotes useful ones, offering a fundamental solution to overthinking and maintaining robust reasoning behavior.
Meituan Technology Team
