Is More Chain‑of‑Thought Always Better? Introducing E‑GRM for On‑Demand LLM Reasoning

The article critically examines the assumption that longer chain‑of‑thought reasoning always improves large language model performance, presents the E‑GRM framework that dynamically decides when to invoke full CoT based on model‑internal uncertainty, and validates its efficiency and accuracy gains through extensive experiments and ablations.


1. Introduction

Generative Reward Models (GRMs) use chain‑of‑thought (CoT) prompting so that large language models (LLMs) generate step‑by‑step reasoning when evaluating responses, an approach that has become dominant for complex reasoning tasks. However, existing GRM systems apply the same full‑CoT pipeline to every input, ignoring how much difficulty varies from one question to the next.

1.1 Why uniform CoT is suboptimal

Cost of homogenized processing: A trivial query such as "1+1" consumes the same computational resources as solving a differential equation, so compute is spent where it is not needed.

Voting mechanism limitation: Current GRM pipelines often aggregate multiple reasoning paths via voting (e.g., Self‑Consistency). Voting treats a correct answer arrived at through sound reasoning the same as one guessed correctly, so it cannot distinguish the quality of the reasoning behind an answer.
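The voting limitation can be illustrated with a toy comparison. This is a minimal sketch, not the paper's implementation: the `paths` list and its quality scores are made-up values, and `best_scored` is a hypothetical stand-in for any scorer-based selection.

```python
from collections import Counter

def majority_vote(paths):
    """Self-Consistency style voting: count answers, ignore path quality."""
    return Counter(answer for answer, _ in paths).most_common(1)[0][0]

def best_scored(paths):
    """Pick the answer whose reasoning path has the highest quality score."""
    return max(paths, key=lambda p: p[1])[0]

# Hypothetical (answer, quality_score) pairs for three reasoning paths.
paths = [("12", 0.35), ("12", 0.30), ("15", 0.95)]

voted = majority_vote(paths)   # "12": two weak paths outvote one strong path
scored = best_scored(paths)    # "15": the single high-quality path wins
```

The two selectors disagree exactly when several low-quality paths converge on one answer, which is the case voting alone cannot detect.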

2. Core Technique: Dynamic CoT Trigger

2.1 Uncertainty estimation via consensus

E‑GRM measures the agreement among M parallel decodings of the same input. If the most frequent answer appears c times, the consensus score is Consensus = c / M. Because the most frequent answer appears at least once, the score ranges from 1/M (every decoding gives a different answer) to 1 (complete agreement). High consensus indicates the model is confident and can skip full CoT.
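The consensus score can be sketched in a few lines. This assumes answers are compared by exact string match; the answer normalization used in the paper is not described here, and the function name `consensus_score` is illustrative.

```python
from collections import Counter

def consensus_score(answers):
    """Return (most_frequent_answer, c / M) for M decoded answers."""
    if not answers:
        raise ValueError("need at least one decoded answer")
    top_answer, c = Counter(answers).most_common(1)[0]
    return top_answer, c / len(answers)

# Five hypothetical decodings of the same input.
samples = ["42", "42", "41", "42", "42"]
answer, score = consensus_score(samples)  # ("42", 0.8)
```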

2.2 Routing decision

The routing rule is a binary threshold:

\[
\text{Route}(x) = \begin{cases}
    \text{Short-path}, & \text{Consensus}(x) \ge \tau \\
    \text{Long-path},  & \text{Consensus}(x) < \tau
\end{cases}
\]

With M = 5 and τ = 0.8, at least four out of five decodings must agree to take the short path, directly outputting the most frequent answer and bypassing CoT generation.
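A minimal sketch of the threshold rule, assuming exact-match answers and using the M = 5, τ = 0.8 setting from the text. The function name `route` and its return values are illustrative, not taken from the paper.

```python
from collections import Counter

def route(answers, tau=0.8):
    """Return ("short", answer) if consensus >= tau, else ("long", None)."""
    top_answer, c = Counter(answers).most_common(1)[0]
    consensus = c / len(answers)
    if consensus >= tau:
        return "short", top_answer   # output the majority answer, skip CoT
    return "long", None              # fall through to full CoT generation

# Smallest vote count that satisfies the threshold for M = 5, tau = 0.8.
M, tau = 5, 0.8
min_votes = next(c for c in range(1, M + 1) if c / M >= tau)  # 4

easy = route(["2", "2", "2", "2", "3"])   # ("short", "2")
hard = route(["a", "b", "a", "c", "a"])   # ("long", None), consensus 0.6
```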

3. Discriminative Scorer with Hybrid Loss

When the long path is triggered, E‑GRM generates K candidate CoT chains and scores each with a lightweight discriminative module that takes the input x and a reasoning path r and outputs a scalar quality score. The scorer is trained with a hybrid loss combining:

Huber loss for regression, making the predicted score robust to outliers.

Hinge loss for ranking, enforcing a margin between positive (correct) and negative (incorrect) reasoning paths.

A weighting factor in the open interval (0, 1) that balances the two objectives.

This design yields both calibration (scores reflect true correctness probability) and discrimination (reliable ordering of good vs. bad reasoning).
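One way the hybrid objective could look for a single positive/negative pair is sketched below. Several details are assumptions not stated in the source: the convex combination `alpha * regression + (1 - alpha) * ranking`, the regression targets 1.0 and 0.0, and the `delta` and `margin` defaults.

```python
def huber(pred, target, delta=1.0):
    """Huber loss: quadratic near the target, linear for large errors."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)

def hinge_rank(pos_score, neg_score, margin=1.0):
    """Pairwise hinge loss: zero once pos beats neg by at least `margin`."""
    return max(0.0, margin - (pos_score - neg_score))

def hybrid_loss(pos_score, neg_score, alpha=0.5,
                pos_target=1.0, neg_target=0.0, delta=1.0, margin=1.0):
    """Assumed combination: alpha * regression + (1 - alpha) * ranking."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in the open interval (0, 1)")
    regression = 0.5 * (huber(pos_score, pos_target, delta)
                        + huber(neg_score, neg_target, delta))
    ranking = hinge_rank(pos_score, neg_score, margin)
    return alpha * regression + (1.0 - alpha) * ranking

perfect = hybrid_loss(1.0, 0.0)    # 0.0: targets hit and margin satisfied
inverted = hybrid_loss(0.0, 1.0)   # 1.25: both terms penalize the swap
```

The regression term pulls each score toward an absolute target, while the ranking term only cares about the gap between the two scores, which is why the text credits them with calibration and discrimination respectively.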

4. Two‑Stage Training

During supervised fine‑tuning (SFT), each training sample is decoded M times to compute consensus. Samples with consensus ≥ τ are placed in the short‑path set (learning direct input‑to‑answer mapping), while the rest form the long‑path set (learning full CoT generation). This separation enables the model to learn both fast‑answer and deep‑reasoning behaviors.
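The SFT data split described above can be sketched as follows. The `decode` callable is a hypothetical stand-in for one stochastic decoding of the model, and the scripted outputs are toy values used only to make the example deterministic.

```python
from collections import Counter

def split_by_consensus(samples, decode, M=5, tau=0.8):
    """Partition samples into short-path and long-path training sets."""
    short_set, long_set = [], []
    for x in samples:
        answers = [decode(x) for _ in range(M)]      # M decodings per sample
        _, c = Counter(answers).most_common(1)[0]
        (short_set if c / M >= tau else long_set).append(x)
    return short_set, long_set

# Scripted stand-in for a sampling model (hypothetical outputs).
scripted = {
    "1+1": iter(["2"] * 5),
    "hard integral": iter(["a", "b", "a", "c", "b"]),
}
decode = lambda x: next(scripted[x])

short_set, long_set = split_by_consensus(["1+1", "hard integral"], decode)
```

Here "1+1" reaches consensus 1.0 and lands in the short-path set, while "hard integral" reaches only 0.4 and goes to the long-path set.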

In the preference‑optimization stage (GRPO), the reward function combines a hard correctness term with the discriminative scorer’s quality difference, balanced by a weight in [0, 1]. A KL‑divergence regularizer keeps the policy close to the SFT reference.
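A rough sketch of how such a reward might feed GRPO's group-relative advantages. The exact reward formula is not given in the source, so the additive form, the `baseline_score` used for the quality difference, and the weight name `lam` are all assumptions. The KL term is left out because it belongs to the policy objective rather than the per-sample reward.

```python
from statistics import mean, pstdev

def combined_reward(is_correct, score, baseline_score, lam=0.5):
    """Assumed form: correctness (0 or 1) + lam * (score - baseline_score)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return float(is_correct) + lam * (score - baseline_score)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: standardize rewards within one sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four sampled responses for one prompt.
rewards = [
    combined_reward(True, 0.9, 0.5),    # correct, above-baseline reasoning
    combined_reward(False, 0.5, 0.5),   # wrong, average reasoning
    combined_reward(True, 0.5, 0.5),    # correct, average reasoning
    combined_reward(False, 0.3, 0.5),   # wrong, below-baseline reasoning
]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, two correct responses still receive different advantages when the scorer rates their reasoning differently, which is the signal that plain correctness rewards lack.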

5. Experiments

Benchmark overview (RM‑Bench, 32B model):

Chat: 75.6 %

Math: 80.0 %

Code: 66.5 %

Safety: 94.2 % (highest among baselines)

Average across domains: 79.2 %

Efficiency on the MATH dataset:

58 % of samples follow the short path, completely skipping CoT generation.

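Inference latency drops from 3.8 s (forced‑CoT) to 2.2 s, reported as a 62 % reduction, although the two quoted latencies correspond to a reduction of about 42 %.

FLOPs decrease from 23.7 T to 15.7 T, reported as a 49 % saving, although the two quoted figures correspond to a saving of about 34 %.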

Accuracy improves from 75.1 % to 78.4 %.

Ablation study (MATH), key variants:

Full E‑GRM : 78.4 % accuracy, 15.7 T FLOPs, 2.2 s latency.

Remove dynamic trigger: 75.2 % accuracy, 23.4 T FLOPs, 3.4 s latency.

Remove discriminative scorer: 72.8 % accuracy, 15.9 T FLOPs, 2.2 s latency.

Baseline CoT‑GRM: 69.1 % accuracy, 23.7 T FLOPs, 3.6 s latency.

The ablations confirm that dynamic triggering drives most of the efficiency gain, while the discriminative scorer contributes the majority of the accuracy improvement.
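The reading of the ablation can be checked with simple arithmetic on the figures listed above. The snippet only restates those numbers and computes each variant's difference from the full system; the dictionary layout is illustrative.

```python
# (accuracy %, FLOPs in T, latency in s), copied from the ablation list.
ablation = {
    "Full E-GRM": (78.4, 15.7, 2.2),
    "Remove dynamic trigger": (75.2, 23.4, 3.4),
    "Remove discriminative scorer": (72.8, 15.9, 2.2),
    "Baseline CoT-GRM": (69.1, 23.7, 3.6),
}

full_acc, full_flops, full_lat = ablation["Full E-GRM"]

# Difference of each variant relative to the full system.
deltas = {
    name: (round(acc - full_acc, 1),
           round(flops - full_flops, 1),
           round(lat - full_lat, 1))
    for name, (acc, flops, lat) in ablation.items()
    if name != "Full E-GRM"
}
# Remove dynamic trigger:       -3.2 pts, +7.7 T, +1.2 s
# Remove discriminative scorer: -5.6 pts, +0.2 T, +0.0 s
# Baseline CoT-GRM:             -9.3 pts, +8.0 T, +1.4 s
```

Removing the trigger restores almost all of the baseline's compute cost, while removing the scorer leaves cost nearly unchanged but costs the larger share of accuracy.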

6. Contributions and Limitations

Contributions:

Uncertainty‑driven dynamic routing that repurposes model‑internal confidence as a compute‑allocation signal.

Hybrid‑loss discriminative scorer that provides calibrated, continuous quality estimates, outperforming binary voting.

End‑to‑end framework integrating dynamic trigger, scorer, and preference optimization, achieving simultaneous latency reduction and accuracy gains.

Limitations:

Parallel decoding (M = 5) adds ~5 % overhead to latency.

The threshold τ = 0.8 is empirically stable on in‑distribution data but may require recalibration for out‑of‑distribution or adversarial inputs.
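Recalibrating τ on a new distribution could be approached with a simple sweep over a held-out set. This is a hypothetical procedure, not one described in the source: the `records` below are toy data, and the selection criterion (short-path accuracy versus short-path share) is an assumption.

```python
def sweep_thresholds(records, taus):
    """For each tau, report the short-path share and its majority-answer accuracy.

    records: list of (consensus, majority_answer_is_correct) pairs.
    """
    rows = []
    for tau in taus:
        short = [ok for cons, ok in records if cons >= tau]
        share = len(short) / len(records)
        accuracy = sum(short) / len(short) if short else None
        rows.append((tau, share, accuracy))
    return rows

# Toy validation records (made up for illustration).
records = [(1.0, True), (1.0, True), (0.8, True),
           (0.8, False), (0.6, False), (0.4, True)]

rows = sweep_thresholds(records, taus=[0.6, 0.8, 1.0])
```

On this toy data, raising τ shrinks the short-path share (5/6, 4/6, 2/6) while its accuracy rises (0.6, 0.75, 1.0), which is the trade-off a recalibration would need to balance.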

Scorer generalization to unseen reasoning patterns remains an open question.

7. Conclusion

E‑GRM challenges the prevailing belief that “more reasoning is always better” by showing that a capable model should also know when not to reason deeply. On the MATH benchmark, it routes 58 % of questions to the short path and cuts inference latency from 3.8 s to 2.2 s while raising accuracy to 78.4 %, demonstrating a practical path toward efficient, high‑fidelity LLM evaluation.

Paper title: Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty
arXiv link: https://arxiv.org/abs/2604.10072
Authors: Tencent Hunyuan & UNSW
Conference: ACL 2026
Keywords: E‑GRM, Dynamic CoT Trigger, Model‑Internal Uncertainty, Discriminative Scoring, GRM, Efficiency, Reward Fidelity