Why Large Language Models Need Not Run CoT on Every Question: Tencent Hunyuan’s On‑Demand CoT Trigger
The paper analyzes the efficiency and reward‑signal shortcomings of conventional generative reward models (GRM) and presents the E‑GRM framework. E‑GRM uses model‑internal uncertainty to trigger chain‑of‑thought reasoning only when needed, makes the routing decision from the consensus among parallel decodings, and scores reasoning paths with a discriminative scorer trained on a mixed loss. The authors report lower inference cost and higher accuracy on benchmarks such as MATH, RM‑Bench and RewardBench.
Introduction: Efficiency Bottlenecks of GRM
Generative Reward Models (GRM) improve large language model (LLM) reasoning by prompting chain‑of‑thought (CoT) generation, but existing implementations suffer from two fundamental issues: (1) computational inefficiency because they apply full CoT to every input regardless of difficulty, and (2) coarse reward signals due to voting‑based aggregation that cannot distinguish high‑quality from erroneous reasoning paths.
Prior adaptive CoT approaches rely on handcrafted heuristics or task‑specific features, limiting generalisation and requiring extensive tuning.
E‑GRM Core Method: Dynamic CoT Trigger via Model‑Internal Uncertainty
2.1 Quantifying Model Internal Uncertainty
Given an input prompt, E‑GRM first performs k parallel decodings, each with different sampling hyper‑parameters (e.g., temperature, top‑p), producing a set of k initial responses. The consensus metric is the fraction of those k responses that agree on the most frequent answer. It serves as a direct measure of the model’s internal uncertainty and as a proxy for problem complexity: high agreement suggests an easy problem, low agreement a hard one.
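The consensus computation described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `normalize` helper and the assumption that answers can be compared as normalized strings are hypothetical, and the k sampled answers are hard-coded instead of coming from a model.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Hypothetical normalization: trim whitespace and lowercase."""
    return answer.strip().lower()


def consensus(answers: list[str]) -> tuple[str, float]:
    """Return the most frequent answer and the fraction of samples matching it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(normalize(a) for a in answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


# Stand-in for k = 5 decodings with different (temperature, top-p) settings.
samples = ["42", "42 ", "42", "41", "42"]
answer, score = consensus(samples)
print(answer, score)
```

Here four of the five samples agree, so the consensus is 0.8. In practice the answer-extraction and normalization step matters a great deal, since two equivalent answers written differently would otherwise lower the measured consensus.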
2.2 Dynamic Routing Decision
Based on the computed consensus, E‑GRM makes a binary routing decision using a preset threshold (0.8 in the paper):
Short‑path : If consensus ≥ threshold, the problem is deemed simple or the model is highly confident; the system outputs the consensus answer directly, skipping CoT generation and saving computation.
Long‑path : If consensus < threshold, the problem is considered complex; the system triggers the full CoT generation pipeline to produce step‑by‑step reasoning.
Figure 1 (left) illustrates the parallel decoding and consensus calculation; the right side shows the discriminative scorer evaluating multiple CoT paths.
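The routing rule can be sketched as follows, assuming the 0.8 threshold quoted above. The callables `sample_answers` and `run_cot` are hypothetical placeholders for the parallel decoding step and the full CoT pipeline; the stubs below only exist to make the example runnable.

```python
from collections import Counter
from typing import Callable

THRESHOLD = 0.8  # threshold value reported in the text


def route(
    prompt: str,
    sample_answers: Callable[[str], list[str]],  # hypothetical: k quick decodings
    run_cot: Callable[[str], str],               # hypothetical: full CoT pipeline
    threshold: float = THRESHOLD,
) -> tuple[str, str]:
    """Return (path, answer), where path is 'short' or 'long'."""
    answers = sample_answers(prompt)
    top_answer, top_count = Counter(answers).most_common(1)[0]
    if top_count / len(answers) >= threshold:
        # High agreement: return the consensus answer and skip CoT.
        return "short", top_answer
    # Low agreement: fall back to step-by-step reasoning.
    return "long", run_cot(prompt)


# Stub components for illustration only.
def easy_sampler(prompt: str) -> list[str]:
    return ["4", "4", "4", "4", "4"]


def hard_sampler(prompt: str) -> list[str]:
    return ["12", "15", "12", "9", "18"]


def stub_cot(prompt: str) -> str:
    return "answer produced by full CoT"


print(route("2 + 2 = ?", easy_sampler, stub_cot))
print(route("a harder problem", hard_sampler, stub_cot))
```

Note that a consensus exactly equal to the threshold takes the short path, matching the "≥" in the rule above.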
2.3 Theoretical Basis and Advantages
The authors argue that for simple or factual questions the model’s conditional probability distribution peaks sharply, causing different decoding paths to converge quickly to the same answer (high consensus). For complex reasoning tasks the distribution is flatter, leading to diverse outputs (low consensus). This insight yields three methodological advantages:
Task‑agnostic : No hand‑crafted features; the mechanism relies solely on model behaviour, enabling cross‑domain generalisation.
Computationally lightweight : Parallel decoding adds only ~5 % latency compared with full CoT.
Empirical effectiveness : On the MATH dataset, 58 % of samples are identified as short‑path, providing a solid basis for efficiency gains.
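The link between a peaked answer distribution and high consensus can be illustrated with a toy simulation. The two probability vectors below are invented for illustration, and the sample count is far larger than any realistic k so that the estimate is stable; nothing here comes from the paper's experiments.

```python
import random
from collections import Counter


def sampled_consensus(probs: list[float], k: int, rng: random.Random) -> float:
    """Draw k answers from a categorical distribution and return the mode's share."""
    answers = rng.choices(range(len(probs)), weights=probs, k=k)
    return Counter(answers).most_common(1)[0][1] / k


rng = random.Random(0)  # fixed seed for reproducibility

peaked = [0.97, 0.01, 0.01, 0.01]  # toy "easy question": one dominant answer
flat = [0.25, 0.25, 0.25, 0.25]    # toy "hard question": no dominant answer

peaked_c = sampled_consensus(peaked, 1000, rng)
flat_c = sampled_consensus(flat, 1000, rng)
print(f"peaked: {peaked_c:.2f}, flat: {flat_c:.2f}")
```

With many samples the consensus approaches the probability of the most likely answer, so a sharply peaked distribution clears a 0.8 threshold while a flat one does not. With a small k the estimate is noisier, which is one reason the threshold may need tuning.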
Empirical Validation
On the MATH benchmark, dynamic CoT triggering reduces average inference latency from 3.8 s (forced‑CoT) to 2.2 s and FLOPs from 23.7 T to 15.7 T, reductions of roughly 42 % and 34 % by these figures, while accuracy improves from 75.1 % to 78.4 %.
Broader evaluations show:
RM‑Bench: the 32B model achieves a 79.2 % average standardized score, surpassing GPT‑4o (73.8 %).
RMB benchmark: overall score 0.743.
RewardBench: overall score 91.5 %, with reasoning sub‑score 95.4 % and safety sub‑score 92.0 %.
Ablation studies confirm that removing the dynamic trigger increases FLOPs by 49 % and latency by 55 %, while removing the discriminative scorer drops accuracy by up to 5.6 %, demonstrating the necessity of both components.
Discriminative Scorer: Mixed‑Loss Design
To overcome the coarse voting mechanism, E‑GRM introduces a lightweight discriminative scorer that takes the input and a generated reasoning path and outputs a scalar quality score. The scorer is trained with a mixed loss comprising:
Huber regression loss : Robust to outliers, aligns the score with real quality labels or pseudo‑labels.
Hinge ranking loss : Encourages higher scores for positive (high‑quality) paths than for negative ones, with a margin hyper‑parameter.
Balancing hyper‑parameter : Controls the relative weight of regression versus ranking objectives.
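The three ingredients listed above can be combined as in the sketch below. The Huber and hinge formulas are the standard ones; how the paper weights them is not stated here, so the convex combination with `lam`, the default `delta` and `margin` values, and averaging the regression term over the two paths are all assumptions.

```python
def huber(pred: float, target: float, delta: float = 1.0) -> float:
    """Standard Huber loss: quadratic near zero, linear for large errors."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)


def hinge_rank(score_pos: float, score_neg: float, margin: float = 0.5) -> float:
    """Pairwise hinge loss: zero once the positive path leads by at least `margin`."""
    return max(0.0, margin - (score_pos - score_neg))


def mixed_loss(
    score_pos: float,
    score_neg: float,
    label_pos: float,
    label_neg: float,
    lam: float = 0.5,     # assumed balancing weight between the two objectives
    delta: float = 1.0,   # assumed Huber transition point
    margin: float = 0.5,  # assumed ranking margin
) -> float:
    """Assumed combination: lam * regression + (1 - lam) * ranking."""
    regression = 0.5 * (
        huber(score_pos, label_pos, delta) + huber(score_neg, label_neg, delta)
    )
    ranking = hinge_rank(score_pos, score_neg, margin)
    return lam * regression + (1.0 - lam) * ranking


# A well-calibrated, well-separated pair incurs no loss.
print(mixed_loss(score_pos=1.0, score_neg=0.0, label_pos=1.0, label_neg=0.0))
# A pair that is ranked correctly but too close together is still penalized.
print(mixed_loss(score_pos=0.6, score_neg=0.5, label_pos=1.0, label_neg=0.0))
```

The regression term pulls each score toward its label (calibration), while the ranking term only cares about the gap between the two scores (ordering), which is why the text credits the combination with both properties.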
This design yields both calibration (scores reflect true quality) and discriminative power (effective ordering), providing high‑fidelity reward signals for subsequent reinforcement learning.
Training Procedure: Two‑Stage Optimisation
Stage 1 – Supervised Fine‑Tuning (SFT)
The dynamic trigger partitions the training data into:
Short‑path sample set : High‑consensus examples where the model learns to map directly from input to final answer without intermediate steps.
Long‑path sample set : Low‑consensus examples where the model learns to generate full CoT chains.
This distinction enables the model to internalise “when to reason” and “how to reason”.
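A minimal sketch of this partitioning step is shown below. The record layout (a dict with a `sampled_answers` field) and the example data are hypothetical; the threshold is the 0.8 value quoted earlier in the article.

```python
from collections import Counter


def split_by_consensus(
    examples: list[dict], threshold: float = 0.8
) -> tuple[list[dict], list[dict]]:
    """Partition examples into (short_path, long_path) sets by answer consensus."""
    short_path, long_path = [], []
    for ex in examples:
        answers = ex["sampled_answers"]  # hypothetical field: k quick decodings
        top_count = Counter(answers).most_common(1)[0][1]
        if top_count / len(answers) >= threshold:
            short_path.append(ex)  # train to answer directly
        else:
            long_path.append(ex)   # train to produce a full CoT chain
    return short_path, long_path


# Invented examples for illustration.
data = [
    {"id": "q1", "sampled_answers": ["7", "7", "7", "7", "7"]},
    {"id": "q2", "sampled_answers": ["3", "5", "3", "8", "2"]},
    {"id": "q3", "sampled_answers": ["x", "x", "x", "x", "y"]},
]
short_set, long_set = split_by_consensus(data)
print([ex["id"] for ex in short_set], [ex["id"] for ex in long_set])
```

Each subset would then be paired with a different target format during fine-tuning: a direct answer for the short-path set and a full reasoning chain for the long-path set.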
Stage 2 – Preference Optimisation (Extended GRPO)
Building on standard Group Relative Policy Optimisation (GRPO), E‑GRM extends the objective to better exploit paired preference data. The reward function combines an answer‑correctness term and a scorer‑difference term, while a KL‑regularisation term stabilises policy updates.
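The shape of such a reward can be sketched as follows. The exact formula is not given in this summary, so the weights `alpha` and `beta`, and the reading of the scorer-difference term as the gap between scorer outputs for the preferred and rejected responses, are assumptions. The group-relative normalization (each reward standardized within its sampled group) is the general GRPO idea; the KL term is left out of the sketch.

```python
from statistics import mean, pstdev


def reward(
    is_correct: bool,
    score_chosen: float,
    score_rejected: float,
    alpha: float = 1.0,  # assumed weight on answer correctness
    beta: float = 0.5,   # assumed weight on the scorer-difference term
) -> float:
    """Assumed form: correctness term plus weighted scorer gap."""
    return alpha * float(is_correct) + beta * (score_chosen - score_rejected)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one sampled group of responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Invented group of four sampled responses to one prompt.
group = [
    reward(True, 0.9, 0.4),
    reward(True, 0.7, 0.6),
    reward(False, 0.5, 0.5),
    reward(False, 0.3, 0.6),
]
advantages = group_relative_advantages(group)
print([round(a, 3) for a in advantages])
```

Because advantages are computed relative to the group mean, responses that are both correct and clearly preferred by the scorer receive the largest positive signal, while the KL penalty (not shown) would keep the updated policy close to the reference model.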
Contributions and Discussion
The paper’s main contributions are:
Demonstrating that model‑internal uncertainty can serve as a universal signal for dynamic inference depth.
Designing a mixed‑loss discriminative scorer that provides fine‑grained evaluation of reasoning paths.
Integrating dynamic triggering, discriminative scoring, and policy optimisation into an end‑to‑end trainable framework.
Compared with adaptive CoT methods (e.g., AdaCoT), E‑GRM requires no hand‑crafted heuristics and yields larger efficiency gains (see Table 3). Compared with standard voting‑based GRM, the discriminative scorer better distinguishes reasoning quality, achieving higher accuracy under the same compute budget. Unlike early‑stop mechanisms that rely on single‑step confidence, E‑GRM’s consensus‑based decision is more robust.
Limitations and Future Work
Parallel decoding incurs a modest overhead (~5 %); further optimisation is needed for ultra‑low‑latency scenarios.
The consensus threshold, while effective across several datasets, may need adaptive calibration for highly specialised or out‑of‑distribution domains.
The scorer’s performance depends on the diversity of reasoning styles in the training data; novel reasoning patterns may challenge its generalisation.
Conclusion
E‑GRM leverages model‑internal uncertainty to trigger CoT on demand and employs a mixed‑loss discriminative scorer, delivering a more efficient and accurate generative reward modelling paradigm. By addressing the core trade‑off between efficiency and reward fidelity, the method offers broad applicability and opens new research directions for understanding and utilising uncertainty in large language models.
