
Early‑Stopping Self‑Consistency (ESC): Reducing Sampling Cost for Large Language Model Reasoning

Early‑Stopping Self‑Consistency (ESC) dynamically halts sampling once the answer distribution within a sliding window reaches zero entropy, cutting the number of LLM reasoning samples required by 33.8%–84.2% across arithmetic, commonsense, and symbolic benchmarks while preserving accuracy, and offering a theoretically bounded, robust, budget‑adaptive alternative to traditional Self‑Consistency.

Xiaohongshu Tech REDtech

Large language models (LLMs) achieve strong reasoning abilities when guided by Chain‑of‑Thought (CoT) prompts, which simulate step‑by‑step human thinking. Self‑Consistency (SC) is a widely used decoding strategy that generates multiple reasoning paths and selects the majority answer, greatly improving performance on multi‑step tasks but incurring high sampling costs.

At ICLR 2024, the Xiaohongshu search algorithm team introduced Early‑Stopping Self‑Consistency (ESC), a simple and scalable sampling process that dramatically lowers SC’s cost without sacrificing accuracy. ESC dynamically stops sampling when the answer distribution within a sliding window has zero entropy (i.e., all samples agree), thereby truncating the decoding process.
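The stopping rule above can be sketched in a few lines. This is a minimal illustration of the idea, not the authors' implementation: `sample_answer` is a hypothetical stand-in for drawing one Chain-of-Thought sample from the model and parsing out its final answer, and the window size and sample cap are placeholder values.

```python
import math
from collections import Counter
from typing import Callable, List

def esc_decode(sample_answer: Callable[[], str],
               window_size: int = 5,
               max_samples: int = 40) -> str:
    """Sketch of Early-Stopping Self-Consistency (ESC).

    Draws reasoning samples in windows and stops as soon as one
    window's answer distribution has zero entropy (all agree).
    """
    answers: List[str] = []
    while len(answers) < max_samples:
        # Draw one window of reasoning paths.
        window = [sample_answer() for _ in range(window_size)]
        answers.extend(window)
        # Entropy of the answer distribution within this window.
        counts = Counter(window)
        entropy = -sum((c / window_size) * math.log(c / window_size)
                       for c in counts.values())
        # Zero entropy: every sample in the window agrees, so stop early.
        if entropy == 0.0:
            break
    # Majority vote over all samples drawn so far, exactly as in plain SC.
    return Counter(answers).most_common(1)[0][0]
```

When the model is confident, the loop exits after a single window; when answers disagree, sampling continues up to the budget and the result degenerates to ordinary Self-Consistency.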

Experiments were conducted on three representative reasoning tasks—arithmetic (MATH, GSM8K), commonsense (CommonsenseQA, StrategyQA), and symbolic (Last Letter Concatenation, Coin Flip)—using GPT‑4, GPT‑3.5‑Turbo, and LLaMA‑2 7B in a few‑shot setting. ESC reduced the average number of samples by 33.8%–84.2% across the six benchmarks while maintaining performance comparable to full SC.

The authors provide a theoretical analysis showing that the probability of inconsistency between ESC and SC is bounded by a negligible value (e.g., <0.002 when the window size is 8). A dynamic control scheme is derived to select optimal window sizes and maximum sampling numbers for different tasks and models, achieving a desirable performance‑cost trade‑off without any model‑specific tuning.
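To build intuition for why that bound is so small (this is a back-of-envelope simplification, not the paper's formal derivation): if each sample independently returns the eventual majority answer with probability p, a window can only mislead ESC when all w samples agree on a non-majority answer, which happens with probability at most (1 − p)^w.

```python
def unanimous_minority_prob(p: float, w: int) -> float:
    """Simplified model: probability that a window of size w is
    unanimously non-majority, when each sample hits the majority
    answer independently with probability p."""
    return (1.0 - p) ** w

# With p = 0.6 and a window of 8, the chance is 0.4**8 ≈ 0.00066,
# comfortably below the 0.002 figure cited above.
```

The exponential decay in w is what makes even modest window sizes safe, and it is why enlarging the window buys reliability far faster than it costs samples.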

Robustness studies demonstrate that ESC is stable across varying sampling budgets, temperature settings, top‑k values, and even in zero‑shot scenarios. Additional experiments on open‑ended generation (the MBPP code benchmark) confirm that ESC extends to tasks without a fixed answer format.

Overall, ESC offers a cost‑effective alternative to traditional SC, enabling large‑scale LLM inference with substantially fewer samples while preserving accuracy, and its dynamic control mechanism adapts to diverse budget and performance requirements.

AI · LLM · Chain-of-Thought · Self-Consistency · Early-Stopping · Inference · Sampling Efficiency
Written by

Xiaohongshu Tech REDtech

The official account of the Xiaohongshu tech team, sharing technical innovations and engineering insights.
