OptScale: Probabilistic Optimal Stopping for Inference‑Time Scaling
OptScale introduces a probabilistic framework that determines the optimal number of inference samples needed to meet a target accuracy with a confidence guarantee. It dramatically reduces token usage while maintaining or improving performance across a range of large‑language‑model benchmarks.
This article presents OptScale, a novel algorithm that reframes inference‑time scaling as a probabilistic optimal stopping problem rather than a heuristic "more samples = better" rule.
1. Rethinking Inference‑Time Scaling
Traditional methods such as Self‑Consistency and Best‑of‑N, as well as the sampling strategies used by reasoning models like DeepSeek‑R1 and o1, generate multiple parallel inference paths and select the best answer, implicitly assuming that more samples always improve performance.
Sampling more does not always help; there exists a computable, confidence‑guaranteed optimal sampling point.
2. Probabilistic Modeling
OptScale treats each verifier score (e.g., from a PRM) as an independent draw from a common distribution. By modeling these scores with a truncated normal distribution, the method derives the cumulative distribution function (CDF) of the maximum score among N samples.
This enables the calculation of the expected best score and its marginal gains, revealing diminishing returns as N grows.
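Under the independence assumption the derivation is short: if each verifier score has CDF F, the best of N i.i.d. draws has CDF F(x)^N, and the expected best score follows by integrating the survival function. A minimal sketch in Python (the function names and the [0, 1] score range are illustrative assumptions, not the paper's code):

```python
import numpy as np
from scipy.stats import truncnorm

def max_score_cdf(x, n, mu, sigma, lo=0.0, hi=1.0):
    """CDF of the best verifier score among n i.i.d. draws: F(x)**n."""
    a, b = (lo - mu) / sigma, (hi - mu) / sigma  # standardized bounds
    return truncnorm.cdf(x, a, b, loc=mu, scale=sigma) ** n

def expected_best_score(n, mu, sigma, lo=0.0, hi=1.0, grid=2001):
    """E[max of n scores] = lo + integral over [lo, hi] of (1 - F(x)**n)."""
    xs = np.linspace(lo, hi, grid)
    surv = 1.0 - max_score_cdf(xs, n, mu, sigma, lo, hi)
    # trapezoidal rule by hand, for portability across NumPy versions
    return lo + float(np.sum((surv[:-1] + surv[1:]) * np.diff(xs)) / 2.0)
```

The marginal gain `expected_best_score(n + 1, ...) - expected_best_score(n, ...)` shrinks as n grows, which is exactly the diminishing-returns effect described above.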
3. Optimal Stopping Condition
Given a target verifier‑score threshold and a confidence level α (e.g., 95%), OptScale computes the minimal N that satisfies

Pr(max score among N samples ≥ threshold) ≥ α.

The resulting N is the theoretically optimal number of samples, turning the question "should we sample again?" into a calculable decision.
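Because Pr(max ≥ t) = 1 − F(t)^N, the condition can be solved for N in closed form: N ≥ log(1 − α) / log F(t) whenever 0 < F(t) < 1. A hedged sketch (the function name and the `n_max` cap are illustrative, not from the paper):

```python
import math
from scipy.stats import truncnorm

def optimal_sample_count(threshold, alpha, mu, sigma,
                         lo=0.0, hi=1.0, n_max=1024):
    """Smallest N with Pr(best of N scores >= threshold) >= alpha.

    Since Pr(max >= t) = 1 - F(t)**N, solving for N gives
    N >= log(1 - alpha) / log(F(t)).
    """
    a, b = (lo - mu) / sigma, (hi - mu) / sigma
    f_t = truncnorm.cdf(threshold, a, b, loc=mu, scale=sigma)
    if f_t <= 1.0 - alpha:   # a single sample already meets the guarantee
        return 1
    if f_t >= 1.0:           # threshold unreachable under the fitted model
        return n_max
    return min(n_max, math.ceil(math.log(1.0 - alpha) / math.log(f_t)))
```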
4. Estimating the Verifier Score Distribution
Since the true distribution is unknown during inference, OptScale approximates it with a truncated normal distribution parameterized by the observed mean and variance of verifier scores.
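One simple plug-in version of this estimate (a sketch under my own assumptions, not the paper's exact estimator) uses the sample mean and standard deviation directly as the location and scale:

```python
import numpy as np

def estimate_params(scores, min_sigma=1e-3):
    """Plug-in estimate of the truncated normal's location/scale from
    observed verifier scores. Using the raw sample mean and standard
    deviation is a moment approximation: the moments of a truncated
    normal differ from those of its underlying normal, so this
    slightly understates sigma when the mean sits near a bound."""
    scores = np.asarray(scores, dtype=float)
    mu = float(scores.mean())
    sd = float(scores.std(ddof=1)) if scores.size > 1 else min_sigma
    return mu, max(sd, min_sigma)  # floor sigma to keep the CDF well-defined
```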
5. Two Implementations
OptScaleᵗ: uses lightweight offline‑trained MLPs to predict distribution parameters and updates them online via MAP estimation.
OptScale⁰: training‑free; relies solely on historical verifier scores and updates parameters via maximum likelihood estimation.
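For the training-free variant, one plausible way to keep the parameters current as scores stream in is Welford's online algorithm for the running mean and variance (a hypothetical sketch; the paper's exact maximum-likelihood update for the truncated normal may differ):

```python
class OnlineScoreStats:
    """Numerically stable running mean/variance (Welford's algorithm),
    updated one verifier score at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, score):
        self.n += 1
        delta = score - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (score - self.mean)

    @property
    def variance(self):
        """Unbiased sample variance; 0.0 until two scores are seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0
```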
6. Adaptive Inference Workflow
Generate a small batch of inference samples.
Compute verifier scores and estimate the distribution.
Calculate the theoretical optimal sample count.
If the current sample count is below this number, continue sampling; otherwise stop.
This process yields early stopping for easy queries and allocates more compute to harder ones, reducing overall token consumption.
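The four steps above can be sketched as a single loop. Everything here is an assumption for illustration: `generate()` returns one inference sample, `verify(sample)` scores it in [0, 1], and the batch size and `n_max` cap are arbitrary defaults, not values from the paper:

```python
import math
import numpy as np
from scipy.stats import truncnorm

def adaptive_sampling(generate, verify, threshold=0.9, alpha=0.95,
                      batch=4, n_max=64):
    """Hedged sketch of the adaptive inference workflow."""
    samples, scores = [], []
    while len(samples) < n_max:
        # 1) generate a small batch of samples and score them
        for _ in range(batch):
            s = generate()
            samples.append(s)
            scores.append(verify(s))
        # 2) re-estimate the score distribution from observed scores
        mu = float(np.mean(scores))
        sigma = max(float(np.std(scores, ddof=1)), 1e-3) if len(scores) > 1 else 0.1
        a, b = (0.0 - mu) / sigma, (1.0 - mu) / sigma
        f_t = truncnorm.cdf(threshold, a, b, loc=mu, scale=sigma)
        # 3) theoretical optimal sample count under the current estimate
        if f_t <= 1.0 - alpha:
            n_opt = 1
        elif f_t >= 1.0:
            n_opt = n_max
        else:
            n_opt = math.ceil(math.log(1.0 - alpha) / math.log(f_t))
        # 4) stop once the current count reaches the optimum (or the cap)
        if len(samples) >= min(n_opt, n_max):
            break
    best = max(range(len(samples)), key=scores.__getitem__)
    return samples[best], scores[best], len(samples)
```

An easy query with high, tightly clustered scores yields a small n_opt and stops after the first batch; a hard query with low or dispersed scores keeps the loop sampling, which is the compute-allocation behavior described above.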
7. Experimental Results
Across five mathematical reasoning benchmarks (MATH‑500, GSM8K, AIME 2024/2025, AMC 2023) and multiple model sizes, OptScale consistently achieves comparable or higher accuracy with 30‑55% fewer tokens than Best‑of‑N and Self‑Consistency baselines.
Key findings include:
Token savings increase with larger N (e.g., 38% at N=16, 54% at N=64).
OptScale⁰ excels on simpler tasks, while OptScaleᵗ performs better on very hard tasks.
In some cases OptScale even surpasses the theoretical performance ceiling that Best‑of‑N attains with unlimited sampling.
8. Significance
OptScale provides the first theoretically grounded lower bound for inference‑time scaling, shifting the paradigm from empirical sampling to probabilistic optimality. It offers a universal framework applicable to any system that relies on parallel sampling, including emerging reasoning models such as o1 and DeepSeek‑R1.
By integrating probability modeling, theoretical bounds, and practical algorithms, OptScale answers the critical question: "When should we stop sampling?"
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
