OptScale: Probabilistic Optimal Stopping for Inference‑Time Scaling

OptScale introduces a probabilistic framework that determines the optimal number of inference samples needed to meet a target accuracy with a confidence guarantee, dramatically reducing token usage while maintaining or improving performance across various large‑language‑model benchmarks.

This article presents OptScale, a novel algorithm that reframes inference‑time scaling as a probabilistic optimal stopping problem rather than a heuristic "more samples = better" rule.

1. Rethinking Inference‑Time Scaling

Traditional approaches such as Self-Consistency and Best-of-N, as well as the parallel sampling used by o1-style models and DeepSeek-R1, generate multiple inference paths and select the best answer, on the assumption that more samples always improve performance.

Sampling more does not always help; there exists a computable, confidence‑guaranteed optimal sampling point.

2. Probabilistic Modeling

OptScale treats each verifier score (e.g., from a process reward model, PRM) as an independent draw from a common distribution. By modeling these scores with a truncated normal distribution, the method derives the cumulative distribution function (CDF) of the maximum score among N samples.
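
As a pointer to the underlying order-statistics identity (my notation, not the paper's; the [0, 1] truncation range is an assumption based on typical PRM score ranges): if the N verifier scores are i.i.d. with common CDF F, then

```latex
% CDF of the best score among N i.i.d. verifier scores:
\Pr\Bigl(\max_{1 \le i \le N} s_i \le x\Bigr)
  = \prod_{i=1}^{N} \Pr(s_i \le x)
  = F(x)^N,
\qquad
F(x) = \frac{\Phi\!\left(\frac{x-\mu}{\sigma}\right) - \Phi\!\left(\frac{-\mu}{\sigma}\right)}
            {\Phi\!\left(\frac{1-\mu}{\sigma}\right) - \Phi\!\left(\frac{-\mu}{\sigma}\right)}
```

where Φ is the standard normal CDF, so F is the CDF of a normal with mean μ and scale σ truncated to [0, 1].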

This enables the calculation of the expected best score and its marginal gains, revealing diminishing returns as N grows.
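
A quick numerical sketch of those diminishing returns (in Python; the parameter values are my illustrative assumptions, not figures from the paper): for scores supported on [0, 1], the expected best score is E[max] = ∫₀¹ (1 − F(x)^N) dx, and the gain from each additional sample shrinks as N grows.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import truncnorm

# Illustrative verifier-score distribution (assumed, not from the paper):
# Normal(mu, sigma) truncated to [0, 1].
mu, sigma = 0.6, 0.2
a, b = (0.0 - mu) / sigma, (1.0 - mu) / sigma  # truncation bounds in z-units
dist = truncnorm(a, b, loc=mu, scale=sigma)

def expected_best_score(n: int, grid: int = 2001) -> float:
    """E[max of n i.i.d. scores] = integral over [0, 1] of (1 - F(x)**n) dx."""
    x = np.linspace(0.0, 1.0, grid)
    return float(trapezoid(1.0 - dist.cdf(x) ** n, x))

prev = None
for n in (1, 2, 4, 8, 16, 32, 64):
    e = expected_best_score(n)
    gain = "" if prev is None else f"  marginal gain: {e - prev:+.4f}"
    print(f"N={n:<3d} E[best score]={e:.4f}{gain}")
    prev = e
```

Running this shows the marginal gain from each doubling of N shrinking toward zero, which is what makes a principled stopping point worth computing.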

3. Optimal Stopping Condition

Given a target success level (e.g., 95% accuracy) and a confidence level α, OptScale computes the minimal N that satisfies

Pr(max score ≥ threshold) ≥ α.

The resulting N is the theoretically optimal number of samples, turning the question "should we sample again?" into a calculable decision.
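
Under the i.i.d. model above, this condition has a closed form: Pr(max ≥ τ) = 1 − F(τ)^N ≥ α gives N ≥ log(1 − α) / log F(τ). A minimal sketch follows; the threshold τ, the confidence α, and the fitted (μ, σ) are assumed inputs, and the paper's exact formulation may handle edge cases differently.

```python
import math
from scipy.stats import truncnorm

def minimal_samples(mu: float, sigma: float, tau: float, alpha: float) -> int:
    """Smallest N with Pr(best of N scores >= tau) >= alpha, assuming i.i.d.
    scores from a Normal(mu, sigma) truncated to [0, 1].

    1 - F(tau)**N >= alpha  <=>  N >= log(1 - alpha) / log(F(tau))
    """
    a, b = (0.0 - mu) / sigma, (1.0 - mu) / sigma
    f_tau = truncnorm(a, b, loc=mu, scale=sigma).cdf(tau)
    if f_tau <= 0.0:
        return 1  # every sample is expected to clear tau
    if f_tau >= 1.0:
        raise ValueError("tau lies above the distribution's support; no finite N suffices")
    return math.ceil(math.log(1.0 - alpha) / math.log(f_tau))

# Example with assumed values: target score 0.9 with 95% confidence.
print(minimal_samples(mu=0.6, sigma=0.2, tau=0.9, alpha=0.95))
```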

4. Estimating the Verifier Score Distribution

Since the true distribution is unknown during inference, OptScale approximates it with a truncated normal distribution parameterized by the observed mean and variance of verifier scores.
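
A minimal sketch of that plug-in parameterization (it simply reuses the sample mean and standard deviation as the location and scale of the truncated normal; this ignores the bias the truncation itself introduces, and the paper's estimator may correct for it):

```python
import numpy as np
from scipy.stats import truncnorm

def fit_truncnorm_plugin(scores):
    """Plug-in fit of a [0, 1]-truncated normal: reuse the sample mean and
    standard deviation as location and scale (a rough approximation)."""
    mu = float(np.mean(scores))
    sigma = max(float(np.std(scores, ddof=1)), 1e-3)  # guard against zero spread
    a, b = (0.0 - mu) / sigma, (1.0 - mu) / sigma
    return truncnorm(a, b, loc=mu, scale=sigma)

scores = np.array([0.42, 0.55, 0.61, 0.58, 0.73])  # verifier scores observed so far
dist = fit_truncnorm_plugin(scores)
print(dist.mean(), dist.std())
```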

5. Two Implementations

OptScaleᵗ: uses lightweight, offline-trained MLPs to predict the distribution parameters and refines them online via maximum a posteriori (MAP) estimation.

OptScale⁰: training-free; relies solely on the verifier scores observed so far and updates the parameters via maximum likelihood estimation (MLE).
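
To make the MAP-versus-MLE contrast concrete, here is a deliberately simplified sketch using a standard conjugate-normal update on the mean parameter only. It is not the paper's algorithm: the offline MLP prior, the variance handling, and the exact likelihood are not reproduced, and prior_mu, prior_var, and noise_var are illustrative stand-ins.

```python
import numpy as np

def map_update_mean(prior_mu: float, prior_var: float,
                    scores: np.ndarray, noise_var: float) -> float:
    """Conjugate-normal MAP estimate of the score mean: an offline-predicted
    prior (in the spirit of OptScale^t) shrunk toward the online observations."""
    precision = 1.0 / prior_var + len(scores) / noise_var
    return float((prior_mu / prior_var + scores.sum() / noise_var) / precision)

def mle_mean(scores: np.ndarray) -> float:
    """Training-free estimate (in the spirit of OptScale^0): the sample mean."""
    return float(scores.mean())

scores = np.array([0.42, 0.55, 0.61])
print(map_update_mean(prior_mu=0.7, prior_var=0.05, scores=scores, noise_var=0.04))
print(mle_mean(scores))
```

The point is the contrast: the trained variant starts from a predicted prior and shrinks it toward the observed scores, while the training-free variant depends on the observations alone.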

6. Adaptive Inference Workflow

Generate a small batch of inference samples.

Compute verifier scores and estimate the distribution.

Calculate the theoretical optimal sample count.

If the current sample count is below this number, continue sampling; otherwise stop.

This process yields early stopping for easy queries and allocates more compute to harder ones, reducing overall token consumption.
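
Putting the steps together, here is an end-to-end sketch of the loop described above. The functions generate_sample and verifier_score are hypothetical placeholders for the LLM call and the process reward model, and the stopping rule reuses the closed-form N from Section 3 with the plug-in fit from Section 4; the actual OptScale implementation may differ in its estimator and safeguards.

```python
import math

import numpy as np
from scipy.stats import truncnorm

def optimal_n(scores, tau: float, alpha: float) -> int:
    """Minimal N with Pr(best of N scores >= tau) >= alpha, under a [0, 1]-
    truncated normal fitted by plugging in the sample mean and std."""
    mu = float(np.mean(scores))
    sigma = max(float(np.std(scores, ddof=1)), 1e-3)
    a, b = (0.0 - mu) / sigma, (1.0 - mu) / sigma
    f_tau = truncnorm(a, b, loc=mu, scale=sigma).cdf(tau)
    if f_tau <= 0.0:
        return 1              # every sample is expected to clear tau
    if f_tau >= 1.0:
        return 10**9          # tau looks unreachable; the caller caps N anyway
    return math.ceil(math.log(1.0 - alpha) / math.log(f_tau))

def adaptive_inference(prompt, generate_sample, verifier_score,
                       tau=0.9, alpha=0.95, init_batch=4, max_samples=64):
    """Sample, score, re-estimate the score distribution, and stop as soon as
    the number of samples drawn reaches the current optimal-N estimate."""
    samples, scores = [], []
    while len(samples) < max_samples:
        sample = generate_sample(prompt)               # hypothetical LLM call
        samples.append(sample)
        scores.append(verifier_score(prompt, sample))  # hypothetical PRM call
        if len(scores) >= init_batch and len(scores) >= optimal_n(scores, tau, alpha):
            break
    best = int(np.argmax(scores))
    return samples[best], scores[best]
```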

7. Experimental Results

Across five mathematical reasoning benchmarks (MATH‑500, GSM8K, AIME 2024/2025, AMC 2023) and multiple model sizes, OptScale consistently achieves comparable or higher accuracy with 30‑55% fewer tokens than Best‑of‑N and Self‑Consistency baselines.

Key findings include:

Token savings increase with larger N (e.g., 38% at N=16, 54% at N=64).

OptScale⁰ excels on simpler tasks, while OptScaleᵗ performs better on very hard tasks.

In some cases OptScale surpasses the theoretical accuracy ceiling that Best-of-N would reach even with unlimited sampling.

8. Significance

OptScale provides the first theoretically grounded lower bound on the sampling needed for inference-time scaling, shifting the paradigm from empirical sampling to probabilistic optimality. It offers a universal framework applicable to any system that relies on parallel sampling, including emerging reasoning models such as o1 and DeepSeek-R1.

By integrating probability modeling, theoretical bounds, and practical algorithms, OptScale answers the critical question: "When should we stop sampling?"

Tags: Probabilistic Modeling, Inference Scaling, Optimal Stopping, Token Efficiency
Written by Data Party THU, the official platform of the Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
