AutoTTS Shows How AI Agents Can Outperform Human‑Designed Test‑Time Scaling Strategies

The paper “LLMs Improving LLMs” introduces AutoTTS, an environment in which a Claude-based explorer agent automatically searches for test-time scaling policies. The discovered policies cut token use by up to 69.5% and match or exceed hand-designed baselines on held-out benchmarks. The whole search cost $39.9 and took 160 minutes, with no LLM calls during evaluation.

PaperAgent

Background and Motivation

Andrej Karpathy announced his move to Anthropic to build a team focused on accelerating pre-training research with Claude. Shortly after, Google and Meta released a paper titled LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling, which demonstrates that an AI agent can autonomously discover better inference strategies than those manually crafted by humans.

Test‑Time Scaling (TTS) Problem

During inference, a system built on a large language model must decide how to allocate compute: run many parallel reasoning paths (width), extend each path further (depth), or dynamically prune low-quality paths based on intermediate results. Existing TTS methods such as Self-Consistency, ASC, ESC, and Parallel-Probe are all hand-tuned heuristics built on intuition.
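To make the width trade-off concrete, here is a minimal sketch of two hand-designed policies: plain majority voting over a fixed number of paths, and an adaptive variant that stops sampling once one answer dominates. The function names, `min_paths`, and `threshold` are illustrative assumptions, not the stopping rules used by ASC, ESC, or any other method named above.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and its share of the votes."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

def adaptive_vote(answers, min_paths=4, threshold=0.75):
    """Consume paths in order and stop once the leading answer's share
    reaches `threshold` (hypothetical rule, for illustration only)."""
    for used in range(min_paths, len(answers) + 1):
        answer, share = majority_vote(answers[:used])
        if share >= threshold:
            return answer, used
    return majority_vote(answers)[0], len(answers)

# Toy final answers from eight sampled reasoning paths.
sampled = ["42", "42", "17", "42", "42", "42", "9", "42"]
print(majority_vote(sampled))   # fixed width: uses all 8 paths
print(adaptive_vote(sampled))   # adaptive width: may stop early
```

On this toy input both policies return the same answer, but the adaptive one stops after four paths instead of eight. Each such rule fixes one specific way of spending compute, which is the limitation the next section discusses.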

Limitations of Human‑Tuned Strategies

Each of these heuristics occupies a single point in the width-depth control space, so most of that space goes unexplored. The authors argue that all of them can be expressed in one width-depth formulation, which suggests that a systematic search over the space could uncover better policies.

AutoTTS: Shifting the Burden from Strategy Design to Environment Design

AutoTTS replaces manual strategy design with an environment in which an AI explorer searches for good policies. For each problem, 128 inference paths are pre-computed offline. An explorer agent (Claude Code in the paper) then iteratively writes and tests controller code that decides when to branch, when to probe intermediate results, when to prune, and when to stop, using only the stored offline data. Because evaluation requires no LLM calls, each candidate policy can be assessed without any additional inference cost.
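The offline-replay idea can be sketched as follows. This is a simplified illustration, not the AutoTTS codebase: the `Path` and `Problem` records, the `fixed_width` controller, and the toy data are all assumptions, and the sketch only replays final answers and token counts, whereas the environment described above also exposes intermediate probes.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Path:
    answer: str   # final answer extracted from one stored reasoning path
    tokens: int   # tokens that path consumed when it was generated

@dataclass(frozen=True)
class Problem:
    paths: tuple  # pre-computed Path records (128 per problem in the paper)
    gold: str     # reference answer

def fixed_width(k):
    """Hypothetical controller: majority vote over the first k stored paths."""
    def controller(paths):
        chosen = paths[:k]
        answer = Counter(p.answer for p in chosen).most_common(1)[0][0]
        return answer, sum(p.tokens for p in chosen)
    return controller

def evaluate(controller, problems):
    """Replay a controller over stored paths; no model is ever called."""
    correct = 0
    tokens = 0
    for prob in problems:
        answer, used = controller(prob.paths)
        correct += answer == prob.gold
        tokens += used
    return correct / len(problems), tokens / len(problems)

toy = [
    Problem((Path("7", 100), Path("3", 120), Path("7", 90), Path("7", 110)), "7"),
    Problem((Path("5", 200), Path("2", 150), Path("2", 180), Path("2", 170)), "2"),
]
print(evaluate(fixed_width(1), toy))  # (accuracy, mean tokens) at width 1
print(evaluate(fixed_width(4), toy))  # (accuracy, mean tokens) at width 4
```

Because `evaluate` only reads stored records, an explorer can score many candidate controllers in a loop and compare their accuracy and token use directly.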

The framework collapses the high-dimensional hyper-parameter space to a single parameter β, from which all internal parameters are automatically derived. Searching over one dimension reduces the risk of over-fitting and simplifies exploration.

Experimental Results

Experiments were conducted on four Qwen‑3 models ranging from 0.6 B to 8 B parameters, using AIME24 as the search set and AIME25/HMMT25 as unseen test sets. Key findings include:

When β = 0.5, token consumption dropped by 69.5% while accuracy remained unchanged. The policy therefore used about 30.5% of the baseline tokens, reaching the same performance at less than one-third of the cost.

When β = 1.0, the discovered policy outperformed all human baselines in 5 of 8 test scenarios, showing that it can improve accuracy and not only reduce cost.

The discovered policies generalized to unseen questions and larger model scales, indicating they are not over‑fitted.

The entire discovery process cost $39.9 and 160 minutes, with zero LLM calls during the evaluation phase.

The best policy, named the Confidence Momentum Controller (CMC), tracks the trend in confidence with an exponential moving average, so that decisions are not triggered by transient confidence spikes. The mechanism was proposed by the explorer agent, not designed by hand, and it outperformed the hand-designed baselines.
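The smoothing idea behind such a controller can be illustrated with a short sketch. This is not the CMC implementation from the paper or its repository: the function names, `alpha`, `threshold`, and the probe values are all hypothetical, and the stopping rule is deliberately minimal.

```python
def ema(values, alpha=0.3):
    """Exponential moving average, seeded with the first value."""
    if not values:
        return []
    smoothed = []
    current = values[0]
    for v in values:
        current = alpha * v + (1 - alpha) * current
        smoothed.append(current)
    return smoothed

def stop_step(confidences, threshold=0.8, alpha=0.3, use_ema=True):
    """Return the first probe index where the (optionally smoothed)
    confidence reaches `threshold`, or None if it never does."""
    signal = ema(confidences, alpha) if use_ema else confidences
    for i, s in enumerate(signal):
        if s >= threshold:
            return i
    return None

# Toy confidence readings from successive probes; index 1 is a transient spike.
probes = [0.40, 0.95, 0.45, 0.50, 0.85, 0.90, 0.92, 0.95]
print(stop_step(probes, use_ema=False))  # raw signal reacts to the spike
print(stop_step(probes, use_ema=True))   # smoothed signal waits for a sustained rise
```

On this toy trace the raw rule stops at the spike (index 1), while the smoothed rule waits until confidence has stayed high for several probes (index 7). A smaller `alpha` smooths more aggressively at the price of reacting later.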

Implications

The work provides a proof‑of‑concept that AI agents can autonomously design inference strategies that surpass human‑crafted ones, effectively “AI‑improving‑AI.” It also signals a paradigm shift: human value moves from hand‑tuning strategies to constructing environments that enable AI‑driven search.

Paper title: LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
Paper link: https://arxiv.org/abs/2605.08083
GitHub: https://github.com/zhengkid/AutoTTS