AutoTTS Shows How AI Agents Can Outperform Human‑Designed Test‑Time Scaling Strategies
The paper “LLMs Improving LLMs” introduces AutoTTS, an environment in which a Claude‑based explorer agent automatically searches for test‑time scaling policies. The discovered policies save up to 69.5% of tokens at unchanged accuracy and generalize to unseen questions and model scales. The whole search cost $39.9 and 160 minutes and required no LLM calls during evaluation.
Background and Motivation
Andrej Karpathy announced his move to Anthropic to build a team focused on accelerating pre‑training research with Claude. Shortly after, Google and Meta released a paper titled “LLMs Improving LLMs: Agentic Discovery for Test‑Time Scaling”, which demonstrates that an AI agent can autonomously discover better inference strategies than those manually crafted by humans.
Test‑Time Scaling (TTS) Problem
During inference, large language models must decide how to allocate compute: run many parallel reasoning paths (width), deepen each path (depth), or dynamically prune low‑quality paths based on intermediate results. Existing TTS methods—Self‑Consistency, ASC, ESC, Parallel‑Probe—are all hand‑tuned heuristics based on intuition.
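As a point of reference, the simplest of these heuristics, Self‑Consistency, spends its whole budget on width: it samples a fixed number of complete reasoning paths and takes a majority vote over their final answers. The sketch below illustrates that idea only. The names `majority_vote`, `self_consistency`, and `sample_answer` are hypothetical, and the sampler is a stand-in for a real model call.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among the sampled paths."""
    if not answers:
        raise ValueError("need at least one answer to vote on")
    # Ties resolve to the answer that was seen first.
    return Counter(answers).most_common(1)[0][0]


def self_consistency(sample_answer, width):
    """Width-only policy: draw `width` complete paths, then vote.

    `sample_answer(i)` is a hypothetical callable that returns the final
    answer of the i-th sampled reasoning path.
    """
    answers = [sample_answer(i) for i in range(width)]
    return majority_vote(answers)


# Stand-in sampler: a fixed list of answers instead of a real model.
fake_answers = ["42", "17", "42", "42", "17"]
result = self_consistency(lambda i: fake_answers[i], width=5)
```

A policy like this never looks at intermediate results, so it cannot stop early or prune a weak path; that is the part of the control space the adaptive methods try to exploit.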
Limitations of Human‑Tuned Strategies
These heuristics explore only a few points in the width‑depth control space, leaving most of the space unexplored. The authors argue that the entire space can be unified under a single width‑depth formulation, suggesting that systematic search could uncover superior policies.
AutoTTS: Shifting the Burden from Strategy Design to Environment Design
AutoTTS replaces manual strategy design with an environment that enables an AI explorer to search for optimal policies. For each problem, 128 inference paths are pre‑computed offline. An explorer agent (Claude Code in the paper) iteratively tests different controller programs, each of which decides when to branch, when to probe intermediate results, when to prune, and when to stop, using only the stored offline data. Because evaluation requires no LLM calls, candidate policies can be assessed at no additional inference cost.
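The offline replay idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's environment: the `StoredPath` record, the two example controllers, and the `evaluate` function are all hypothetical, and real stored paths would also carry intermediate probes rather than only a final answer and a token count.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class StoredPath:
    tokens: int  # tokens this path consumed when it was generated offline
    answer: str  # final answer extracted from the path


def fixed_width_controller(width):
    """Hypothetical controller: always consume the first `width` paths."""
    def controller(paths):
        return paths[:width]
    return controller


def early_stop_controller(min_votes):
    """Hypothetical controller: stop once one answer has `min_votes` votes."""
    def controller(paths):
        counts, used = Counter(), []
        for path in paths:
            used.append(path)
            counts[path.answer] += 1
            if counts[path.answer] >= min_votes:
                break
        return used
    return controller


def evaluate(controller, dataset):
    """Replay a controller over stored paths; no model is called here.

    `dataset` is a list of (paths, gold_answer) pairs. Returns the
    accuracy and the total number of tokens the controller consumed.
    """
    correct, tokens = 0, 0
    for paths, gold in dataset:
        used = controller(paths)
        tokens += sum(p.tokens for p in used)
        vote = Counter(p.answer for p in used).most_common(1)[0][0]
        correct += int(vote == gold)
    return correct / len(dataset), tokens


# Toy data: one problem with three stored paths (the paper stores 128).
toy = [([StoredPath(100, "7"), StoredPath(120, "7"), StoredPath(90, "3")], "7")]
full = evaluate(fixed_width_controller(3), toy)
early = evaluate(early_stop_controller(2), toy)
```

Since every candidate controller is scored by reading stored records, an explorer can try many variants quickly; the only model cost is the one-time generation of the paths.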
The framework collapses the high‑dimensional hyper‑parameter space to a single parameter β, from which all internal parameters are automatically derived. This one‑dimensional search prevents over‑fitting and simplifies exploration.
Experimental Results
Experiments were conducted on four Qwen‑3 models ranging from 0.6 B to 8 B parameters, using AIME24 as the search set and AIME25/HMMT25 as unseen test sets. Key findings include:
When β = 0.5, token consumption dropped by 69.5% while accuracy remained unchanged, achieving the same performance at less than one‑third the cost.
When β = 1.0, the discovered policy outperformed all human baselines in 5 of 8 test scenarios, demonstrating genuine strength rather than mere cost savings.
The discovered policies generalized to unseen questions and larger model scales, indicating they are not over‑fitted.
The entire discovery process cost $39.9 and 160 minutes, with zero LLM calls during the evaluation phase.
The best policy, named the Confidence Momentum Controller (CMC), tracks confidence trends with an exponential moving average, so it avoids acting on transient confidence spikes. This mechanism was not designed by humans, yet it outperformed the hand‑designed baselines.
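To see why smoothing helps, consider the toy sketch below. It is not the CMC from the paper, only an illustration of the exponential moving average idea; the functions `ema` and `first_stop_step` and the values of `alpha` and `threshold` are assumptions chosen for the example.

```python
def ema(values, alpha):
    """Exponential moving average; the first value seeds the average."""
    smoothed, avg = [], None
    for v in values:
        avg = v if avg is None else alpha * v + (1 - alpha) * avg
        smoothed.append(avg)
    return smoothed


def first_stop_step(confidences, alpha=0.3, threshold=0.8):
    """Index of the first step whose smoothed confidence reaches the
    threshold, or None if it never does. Both defaults are hypothetical."""
    for i, s in enumerate(ema(confidences, alpha)):
        if s >= threshold:
            return i
    return None


# A single spike: the raw value 0.95 crosses 0.8, the smoothed one does not.
spike = [0.4, 0.95, 0.4, 0.4]
# A sustained rise: the smoothed value crosses 0.8 after several steps.
sustained = [0.4, 0.9, 0.9, 0.9, 0.9, 0.9]

stop_on_spike = first_stop_step(spike)
stop_on_sustained = first_stop_step(sustained)
```

With these numbers a raw threshold would stop at step 1 of the spike sequence, while the smoothed signal never triggers; on the sustained sequence it triggers only once the confidence has stayed high for several probes. The trade-off is a delay before stopping, which a smaller or larger `alpha` would lengthen or shorten.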
Implications
The work provides a proof‑of‑concept that AI agents can autonomously design inference strategies that surpass human‑crafted ones, effectively “AI‑improving‑AI.” It also signals a paradigm shift: human value moves from hand‑tuning strategies to constructing environments that enable AI‑driven search.
Paper title: LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
Paper link: https://arxiv.org/abs/2605.08083
GitHub: https://github.com/zhengkid/AutoTTS
