Can LLMs Learn While Being Tested? Inside the TTT-Discover Breakthrough
This article examines Test-Time Training to Discover (TTT-Discover), an approach that applies reinforcement learning during inference so a large language model keeps improving on a single test problem, and reports strong results across mathematics, GPU kernel optimization, algorithm design, and biology.
Background
Test‑time search prompts a frozen large language model (LLM) to generate many candidate solutions, stores past attempts in a buffer, and uses handcrafted heuristics to craft new prompts. This can improve performance, but the LLM itself does not learn.
Why learning beats search
Historical AI milestones such as AlphaGo and AlphaFold demonstrate that learning ultimately surpasses pure search for out‑of‑distribution tasks. Scientific discovery therefore benefits from methods that allow the model to adapt during testing.
Test‑Time Training (TTT‑Discover)
TTT‑Discover treats each test problem as a reinforcement‑learning (RL) environment and continuously trains the LLM on that specific problem. The objective is to obtain a single highest‑reward solution rather than maximizing expected reward across many tasks.
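In symbols, the shift in objective can be rendered schematically (notation assumed, not taken from the paper): standard RL maximizes the expected return of the policy, while discovery cares only about the single best trajectory found at test time.

```latex
% Standard RL: maximize expected return under the policy
\max_{\pi} \; \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \right]

% Discovery: maximize the best return among sampled trajectories
\max_{\pi} \; \max_{\tau \sim \pi} R(\tau)
```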
Key technical components
Entropy-based objective: An exponentially weighted entropy term drives the policy toward high-reward samples. The temperature β is adapted per initial state s, and a KL-divergence constraint stabilizes training. As β → ∞ the objective approaches a max operator.
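The exponential weighting can be illustrated with a small numerical sketch. The parameterization below (weights proportional to exp(β·R), normalized softmax-style) is an assumption chosen to match the stated limit: as β grows, the weight mass concentrates on the highest-return samples.

```python
import numpy as np

def exp_weights(returns, beta):
    """Exponentially weighted sample weights (illustrative sketch).

    With weights proportional to exp(beta * R), mass concentrates on
    the highest-return samples as beta grows, approaching a hard
    max-operator in the limit beta -> infinity.
    """
    r = np.asarray(returns, dtype=float)
    z = beta * (r - r.max())  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

low_beta = exp_weights([1.0, 2.0, 3.0], beta=0.1)    # nearly uniform
high_beta = exp_weights([1.0, 2.0, 3.0], beta=50.0)  # nearly one-hot on the max
```

In a policy-gradient update, such weights would upweight gradients from high-reward samples; the per-state adaptation of β and the KL constraint from the paper are not reproduced here.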
PUCT-inspired state reuse: A scoring function Q(s) selects initial states using the maximum return among a state's children (or R(s) if the state has not yet been expanded). This focuses search on the most promising trajectories.
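A minimal sketch of this selection rule follows. Q(s) is the maximum return among a state's children, falling back to the state's own return when it is unexpanded, as described above; the exploration bonus and the constant c are assumptions borrowed from the standard PUCT formula, and the bookkeeping fields are hypothetical.

```python
import math

def q_value(child_returns, own_return):
    """Q(s): max return among child states, or R(s) if unexpanded."""
    return max(child_returns) if child_returns else own_return

def puct_score(q, visits, parent_visits, c=1.0):
    """PUCT-style score: exploitation term plus an exploration bonus."""
    return q + c * math.sqrt(parent_visits) / (1 + visits)

def select_initial_state(states):
    """Pick the initial state with the highest PUCT-style score.

    `states` is a list of dicts with (hypothetical) keys:
    "child_returns", "own_return", "visits".
    """
    parent_visits = sum(s["visits"] for s in states) or 1
    return max(
        states,
        key=lambda s: puct_score(
            q_value(s["child_returns"], s["own_return"]),
            s["visits"],
            parent_visits,
        ),
    )
```

The exploration bonus means a rarely revisited but promising state can still be selected, which is the usual motivation for PUCT-style rules.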
Implementation
Any standard RL algorithm (e.g., PPO, GRPO) can serve as the backbone, but the vanilla expected-reward objective is replaced, since scientific discovery calls for maximizing the single best reward rather than the average. The method was implemented with the open-source model gpt-oss-120b accessed via the Thinking Machines Tinker API.
Resources:
Paper: https://www.alphaxiv.org/abs/2601.16175
GitHub repository: https://github.com/test-time-training/discover
Results
Evaluated on four distinct domains:
Mathematics: On the Erdős minimum overlap problem (lower is better), TTT-Discover achieved a new best score of 0.380876, improving on the previous human (0.380927) and AlphaEvolve (0.380924) results.
GPU kernel optimization: On the GPUMode TriMul benchmark, the discovered kernel ran up to 50% faster on A100 GPUs and achieved >15% speedups across all GPU types compared with the best human-submitted kernel.
Algorithm design (AtCoder): The approach outperformed both the strongest AI-generated solutions and the top human solutions.
Single-cell biology denoising: Achieved state-of-the-art performance on a single-cell data denoising task.
Limitations
TTT‑Discover currently works only on tasks with dense, continuous rewards. Extending it to sparse or binary reward settings—such as formal mathematical proofs, hypothesis generation, or reasoning in physics and biology—remains an open challenge.
Conclusion
By integrating reinforcement learning directly into the inference phase, TTT‑Discover demonstrates that continual learning at test time can unlock higher performance in AI‑driven scientific discovery.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.