Can LLMs Learn While Being Tested? Inside the TTT-Discover Breakthrough

This article examines Test‑Time Training to Discover (TTT‑Discover), an approach that applies reinforcement learning during inference so a large language model can keep improving on a single test problem. It reports strong results across mathematics, GPU kernel optimization, algorithm design, and biology.


Background

Test‑time search prompts a frozen large language model (LLM) to generate many candidate solutions, stores past attempts in a buffer, and uses handcrafted heuristics to construct new prompts from that buffer. This can improve performance, but the model's weights never change: the LLM itself does not learn.
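
For contrast, here is a minimal sketch of that frozen-model search loop. The helper names `build_prompt`, `llm_generate`, and `score` are hypothetical stand-ins for the handcrafted prompt heuristics, the frozen LLM call, and the task's reward function:

```python
def test_time_search(problem, llm_generate, score, build_prompt, budget=100):
    """Frozen-model test-time search: the model's weights never change;
    only the prompt is re-crafted from a buffer of past attempts."""
    buffer = []                                  # (candidate, reward) history
    best_candidate, best_reward = None, float("-inf")
    for _ in range(budget):
        prompt = build_prompt(problem, buffer)   # handcrafted heuristics
        candidate = llm_generate(prompt)         # frozen LLM, no learning
        reward = score(problem, candidate)
        buffer.append((candidate, reward))
        if reward > best_reward:
            best_candidate, best_reward = candidate, reward
    return best_candidate, best_reward
```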

Why learning beats search

Historical AI milestones such as AlphaGo and AlphaFold demonstrate that learning ultimately surpasses pure search for out‑of‑distribution tasks. Scientific discovery therefore benefits from methods that allow the model to adapt during testing.

Test‑Time Training (TTT‑Discover)

TTT‑Discover treats each test problem as a reinforcement‑learning (RL) environment and continuously trains the LLM on that specific problem. The objective is to obtain a single highest‑reward solution rather than maximizing expected reward across many tasks.
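
As a rough sketch (assuming a single-step episode per attempt and a hypothetical `evaluate` scoring function), one test problem cast as its own RL environment might look like this, where the quantity of interest is the best reward ever seen rather than the average:

```python
class SingleProblemEnv:
    """One test problem treated as an RL environment: each episode is a
    single solution attempt, and the reward is that attempt's score."""

    def __init__(self, problem, evaluate):
        self.problem = problem
        self.evaluate = evaluate                  # hypothetical task scorer
        self.best_reward = float("-inf")          # what TTT-Discover optimizes

    def step(self, candidate_solution):
        reward = self.evaluate(self.problem, candidate_solution)
        self.best_reward = max(self.best_reward, reward)
        return reward, True                       # one attempt, episode done
```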

Key technical components

Entropy‑based objective: An exponentially weighted entropy term drives the policy toward high‑reward samples. The temperature β is adapted per initial state s, and a KL‑divergence constraint stabilizes training. As β → ∞, the objective approaches a hard max over rewards.
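
The article gives no formula, but a standard objective matching this description is the log-sum-exp (soft maximum) of the return with a per-state temperature β(s) and a KL trust region; this is a plausible reconstruction, not necessarily the paper's exact equation:

```latex
% Exponentially weighted (soft-max) objective with per-state temperature:
\[
  J_{\beta}(\pi; s) \;=\; \frac{1}{\beta(s)}
    \log \, \mathbb{E}_{\tau \sim \pi(\cdot \mid s)}
    \bigl[\, e^{\beta(s)\, R(\tau)} \,\bigr]
  \quad \text{s.t.} \quad
  \mathrm{KL}\bigl(\pi(\cdot \mid s) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid s)\bigr) \le \epsilon .
\]
% As beta(s) -> 0 this recovers the expected return E[R(tau)];
% as beta(s) -> infinity it converges to the hard maximum max_tau R(tau),
% i.e., the objective concentrates on the single best trajectory.
```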

PUCT‑inspired state reuse: A scoring function Q(s) selects which initial state to continue from, using the maximum return among a state's children (or R(s) itself if the state has not yet been expanded). This focuses search on the most promising trajectories.
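
A minimal sketch of such a selection rule is below. The exploration bonus uses the standard PUCT form; the bookkeeping structures and the exact bonus are assumptions, not the paper's specification:

```python
import math

def select_initial_state(states, children, visits, priors, rewards, c_puct=1.0):
    """Pick the initial state with the highest PUCT-style score:
    Q(s) is the maximum return among s's children, or its own reward
    R(s) if it has not been expanded yet."""
    total_visits = sum(visits[s] for s in states) + 1

    def q(s):
        if children[s]:                        # expanded: best child return
            return max(rewards[c] for c in children[s])
        return rewards[s]                      # unseen: fall back to R(s)

    def puct_score(s):                         # standard PUCT exploration bonus
        return q(s) + c_puct * priors[s] * math.sqrt(total_visits) / (1 + visits[s])

    return max(states, key=puct_score)
```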

Implementation

The method is compatible with standard RL algorithms such as PPO and GRPO, but vanilla PPO is avoided because it optimizes expected reward rather than the maximum reward needed for scientific discovery. It was implemented with the open‑source model gpt‑oss‑120b, accessed via the Thinking Machines Tinker API.
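
Putting the pieces together, a highly simplified version of the per-problem training loop might look like the sketch below. All function names are hypothetical stand-ins; the actual implementation runs a full RL algorithm (e.g., a GRPO-style update) through the Tinker API:

```python
import math

def ttt_discover_loop(env, policy, sample_rollouts, gradient_step,
                      beta=5.0, group_size=8, iterations=200):
    """Simplified single-problem training loop. `sample_rollouts` and
    `gradient_step` are hypothetical stand-ins for rollout generation
    and a KL-constrained policy-gradient update."""
    best_reward, best_solution = float("-inf"), None
    for _ in range(iterations):
        # Each rollout is a (solution, reward) pair for this one problem.
        rollouts = sample_rollouts(policy, env, group_size)
        rewards = [r for _, r in rollouts]

        # Exponential weighting concentrates the update on the
        # highest-reward samples; as beta grows it approaches a hard max
        # (matching the objective sketched above).
        top = max(rewards)
        weights = [math.exp(beta * (r - top)) for r in rewards]  # stable exp
        total = sum(weights)
        norm_weights = [w / total for w in weights]

        gradient_step(policy, rollouts, norm_weights)

        if top > best_reward:
            best_reward = top
            best_solution = rollouts[rewards.index(top)][0]
    return best_solution, best_reward
```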

Resources:

Paper: https://www.alphaxiv.org/abs/2601.16175

GitHub repository: https://github.com/test-time-training/discover

Results

TTT‑Discover was evaluated on four distinct domains:

Mathematics: On the Erdős minimum overlap problem, TTT‑Discover achieved a new best score of 0.380876 (lower is better), improving on the previous human result (0.380927) and AlphaEvolve's result (0.380924).

GPU kernel optimization: On the GPUMode TriMul benchmark, the discovered kernel ran up to 50% faster on A100 GPUs and achieved speedups of more than 15% across all GPU types compared with the best human‑submitted kernel.

Algorithm design (AtCoder): The approach outperformed both the strongest AI‑generated solutions and the top human solutions.

Single‑cell biology: Achieved state‑of‑the‑art performance on a single‑cell data denoising task.

Limitations

TTT‑Discover currently works only on tasks with dense, continuous rewards. Extending it to sparse or binary reward settings—such as formal mathematical proofs, hypothesis generation, or reasoning in physics and biology—remains an open challenge.

Conclusion

By integrating reinforcement learning directly into the inference phase, TTT‑Discover demonstrates that continual learning at test time can unlock higher performance in AI‑driven scientific discovery.

Figure: Illustration of the test‑time training concept.
Figure: Diagram of the TTT‑Discover workflow.
Tags: LLM, reinforcement learning, AI research, scientific discovery, Test-Time Training, TTT-Discover
Written by Data Party THU, the official platform of the Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
