Open‑Source Models Dominate 21 Scientific Discovery Tasks with SimpleTES

The SimpleTES framework decomposes trial‑and‑error into three scalable dimensions—Concurrency, Length, and Candidates—enabling test‑time scaling that lets open‑source models outperform closed‑source rivals across 21 diverse scientific benchmarks, from LASSO regression to quantum circuit compilation.

Machine Heart

Faced with a limited budget for scientific discovery, the authors contrast two strategies: allocating all resources to a single powerful model (e.g., OpenAI o1 or DeepSeek) versus constructing an "idea laboratory" that runs dozens or hundreds of hypothesis experiments in parallel, allowing the most promising solutions to emerge.

The core contribution is the SimpleTES framework, which formalizes trial-and-error as a three-dimensional, schedulable process. The dimensions are C (Concurrency), the number of parallel trajectories; L (Length), the depth each trajectory explores; and K (Candidates), the number of candidates generated at each step. By dynamically allocating compute across these axes, SimpleTES shifts the bottleneck from model capacity to efficient search cost distribution.
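The C/L/K decomposition can be made concrete with a minimal sketch. This is an illustrative interpretation, not the paper's implementation: `propose` and `evaluate` stand in for whatever candidate generator and evaluator a given task uses, and all names are assumptions.

```python
def simple_tes(propose, evaluate, seed_solution, C=4, L=10, K=3):
    """Illustrative sketch of a C/L/K trial-and-error search loop.

    C (Concurrency): number of parallel trajectories.
    L (Length):      refinement steps per trajectory.
    K (Candidates):  candidates proposed at each step.
    """
    trajectories = [seed_solution] * C
    for _ in range(L):                      # L: depth of iterative refinement
        refined = []
        for sol in trajectories:            # C: independent parallel rollouts
            candidates = [propose(sol) for _ in range(K)]  # K candidates per step
            best = max(candidates + [sol], key=evaluate)   # evaluator steers the step
            refined.append(best)
        trajectories = refined
    return max(trajectories, key=evaluate)  # best final outcome across rollouts
```

In this framing, a fixed compute budget of roughly C x L x K evaluator calls can be re-balanced between breadth (C, K) and depth (L), which is the scheduling decision the article describes.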

SimpleTES introduces a test‑time scaling loop where an evaluator (scoring function, simulator, or verifier) guides iterative refinement of each trajectory. Instead of optimizing per‑step rewards, the method optimizes the final best outcome of an entire rollout, retaining only the top R% of trajectories and using a replay buffer to accumulate experience. This trajectory‑level post‑training turns the evaluator into a direction controller rather than a simple scorer.
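The trajectory-level selection described above, scoring a rollout by its final best outcome rather than per-step rewards, then keeping the top R% and banking them in a replay buffer, can be sketched as follows. The function and parameter names are assumptions for illustration, not the paper's API.

```python
def select_and_replay(trajectories, evaluate, replay_buffer, top_r=0.25):
    """Illustrative sketch: rank whole rollouts by the best solution each
    ever reached, keep the top R%, and accumulate them for post-training."""
    # Score each trajectory by its best-ever state, not a per-step reward sum.
    ranked = sorted(
        trajectories,
        key=lambda traj: max(evaluate(state) for state in traj),
        reverse=True,
    )
    n_keep = max(1, int(len(ranked) * top_r))   # top R% survive
    survivors = ranked[:n_keep]
    replay_buffer.extend(survivors)             # experience accumulates across rounds
    return survivors
```

Scoring the whole rollout rather than each step is what lets the evaluator act as a direction controller: a trajectory that dips before reaching a strong final solution is still retained.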

Empirical results demonstrate that, with settings C=32, L=100, K=16, SimpleTES achieves state‑of‑the‑art performance on 21 scientific tasks using only open‑source models (e.g., gpt‑oss). Highlights include:

LASSO path solving : SimpleTES matches glmnet accuracy (error ≤1e‑6) while being 2.17× faster than glmnet and 14× faster than sklearn.

AtCoder programming contests : The system discovers novel multi‑start simulated‑annealing strategies that surpass all human players and existing AI solutions.
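For readers unfamiliar with the strategy named in the AtCoder result, the following is a generic sketch of multi-start simulated annealing, restarting an annealing run from several random initial points and keeping the global best. It is a textbook version under assumed names, not the specific strategy SimpleTES discovered.

```python
import math
import random

def multi_start_sa(objective, neighbor, init, starts=8, iters=500,
                   t0=1.0, t_end=0.01):
    """Generic multi-start simulated annealing (maximization).

    objective: score to maximize; neighbor: random local move;
    init: draws a fresh random starting solution for each restart.
    """
    best_x, best_score = None, float("-inf")
    for _ in range(starts):                 # multi-start: independent restarts
        x = init()
        score = objective(x)
        for i in range(iters):
            t = t0 * (t_end / t0) ** (i / iters)   # geometric cooling schedule
            y = neighbor(x)
            s = objective(y)
            # Always accept improvements; accept worse moves with
            # Boltzmann probability exp((s - score) / t).
            if s >= score or random.random() < math.exp((s - score) / t):
                x, score = y, s
            if score > best_score:          # track the global best across restarts
                best_x, best_score = x, score
    return best_x, best_score
```

The appeal of the multi-start variant in contest settings is that restarts hedge against a single annealing run getting trapped in a poor basin.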

Quantum circuit routing : On superconducting and IBM Q20 platforms, SimpleTES reduces SWAP overhead by 21.7% versus SABRE and 14.9% versus LightSABRE, and cuts execution time by up to 33.2% on neutral‑atom architectures.

Erdős minimal‑overlap problem : SimpleTES pushes the best known overlap from 0.38087 to 0.380868, a non‑trivial improvement in a highly sensitive optimization landscape.

The authors note limitations: the approach relies on fast, reliable evaluators, making it less effective for tasks with expensive or noisy feedback; the three scaling dimensions are currently manually tuned; and discrete‑feedback domains (e.g., theorem proving) may suffer from ambiguous scoring signals.

Overall, SimpleTES showcases how scaling the evaluation‑driven trial‑and‑error loop can transform open‑source models into powerful scientific discovery engines, suggesting a future where AI systems not only reason deeply but also explore efficiently.

Tags: open-source models · AI for Science · test-time scaling · scientific discovery · evaluation-driven search · SimpleTES · trajectory optimization
Written by

Machine Heart

Professional AI media and industry service platform
