100k‑Token Natural‑Language Reasoning Enables a 30B‑A3B Model to Reach Olympiad Gold Level
A 30B‑A3B model, trained with reverse‑perplexity supervised fine‑tuning and two‑stage reinforcement learning and run with a multi‑round generate‑verify‑revise inference loop, achieves gold‑medal performance on the IMO, USAMO, and IPhO using more than 100k tokens of natural‑language reasoning and no external tools.
The Shanghai AI Lab released a technical report showing that a 30B‑A3B reasoning backbone, when enhanced with unified training and inference extensions, can solve high‑difficulty Olympiad math and physics problems using natural‑language reasoning alone.
Training pipeline. The team first performed reverse‑perplexity curriculum supervised fine‑tuning on ~338,000 high‑quality reasoning trajectories, incorporating self‑verification and self‑correction samples. Samples were presented from high to low model perplexity, encouraging the model to first tackle proofs that differed most from its current policy.
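The ordering step can be illustrated with a minimal sketch. It assumes per-token log-probabilities under the current model are already available for each trajectory, and that perplexity is the usual exponential of the mean negative log-likelihood; the function names and toy data are hypothetical and not taken from the report.

```python
import math
from typing import List, Sequence, Tuple


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity as exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def reverse_perplexity_order(
    samples: List[Tuple[str, Sequence[float]]]
) -> List[str]:
    """Return sample ids sorted from highest to lowest perplexity."""
    scored = [(perplexity(logprobs), sample_id) for sample_id, logprobs in samples]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sample_id for _, sample_id in scored]


# Hypothetical toy data: (sample id, per-token log-probs under the current model).
samples = [
    ("familiar", [-0.1, -0.2, -0.1]),
    ("unusual", [-2.0, -1.5, -2.5]),
    ("medium", [-0.8, -1.0, -0.6]),
]
order = reverse_perplexity_order(samples)
print(order)
```

In this toy example the trajectory the model finds least likely ("unusual") is scheduled first, matching the high-to-low presentation described above.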
Two‑stage reinforcement learning. Stage 1 used verifiable questions with reliable reward signals to boost direct solving ability. Stage 2 shifted the reward focus from answer correctness to proof completeness, introducing a proof‑quality reward model, self‑correction tasks, and experience replay to retain rare high‑value proof traces.
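A rough sketch of how the two reward regimes and the replay buffer might fit together is shown below. The summary does not give the report's actual reward formula, so everything here is an assumption: the `Rollout` fields, a proof-quality score in [0, 1], the replay threshold, and the first-in-first-out eviction are all hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rollout:
    answer_correct: bool  # outcome of an automatic answer check (stage 1 signal)
    proof_quality: float  # hypothetical reward-model score, assumed to lie in [0, 1]


def stage1_reward(r: Rollout) -> float:
    """Stage 1 (assumed form): binary reward for a verifiably correct answer."""
    return 1.0 if r.answer_correct else 0.0


def stage2_reward(r: Rollout) -> float:
    """Stage 2 (assumed form): reward follows proof quality, clipped to [0, 1]."""
    return min(1.0, max(0.0, r.proof_quality))


@dataclass
class ReplayBuffer:
    threshold: float = 0.9  # hypothetical cutoff for a "high-value" proof trace
    capacity: int = 1000
    items: List[Rollout] = field(default_factory=list)

    def maybe_add(self, r: Rollout) -> bool:
        """Keep only rollouts whose stage 2 reward reaches the threshold."""
        if stage2_reward(r) < self.threshold:
            return False
        self.items.append(r)
        if len(self.items) > self.capacity:
            self.items.pop(0)  # evict the oldest trace
        return True

    def sample(self, k: int, rng: random.Random) -> List[Rollout]:
        """Draw up to k stored traces to mix into a training batch."""
        return rng.sample(self.items, min(k, len(self.items)))


buffer = ReplayBuffer(threshold=0.9, capacity=2)
lucky = Rollout(answer_correct=True, proof_quality=0.2)
rigorous = Rollout(answer_correct=True, proof_quality=0.95)
kept = [buffer.maybe_add(r) for r in (lucky, rigorous)]
```

Both toy rollouts earn the full stage 1 reward, but only the rigorous one scores well in stage 2 and is retained for replay, which is the shift in emphasis the paragraph describes.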
Inference‑time expansion. During problem solving, the model does not produce a single answer; instead it iterates through a "generate‑candidate‑answer → check‑complete‑proof → locate‑issues → revise‑answer" loop, performing verification and correction entirely in natural language.
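The loop can be sketched as plain control flow. In the sketch below the three callables stand in for natural-language model calls; their signatures, the `max_rounds` cap, and the stub implementations are assumptions for illustration, not the report's interface.

```python
from typing import Callable, List, Tuple

Generate = Callable[[str], str]
Verify = Callable[[str, str], List[str]]       # returns issues; empty list means accepted
Revise = Callable[[str, str, List[str]], str]


def generate_verify_revise(
    problem: str,
    generate: Generate,
    verify: Verify,
    revise: Revise,
    max_rounds: int = 4,
) -> Tuple[str, int, bool]:
    """Return (final candidate, verification rounds used, whether it was accepted)."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    candidate = generate(problem)
    for round_idx in range(1, max_rounds + 1):
        issues = verify(problem, candidate)   # check the complete proof, locate issues
        if not issues:
            return candidate, round_idx, True
        if round_idx < max_rounds:
            candidate = revise(problem, candidate, issues)
    return candidate, max_rounds, False


# Stub callables standing in for model calls (hypothetical behavior).
def toy_generate(problem: str) -> str:
    return "draft"


def toy_verify(problem: str, candidate: str) -> List[str]:
    return [] if "fixed" in candidate else ["gap in step 2"]


def toy_revise(problem: str, candidate: str, issues: List[str]) -> str:
    return candidate + " fixed"


result = generate_verify_revise("toy problem", toy_generate, toy_verify, toy_revise)
```

Capping the number of rounds keeps the token budget bounded; when no candidate passes verification, the sketch returns the last revision flagged as not accepted rather than silently presenting it as a proof.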
The resulting model, named SU‑01, had median generation lengths on USAMO 2026 inference trajectories of about 106k tokens initially and 83k tokens after self‑correction, suggesting that long reasoning budgets can be converted reliably into proof search and self‑verification.
In competition‑style evaluation, SU‑01 scored 35 points on both IMO 2025 and USAMO 2026, meeting the gold‑medal threshold in each (gold/silver/bronze cutoffs of 35/28/19 for IMO and 25/18/11 for USAMO). Notably, on USAMO 2026 problem 3, where the human average score was 0.01 and no contestant exceeded 5 points, the model obtained a perfect score.
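As a small worked check of the cutoffs quoted above, the following sketch maps a score to a medal tier. The threshold values are the ones given in this article; the dictionary layout and function name are hypothetical.

```python
from typing import Dict, Tuple

# (gold, silver, bronze) cutoffs as quoted in this article.
THRESHOLDS: Dict[str, Tuple[int, int, int]] = {
    "IMO 2025": (35, 28, 19),
    "USAMO 2026": (25, 18, 11),
}


def medal(contest: str, score: int) -> str:
    """Classify a score against the contest's gold/silver/bronze cutoffs."""
    gold, silver, bronze = THRESHOLDS[contest]
    if score >= gold:
        return "gold"
    if score >= silver:
        return "silver"
    if score >= bronze:
        return "bronze"
    return "none"


su01_results = {contest: medal(contest, 35) for contest in THRESHOLDS}
```

A score of 35 sits exactly on the IMO gold cutoff and 10 points above the USAMO one.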
On IMO‑ProofBench, direct generation yielded 57.6% accuracy, which rose to 70.2% after inference expansion (a gain of 12.6 percentage points), surpassing other models of similar size and approaching the 72.6% of Gemini 3.1 Pro Thinking.
Beyond Olympiad benchmarks, SU‑01 performed best among same‑size models on the FrontierScience‑Research suite, indicating potential for broader scientific problem solving.
Detailed case studies show the model employing unconventional strategies, such as using complex numbers to solve a geometry problem (USAMO 2026 P3) and reducing a two‑circle geometry problem to coordinate calculations (IMO 2025 P2), as well as dynamic‑programming and number‑theoretic techniques on other USAMO problems.
In physics, SU‑01 exceeded the gold‑medal line on IPhO 2024/2025, with further gains after inference expansion.
The authors argue that the key to Olympiad‑level scientific reasoning is not model scale but the ability to convert extensive reasoning budgets into stable proof search, verification, and repair mechanisms.
Overall, the work demonstrates a more efficient route to scientific reasoning systems: starting from an existing reasoning model, shaping rigorous reasoning behavior, designing proof‑level rewards, and closing the generate‑verify‑revise loop at inference to turn limited computational budgets into verifiable proof capabilities.
