From <10% to 70%: How Google’s Iterative Proof Framework Cracked the Putnam Competition
Google’s LEAP framework transforms generic LLMs into an agentic system that iteratively builds proof blueprints, boosting formal theorem‑proving success on Lean‑IMO‑Bench from under 10% to 70% and achieving a perfect 12‑out‑of‑12 score in the 2025 Putnam Competition.
Problem
Generic large language models have a native single‑shot success rate below 10 % on formal theorem proving in Lean, which has pushed the community toward heavily fine‑tuned specialized provers.
LEAP framework
LEAP (Supercharging LLMs for Formal Mathematics with Agentic Frameworks) is an agentic system that requires no task‑specific fine‑tuning. Its workflow is:
Attempt direct formalization of the target theorem.
If the attempt fails, generate an informal “blueprint” that decomposes the theorem into a sequence of supporting lemmas.
Organize these lemmas and their dependencies in an AND‑OR DAG, which serves both as a progress tracker and as a planner for future informal‑formal steps.
Iteratively refine the DAG, prompting the foundation model to prove individual lemmas and to update the graph until the full proof is completed.
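The workflow above can be sketched in a few lines of Python. This is a minimal illustration of an AND‑OR DAG used as a progress tracker, not the authors' implementation: the names `Node`, `AndOrDag`, `run`, and the `try_prove` callback (a stand‑in for a foundation‑model call checked by the Lean verifier) are all hypothetical. It also assumes, as a simplification, that a node counts as solved once every lemma in one of its alternative decompositions is proved.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    name: str
    # OR over options; each option is an AND over the lemma names it lists.
    options: List[List[str]] = field(default_factory=list)
    proved_directly: bool = False


class AndOrDag:
    """Hypothetical progress tracker; assumes the graph is acyclic."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}

    def add(self, name: str, options: List[List[str]] = None) -> None:
        self.nodes[name] = Node(name, options or [])

    def is_solved(self, name: str) -> bool:
        node = self.nodes[name]
        if node.proved_directly:
            return True
        # Solved if at least one non-empty option has all its lemmas solved.
        return any(
            len(opt) > 0 and all(self.is_solved(child) for child in opt)
            for opt in node.options
        )

    def open_leaves(self) -> List[str]:
        # Lemmas with no decomposition that are still unproved.
        return sorted(
            n.name
            for n in self.nodes.values()
            if not n.options and not n.proved_directly
        )


def run(dag: AndOrDag, root: str, try_prove: Callable[[str], bool],
        max_rounds: int = 10) -> bool:
    """Attempt open lemmas round by round until the root is solved or stuck."""
    for _ in range(max_rounds):
        if dag.is_solved(root):
            return True
        progress = False
        for name in dag.open_leaves():
            if try_prove(name):  # stand-in for model call + verifier check
                dag.nodes[name].proved_directly = True
                progress = True
        if not progress:
            return False  # a real agent would revise the blueprint here
    return dag.is_solved(root)


if __name__ == "__main__":
    dag = AndOrDag()
    # Theorem T follows from (L1 and L2) or, alternatively, from L3 alone.
    dag.add("T", [["L1", "L2"], ["L3"]])
    for lemma in ("L1", "L2", "L3"):
        dag.add(lemma)
    print(run(dag, "T", lambda name: name == "L3"))
```

In this sketch the loop simply gives up when no open lemma can be proved. In the workflow described above, that is the point at which the blueprint would be refined and the graph updated with new lemmas or alternative decompositions.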
Lean‑IMO‑Bench
To provide a challenging evaluation set, the authors created Lean‑IMO‑Bench, a benchmark of 60 International Mathematical Olympiad problems formally encoded in Lean. The problems span algebra, combinatorics, number theory, and geometry, and each has been hand‑formalized and independently verified.
Experimental results
On Lean‑IMO‑Bench, LEAP raises a generic foundation model's success rate from below 10 % in native single‑shot mode to 70 %, surpassing the previously leading specialized system, Aristotle.
In the 2025 Putnam Competition, conventional single‑shot models achieve a score of 0, whereas LEAP solves all 12 problems, attaining a perfect 100 % score.
Analysis
The authors argue that the low success of prior approaches stems not from insufficient mathematical ability of the models but from the absence of structured interaction with the verifier. By maintaining an evolving AND‑OR DAG, LEAP provides explicit dependency tracking and anticipatory lemma planning, allowing the model to break down complex proofs into tractable sub‑goals.
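To make the idea of breaking a proof into tractable sub‑goals concrete, here is a toy Lean 4 sketch, assuming Mathlib is available. The statement is a textbook inequality chosen for illustration; it is not taken from the paper or from Lean‑IMO‑Bench, and the lemma names are hypothetical. The two supporting lemmas play the role of blueprint nodes that the verifier can check one at a time, and the final theorem assembles them.

```lean
import Mathlib

namespace LeapSketch

-- Supporting lemma 1: a real square is nonnegative.
theorem sq_diff_nonneg (a b : ℝ) : 0 ≤ (a - b) ^ 2 :=
  sq_nonneg (a - b)

-- Supporting lemma 2: expand the square.
theorem sq_diff_expand (a b : ℝ) :
    (a - b) ^ 2 = a ^ 2 + b ^ 2 - 2 * a * b := by
  ring

-- Target theorem, assembled from the two lemmas above.
theorem two_mul_le (a b : ℝ) : 2 * a * b ≤ a ^ 2 + b ^ 2 := by
  have h := sq_diff_nonneg a b
  rw [sq_diff_expand a b] at h
  linarith

end LeapSketch
```

Each lemma here is small enough that a failed attempt yields a localized verifier error, which is the kind of structured feedback the analysis above points to.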
Implications
These findings demonstrate that generic foundation models, when equipped with appropriate agentic scaffolding, can achieve state‑of‑the‑art performance in highly specialized formal domains without extensive fine‑tuning.
Future directions
Planned work includes exploring hybrid architectures that combine the high‑level reasoning of foundation models with fine‑tuned specialist modules for precise step generation.
Paper: https://arxiv.org/abs/2606.03303
Source: ScienceAI
Using large models to break a mathematical proof down into a construction blueprint that can be iterated on.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
