From <10% to 70%: How Google’s Iterative Proof Framework Cracked the Putnam Competition

Google’s LEAP framework transforms generic LLMs into an agentic system that iteratively builds proof blueprints, boosting formal theorem‑proving success on Lean‑IMO‑Bench from under 10% to 70% and achieving a perfect 12‑out‑of‑12 score in the 2025 Putnam Competition.

Data Party THU

Problem

Generic large language models solve fewer than 10% of formal theorem‑proving problems in Lean in a single shot, which has pushed the community toward heavily fine‑tuned, specialized provers.

LEAP framework

LEAP (Supercharging LLMs for Formal Mathematics with Agentic Frameworks) is an agentic system that requires no task‑specific fine‑tuning. Its workflow is:

Attempt a direct formal proof of the target theorem.

If the attempt fails, generate an informal “blueprint” that decomposes the theorem into a sequence of supporting lemmas.

Organize these lemmas and their dependencies in an AND‑OR DAG, which serves both as a progress tracker and as a planner for future informal‑formal steps.

Iteratively refine the DAG, prompting the foundation model to prove individual lemmas and to update the graph until the full proof is completed.
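
The workflow above can be pictured with a small Python sketch. This is not the authors' implementation: the names `ProofDag`, `refine`, and `try_prove` are hypothetical, and `try_prove` stands in for "ask the foundation model for a proof and run it through the Lean checker". The sketch also assumes, as a simplification, that an AND node counts as closed once all of its children are closed, and it omits the step where the model rewrites the graph after a failure.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Kind(Enum):
    AND = "and"  # closed only when every child is closed
    OR = "or"    # closed when any one alternative is closed


@dataclass
class Node:
    name: str
    kind: Kind = Kind.AND
    children: list[str] = field(default_factory=list)
    verified: bool = False  # True once the checker accepts a direct proof


class ProofDag:
    """Hypothetical AND-OR DAG used as a progress tracker and planner."""

    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}

    def add(self, name: str, kind: Kind = Kind.AND,
            children: tuple[str, ...] = ()) -> None:
        self.nodes[name] = Node(name, kind, list(children))

    def is_proved(self, name: str) -> bool:
        node = self.nodes[name]
        if node.verified:
            return True
        if not node.children:
            return False  # an unverified leaf is still open
        results = (self.is_proved(child) for child in node.children)
        return all(results) if node.kind is Kind.AND else any(results)

    def open_leaves(self, root: str) -> list[str]:
        """Leaves under `root` that still need a proof, in visit order."""
        seen: set[str] = set()
        frontier: list[str] = []

        def visit(name: str) -> None:
            if name in seen or self.is_proved(name):
                return  # shared or already closed sub-goals are skipped
            seen.add(name)
            node = self.nodes[name]
            if not node.children:
                frontier.append(name)
            for child in node.children:
                visit(child)

        visit(root)
        return frontier


def refine(dag: ProofDag, root: str,
           try_prove: Callable[[str], bool], max_rounds: int = 5) -> bool:
    """Repeatedly attack open leaves until the root closes or rounds run out."""
    for _ in range(max_rounds):
        if dag.is_proved(root):
            return True
        for name in dag.open_leaves(root):
            if try_prove(name):  # stand-in for model call + Lean check
                dag.nodes[name].verified = True
    return dag.is_proved(root)


# Two alternative blueprints (OR) that share lemma_2.
dag = ProofDag()
dag.add("theorem", Kind.OR, ("plan_a", "plan_b"))
dag.add("plan_a", Kind.AND, ("lemma_1", "lemma_2"))
dag.add("plan_b", Kind.AND, ("lemma_2", "lemma_3"))
for leaf in ("lemma_1", "lemma_2", "lemma_3"):
    dag.add(leaf)

# Stub prover: pretend lemma_1 never goes through.
solved = refine(dag, "theorem", try_prove=lambda name: name != "lemma_1")
print(solved, dag.open_leaves("plan_a"))  # True ['lemma_1']
```

In this toy run the theorem closes through `plan_b` even though `lemma_1` stays open, which is the kind of fallback an OR node is meant to capture.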

Figure: LEAP workflow diagram

Lean‑IMO‑Bench

To provide a challenging evaluation set, the authors created Lean‑IMO‑Bench, a benchmark of 60 International Mathematical Olympiad problems formally encoded in Lean. The problems span algebra, combinatorics, number theory, and geometry, and each has been hand‑formalized and independently verified.

Experimental results

On Lean‑IMO‑Bench, a generic foundation model's success rate rises from below 10% in the single‑shot setting to 70% with LEAP, surpassing Aristotle, the previously leading specialized system.

In the 2025 Putnam Competition, conventional single‑shot models score 0, whereas LEAP solves all 12 problems.

Analysis

The authors argue that the low success of prior approaches stems not from insufficient mathematical ability of the models but from the absence of structured interaction with the verifier. By maintaining an evolving AND‑OR DAG, LEAP provides explicit dependency tracking and anticipatory lemma planning, allowing the model to break down complex proofs into tractable sub‑goals.
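
To make "tractable sub‑goals" concrete, here is a toy Lean 4 skeleton, assuming Mathlib is available. The statement and lemma names are invented for illustration and do not come from the paper or the benchmark. Each sub‑goal is stated formally but left as `sorry`, Lean's placeholder for a missing proof, so the checker can already confirm that the target follows from the sub‑goals before any of them is proved.

```lean
import Mathlib

namespace BlueprintSketch

/-- Sub-goal 1 (hypothetical): rewrite the expression as a product. -/
theorem sq_add_self_eq (n : ℕ) : n ^ 2 + n = n * (n + 1) := by
  sorry

/-- Sub-goal 2 (hypothetical): a product of consecutive naturals is even. -/
theorem even_mul_succ (n : ℕ) : Even (n * (n + 1)) := by
  sorry

/-- Target: follows from the two sub-goals above. -/
theorem even_sq_add_self (n : ℕ) : Even (n ^ 2 + n) := by
  rw [sq_add_self_eq]
  exact even_mul_succ n

end BlueprintSketch
```

Lean accepts this file with warnings about the two `sorry` placeholders; each one marks a sub‑goal that can then be attempted on its own.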

Implications

These findings demonstrate that generic foundation models, when equipped with appropriate agentic scaffolding, can achieve state‑of‑the‑art performance in highly specialized formal domains without extensive fine‑tuning.

Future directions

Planned work includes exploring hybrid architectures that combine the high‑level reasoning of foundation models with fine‑tuned specialist modules for precise step generation.

Paper: https://arxiv.org/abs/2606.03303

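Source: ScienceAI
This article is about 1,500 characters long; suggested reading time is 5 minutes.
Using a large model to break a mathematical proof down into a construction blueprint that can be refined iteratively.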
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
