Artificial Intelligence 14 min read

Can AI Really Prove Math? Inside LongCat‑Flash‑Prover’s Breakthrough

LongCat‑Flash‑Prover, an open‑source AI model that decomposes theorem proving into auto‑formalization, sketching, and proving with tool‑integrated reasoning, achieves SOTA results on MiniF2F‑Test (97.1% with only 72 inference steps) and strong performance on MathOlympiad‑Bench and PutnamBench, demonstrating that AI can move from guessing answers to rigorous, verifiable mathematical proofs.

Meituan Technology Team

Apr 2, 2026

Can AI Really Prove Math? Inside LongCat‑Flash‑Prover’s Breakthrough

Large language models excel at generating text and code, yet they struggle with the strict logical chains required for formal theorem proving. To bridge this gap, the research team released LongCat‑Flash‑Prover, an open‑source model specifically designed for mathematical formalization and proof generation.

Model Architecture and Core Capabilities

LongCat‑Flash‑Prover breaks the proof process into three atomic abilities:

Auto‑Formalization (Auto‑Formalization) – “Translate the problem”: Converts natural‑language statements into Lean4‑compatible formal statements.

Sketching – “Draft the solution”: Generates a high‑level outline that splits a complex theorem into lemmas and a main proof body.

Proving – “Fill in the details”: Completes the logical steps for each lemma, producing a full Lean4 proof.

These abilities are orchestrated by a Tool‑Integrated Reasoning (TIR) framework that iteratively combines specialist expert models, allowing single‑round and multi‑round reasoning with tool feedback.

Hybrid Expert Iteration Framework

The training pipeline consists of two stages:

Cold‑Start Phase: An in‑house Auto‑Formalizer (ATF‑32B) generates formal statements, which are refined by LongCat‑Flash‑Thinking‑2601 using Lean4 Server and semantic consistency checks. The resulting high‑quality trajectories form a curated cold‑start dataset, merged via mixed‑domain SFT.

Iteration Phase: The cold‑start model becomes the new expert; synthetic trajectories are generated, enriched with general data, and refined through successive SFT and RL rounds, ultimately yielding the final LongCat‑Flash‑Prover.

Tooling and Safety Mechanisms

The system relies on several verification tools:

Lean4 Server: Checks syntax and proof validity, providing precise error locations.

Semantic Consistency: An LLM‑as‑Judge model ensures the formal statement matches the original informal problem.

Theorem Consistency: Prevents the model from altering the target theorem during proof generation.

Legality Verification: Detects nine common cheating behaviors such as inserting #exit, fabricating axioms, or modifying the problem statement.

Data Synthesis Workflow

For each informal problem, the pipeline follows these steps:

Generate N formal statements via the Auto‑Formalization expert; keep those that are syntactically correct and semantically aligned.

If none are correct, activate TIR to iteratively refine statements using Lean4 Server and semantic scoring.

Attempt Whole‑Proof generation; evaluate N candidates with Lean4 Server and theorem consistency.

If Whole‑Proof fails, switch to Sketch‑Proof: generate N sketches, each containing lemmas and a main body, refined via TIR.

Prove each lemma individually using the Prover expert under TIR.

Training Stabilization Techniques

During RL training, two sources of train‑inference mismatch were addressed:

Sequence‑level Masking: Compute the geometric mean of importance‑sampling ratios (IS Ratio) across a sequence; discard sequences with excessive mismatch.

Token‑level Masking and Staleness Control: Remove tokens with high train‑inference divergence and apply gradient clipping to limit stale updates.

Evaluation and Results

LongCat‑Flash‑Prover sets new SOTA on all automatic formalization benchmarks, achieving 100% scores on MiniF2F‑Test and ProofNet. With only 72 inference steps, it reaches 97.1% pass rate on MiniF2F‑Test, far surpassing existing open‑source models. On challenging competition‑level suites, it attains 46.7% (MathOlympiad‑Bench), 52.2% (ProofNet), 70.8% (ProverBench), and 41.5% (PutnamBench).

Sketching improves accuracy by roughly 10% under the same computational budget, confirming the benefit of breaking proofs into lemmas.

Conclusions

LongCat‑Flash‑Prover demonstrates that AI can transition from answer‑guessing to producing fully verifiable mathematical proofs, positioning AI as a potential “infrastructure” for mathematical research, education, and discovery. The model and all associated data are fully open‑source, inviting collaboration from the academic and open‑source communities.

formal verification AI theorem proving Lean4 tool‑integrated reasoning

Written by

Meituan Technology Team

Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.