DeepSeek Prover V2: Pioneering the Next Era of AI‑Driven Formal Math Reasoning

DeepSeek‑Prover‑V2, an open‑source LLM specialized for Lean 4, bridges intuitive high‑level reasoning and strict formal verification through sub‑goal decomposition, dual operation modes, and a novel cold‑start data pipeline. It achieves state‑of‑the‑art results on MiniF2F, PutnamBench, and CombiBench, while highlighting trade‑offs in inference cost and model scale.


Challenge of Formal Mathematical Reasoning for LLMs

Formal proof assistants such as Lean, Isabelle, and Coq require every inference step to be justified by an axiom or a previously proved theorem. Standard large language models (LLMs) struggle with this because they rely on intuition, skip “obvious” steps, and are trained on informal text. The difficulty is amplified by strict syntax, long‑range dependencies, a huge search space of possible proof strategies, and the scarcity of paired informal‑formal training data.

DeepSeek‑Prover‑V2 Overview

DeepSeek‑Prover‑V2 is an open‑source LLM specialized for Lean 4. It builds on the general‑purpose DeepSeek‑V3 model’s planning ability and adds a dedicated proof‑generation pipeline that unifies informal intuition with formal verification.

Recursive Sub‑Goal Decomposition

The model follows a three‑stage workflow:

Decompose : The target theorem is split into smaller lemmas (sub‑goals), expressed in Lean as have … := by sorry statements.

Conquer : Each sub‑goal is handed to a smaller prover (initially a 7‑billion‑parameter model) that searches for a concrete proof.

Combine : Verified sub‑goal proofs are re‑inserted into the original outline, yielding a complete Lean proof.

This mirrors the human practice of proving a theorem by first outlining lemmas, solving them individually, and then assembling the final proof.
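The decompose stage can be pictured as a Lean 4 skeleton in which each sub‑goal is stated with have but left unproved via sorry; the theorem and hypothesis names below are illustrative, not taken from the paper:

```lean
-- Outline stage: the target is split into two `have` sub-goals,
-- each closed with `sorry` for a smaller prover to fill in later.
theorem two_dvd_mul_succ (n : ℕ) : 2 ∣ n * (n + 1) := by
  have heven : n % 2 = 0 → 2 ∣ n * (n + 1) := by sorry
  have hodd : n % 2 = 1 → 2 ∣ n * (n + 1) := by sorry
  sorry  -- combine: case-split on n % 2 and apply heven / hodd
```

In the conquer stage, the 7B prover replaces each sorry with concrete tactics; the combine stage splices the verified fragments back into this outline.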

Operation Modes

Non‑CoT (non‑chain‑of‑thought) mode : Generates only Lean code without an explicit reasoning plan. It is faster and suited for quick exploration.

CoT (chain‑of‑thought) mode : First outputs a structured plan of intermediate “have” statements, then produces the final Lean proof. This yields higher success rates on difficult problems at the cost of longer outputs.

Training Pipeline

The core innovation is a “cold‑start” data generation process:

For each theorem, DeepSeek‑V3 creates a high‑level proof outline in Lean syntax containing placeholder have … := by sorry statements.

The outline is fed to a small prover that recursively fills each sub‑goal.

Completed sub‑goals are recombined, producing a fully verified Lean proof paired with the original informal outline. This synthetic dataset fuels subsequent training stages.
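The recombination step can be sketched in a few lines of Python. This is a simplified stand‑in, not the paper's implementation: the real pipeline invokes DeepSeek‑V3 for outlines, a 7B prover per sub‑goal, and the Lean compiler for verification; here a caller‑supplied `prove` function plays the prover's role.

```python
def fill_subgoals(outline: str, prove) -> str:
    """Replace each `sorry` placeholder in a Lean proof skeleton with
    a proof produced by `prove(i)`, where i indexes the placeholders."""
    pieces = outline.split("sorry")
    proof = pieces[0]
    for i, rest in enumerate(pieces[1:]):
        proof += prove(i) + rest
    return proof

# Toy outline with two unproved sub-goals and a stub "prover".
outline = "have h1 : P := by sorry\nhave h2 : Q := by sorry"
completed = fill_subgoals(outline, lambda i: f"exact proof_{i}")
```

A completed outline plus its original informal sketch forms one cold‑start training pair.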

Training stages:

Expert iteration (non‑CoT) : Verified proofs are added to the training set and the model is fine‑tuned iteratively.

Supervised fine‑tuning (SFT) : Mixes expert‑iteration data with the cold‑start pairs.

Reinforcement learning (RL) sprint : Optimizes the model with Group‑Relative Policy Optimization (GRPO). The reward combines a binary signal (+1 when Lean verifies the proof) with a consistency reward that requires the generated proof to contain the lemmas proposed in the earlier CoT plan.

Distillation : Knowledge from a 671‑billion‑parameter RL model is distilled into a 7‑billion‑parameter model, extending its context length and applying the same RL loop.

Performance Evaluation

Benchmarks were run with various sampling budgets (pass@k). Results:

MiniF2F (Lean‑formalized competition problems):

671B CoT model: 88.9% pass@8192, 82.4% pass@32, 61.9% pass@1.

7B CoT model: 82.0% pass@8192.


PutnamBench (649 competition problems):

671B CoT model solved 49 problems (pass@1024).

7B CoT model solved 11 problems.

7B non‑CoT model solved 23 problems, including 13 that the 671B model missed.

CombiBench (77 combinatorial problems, answer‑provided setting):

671B CoT model solved 12 problems (pass@16).

Kimina‑Prover solved 7; its 7B variant solved 10.

Across all benchmarks, CoT mode consistently outperformed non‑CoT, especially with higher sampling budgets, but generated longer outputs (e.g., ~6752 tokens for 671B CoT vs ~762 tokens for non‑CoT on MiniF2F), leading to higher inference latency and cost.
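The pass@k figures above can be computed with the standard unbiased estimator (Chen et al., 2021): given n sampled proofs of which c verify, it gives the probability that a budget of k samples contains at least one success.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c correct), succeeds."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so success is certain
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 32 samples with 8 verified proofs at budget k = 1
p = pass_at_k(32, 8, 1)
```

For k = 1 this reduces to the plain success rate c / n.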

Technical Details

GRPO vs PPO : Unlike PPO, GRPO needs no separate critic network; it samples a group of candidate proofs per problem and computes each candidate's advantage relative to the group's average reward, which is more efficient for sparse‑reward tasks like theorem proving.
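The group‑relative advantage at the heart of GRPO is simple to state: normalize each candidate's reward against its own group's mean and standard deviation. A minimal sketch (the full algorithm also includes the clipped policy‑gradient objective, omitted here):

```python
def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sample's reward by the
    group's mean and std, so no learned critic/value network is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against an all-equal group
    return [(r - mean) / std for r in rewards]

# Sparse binary rewards from Lean verification: one success in four.
adv = grpo_advantages([0.0, 0.0, 1.0, 0.0])
```

With binary verification rewards, the one verified proof in a group gets a positive advantage and the failures get negative ones, which is exactly the signal the sparse‑reward setting provides.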

Consistency reward : Ensures that proofs generated in RL contain the specific lemmas from the CoT plan, preserving alignment between high‑level planning and low‑level code.
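A crude proxy for such a consistency check is a substring match between the planned lemma statements and the final proof text. This is a simplified illustration of the idea, not the paper's actual reward implementation:

```python
def consistency_reward(plan_lemmas, final_proof):
    """Fraction of planned lemma statements that appear verbatim in the
    final Lean proof (a simplified stand-in for the paper's reward)."""
    if not plan_lemmas:
        return 0.0
    hits = sum(1 for lemma in plan_lemmas if lemma in final_proof)
    return hits / len(plan_lemmas)

# One of the two planned `have` statements survives into the proof.
r = consistency_reward(
    ["have h1 : P", "have h2 : Q"],
    "have h1 : P := by exact hp\nexact trivial",
)
```

Combined with the binary verification signal, this pushes the RL policy to actually execute the plan it wrote rather than abandon it.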

Cardinality skill discovery : The 7B non‑CoT model learned to use Cardinal.toNat and Cardinal.natCast_inj on Putnam problems, an ability that emerged during RL and distillation.

Prompt engineering : Non‑CoT prompts request only Lean code; CoT prompts explicitly ask for a structured plan before code generation, demonstrating the impact of prompt design on model behavior.
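As a rough illustration of the two modes (the exact prompt wording used by DeepSeek‑Prover‑V2 is not reproduced here; these templates are hypothetical):

```python
# Hypothetical prompt templates for the two generation modes.
NON_COT_TEMPLATE = "Complete the following Lean 4 proof:\n{theorem}"
COT_TEMPLATE = (
    "First write a structured proof plan as a sequence of `have` "
    "statements, then complete the following Lean 4 proof:\n{theorem}"
)

prompt = COT_TEMPLATE.format(theorem="theorem t : 1 + 1 = 2 := by sorry")
```

The only difference is whether the model is asked for a plan before the code, yet it changes both success rate and output length substantially.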

Role of DeepSeek‑V3 : Provides high‑quality high‑level outlines; its planning capability is crucial for the downstream prover.

Significance

Demonstrates a viable path for bridging informal intuition and formal verification in AI‑assisted theorem proving.

Shows that large‑scale LLM planning combined with specialized proof search can achieve state‑of‑the‑art results on diverse formal benchmarks.

Open‑source release of DeepSeek‑Prover‑V2 and the ProverBench dataset encourages further research in neural theorem proving.

LLM · reinforcement learning · AI mathematics · DeepSeek Prover V2 · formal theorem proving · Lean 4 · MiniF2F · PutnamBench
Written by

AI Algorithm Path

A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.
