DeepSeekMath‑V2 Scores 118/120 on Putnam and Achieves Gold‑Level IMO Performance

DeepSeekMath‑V2, released open‑source on 27 Nov 2025, attains gold‑level results on IMO 2025, scores 118 out of 120 on the Putnam 2024 competition, introduces a generator‑verifier self‑verification architecture, uses GRPO training, and outperforms leading closed‑source models on IMO‑ProofBench.


DeepSeekMath‑V2 benchmark performance

DeepSeekMath‑V2, a 685 billion‑parameter model released on Hugging Face, achieved the following official competition results:

IMO 2025 (International Mathematical Olympiad): gold-level, solving 5 of 6 problems.

CMO 2024 (Chinese Mathematical Olympiad): gold-level, solving 4 problems.

Putnam 2024: 118/120 points, a near-perfect score on what is widely regarded as the most difficult undergraduate math contest.

System 1 vs. System 2 reasoning

Current large language models mainly exhibit “System 1” behavior—fast, intuitive responses such as answering 2 + 2 = 4 without explicit reasoning. Complex theorem proving requires “System 2” behavior—slow, deliberative processes that involve logical deduction, back‑tracking, and step‑by‑step verification. DeepSeekMath‑V2 is positioned as a System 2 model that prioritizes thoughtful reasoning over immediate answer generation.

Generator‑Verifier dual‑core architecture

The model adopts a two‑component design:

Generator (student): produces proof steps and a chain-of-thought.

Verifier (teacher): inspects each generated line, flags logical gaps, and forces rewrites until the proof is internally consistent.

This self‑verification loop gives the model meta‑cognitive capability—recognizing when it does not know or has made an error.
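The loop described above can be sketched in a few lines of Python. This is a toy illustration, not DeepSeekMath-V2's actual implementation: `generate` and `verify` are hypothetical stand-ins (here, trivial string checks) for the generator and verifier models, and the control flow simply shows how verifier feedback drives regeneration until the proof passes.

```python
def generate(problem, feedback=None):
    # Hypothetical generator stub: emits proof steps; on a retry it
    # "repairs" the steps the verifier flagged by index.
    steps = ["assume n is even", "so n = 2k", "hence n^2 = 4k^2"]
    if feedback:
        steps = [s + " (revised)" if i in feedback else s
                 for i, s in enumerate(steps)]
    return steps

def verify(steps):
    # Hypothetical verifier stub: flags any unrevised step containing
    # the deliberately vague word "assume". Returns flagged indices.
    return [i for i, s in enumerate(steps)
            if "assume" in s and "revised" not in s]

def prove(problem, max_rounds=4):
    # Generate, verify, and feed flagged steps back until the
    # verifier accepts or the round budget is exhausted.
    feedback = None
    for round_ in range(max_rounds):
        steps = generate(problem, feedback)
        issues = verify(steps)
        if not issues:          # verifier accepts the proof
            return steps, round_
        feedback = issues       # send flagged indices back to generator
    return steps, max_rounds

proof, rounds = prove("n even implies n^2 divisible by 4")
```

The point of the structure is that the stop condition belongs to the verifier, not the generator: the generator never gets to declare its own output finished.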

IMO‑ProofBench comparative results

On the IMO‑ProofBench benchmark, the following scores were reported:

DeepSeekMath-V2 (open-source): 61.9% accuracy.

Gemini Deep Think (closed-source): 65.7% accuracy.

GPT-5 (closed-source): 20.0% accuracy.

Claude 3.5 (closed-source): 4.8% accuracy.

These figures show that an open‑source model can match or challenge the performance of leading closed‑source systems on top‑tier mathematical logic tasks.

GRPO (Group‑Relative Policy Optimization) training method

Traditional reinforcement learning for language models relies on an external value model to score each step, which is costly. GRPO replaces the external critic by having the model generate a set of candidate answers, comparing them internally, rewarding those that exceed the group average and penalizing those below it. This “internal race” enables strong training signals with minimal computational resources.
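The core of the group-relative baseline can be shown in a short sketch. This is a minimal illustration of the advantage computation only (the sampled rewards are made-up verifier scores, and the full GRPO loss with policy gradients and clipping is omitted): each candidate in a group is scored against the group's own mean and standard deviation instead of an external value model.

```python
import statistics

def group_relative_advantages(rewards):
    # GRPO's key idea: the baseline is the group itself. Candidates
    # above the group mean get positive advantage (reinforced),
    # those below get negative advantage (penalized).
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Hypothetical verifier scores for one group of sampled proofs.
rewards = [0.9, 0.4, 0.4, 0.1]
adv = group_relative_advantages(rewards)
```

Because the baseline comes from the same batch of samples, no separate critic network has to be trained or queried, which is where the cost saving comes from.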

Deployment configurations

Full-precision (FP16) version: requires approximately 1.5 TB of VRAM, suitable for H100-class GPU clusters.

Quantized 4-bit version: reduces VRAM demand to about 386 GB, allowing execution on smaller laboratory setups.

API access: priced at roughly 1% of the cost of comparable GPT-4 usage, providing a cost-effective way to query the model.

Future lite version: planned to run on consumer GPUs such as the RTX 4090 or RTX 3060.
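The VRAM figures above follow from simple parameter-count arithmetic. The sketch below computes raw weight storage only; the quoted deployment numbers (~1.5 TB and ~386 GB) are larger because real serving also needs room for activations, KV cache, and framework overhead.

```python
def weight_memory_gb(n_params, bits_per_param):
    # Raw weight storage in GB (decimal). Excludes activations,
    # KV cache, and runtime overhead, which push real requirements
    # above this floor.
    return n_params * bits_per_param / 8 / 1e9

N = 685e9  # parameter count reported for DeepSeekMath-V2

fp16_gb = weight_memory_gb(N, 16)  # 685B params at 2 bytes each
int4_gb = weight_memory_gb(N, 4)   # 685B params at 0.5 bytes each
```

At 16 bits per parameter the weights alone come to about 1,370 GB, and at 4 bits about 343 GB, consistent with the ~1.5 TB and ~386 GB deployment figures once overhead is added.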

Tags: LLM, benchmark, GRPO, math reasoning, self-verification, DeepSeekMath-V2
Written by

ShiZhen AI

Tech blogger with over 10 years of experience at leading tech firms; AI efficiency and delivery expert focused on AI productivity. Covers tech gadgets, AI-driven efficiency, and leisure, and runs an AI leisure community. 🛰 szzdzhp001
