DeepSeek Math V2 & V3.2: A Plain‑Language Deep Dive into Core Innovations

This article provides a detailed, easy‑to‑understand analysis of DeepSeek‑Math‑V2’s self‑verification training method and DeepSeek‑V3.2’s GRPO framework, sparse‑attention DSA mechanism, massive agent data pipeline, and benchmark results that place both models among the world’s top open‑source large language models.


DeepSeek‑Math‑V2: Self‑Verification Mathematical Reasoning

DeepSeek‑Math‑V2 adopts a self‑verification training paradigm in which three models – a generator (student), a validator (teacher), and a meta‑validator (principal) – interact in a continual self‑play loop. The generator produces detailed proof steps, the validator assigns a score (0, 0.5, 1) and explanatory comments, and the meta‑validator audits the validator’s scores for consistency.

Core architecture

Generator: solves mathematical proof problems and outputs a self‑evaluation of its solution.

Validator: evaluates the proof, generates a rating (0, 0.5, 1) and a grading comment.

Meta‑validator: checks the validator's ratings and comments to ensure reliable grading.
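The three‑role loop can be sketched as a toy control flow. The stub functions below are hypothetical stand‑ins (in a real system each would be an LLM call), meant only to show how the roles interact in one round of self‑play:

```python
import random

# Toy stand-ins for the three roles. Real models would be LLM calls;
# these stubs only illustrate the control flow of the self-play loop.
def generator(problem):
    # Produces a proof attempt plus a self-evaluation score.
    proof = f"proof sketch for: {problem}"
    self_eval = random.choice([0.0, 0.5, 1.0])
    return proof, self_eval

def validator(problem, proof):
    # Assigns a score in {0, 0.5, 1} and a grading comment.
    score = random.choice([0.0, 0.5, 1.0])
    comment = f"score {score}: checked the steps of '{proof[:20]}...'"
    return score, comment

def meta_validator(score, comment):
    # Audits the validator: is the comment consistent with the score given?
    return str(score) in comment

def self_play_round(problem):
    proof, self_eval = generator(problem)
    score, comment = validator(problem, proof)
    audit_passed = meta_validator(score, comment)
    return {"proof": proof, "self_eval": self_eval,
            "score": score, "audit_passed": audit_passed}

result = self_play_round("show that sqrt(2) is irrational")
```

In the actual training loop, the audit signal from the meta‑validator feeds back into the validator's reward, and the validator's score feeds back into the generator's reward.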

Training procedure

Cold‑start (supervised) stage: over 17,000 expert‑annotated proof examples (each with a score of 0, 0.5, or 1) give the three roles initial competence. All three start from a fine‑tuned DeepSeek‑V3.2 checkpoint.

Step 1 – Train the validator: using the GRPO algorithm, the validator learns to assign scores and write grading comments, effectively learning a chain‑of‑thought for evaluation.

Step 2 – Train the meta‑validator: the meta‑validator learns to assess the validator's comments, preventing fabricated feedback.

Step 3 – Train the generator: the generator is rewarded by the enhanced validator and must also produce a self‑evaluation of its proof. Rewards combine proof correctness (3 parts) and self‑evaluation quality (1 part).

Step 4 – Rejection Fine‑Tuning (RFT): high‑quality (score‑1) data from all three roles are merged into a single "all‑round base model" that can solve, grade, and supervise.
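The 3:1 reward split in Step 3 amounts to a simple weighted average. A minimal sketch, assuming both component scores are normalized to [0, 1]:

```python
def generator_reward(proof_score: float, self_eval_score: float) -> float:
    """Blend proof correctness (3 parts) with self-evaluation quality
    (1 part), per the split described in Step 3. Both inputs are assumed
    to lie in [0, 1], so the blended reward also lies in [0, 1]."""
    return (3.0 * proof_score + 1.0 * self_eval_score) / 4.0
```

A fully correct proof with a useless self‑evaluation still earns 0.75, so the generator is pushed primarily toward correct proofs, with self‑assessment as a secondary objective.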

Benchmark results show near‑perfect scores on the CNML high‑school mathematics competition, 11/12 problems correct on the Putnam contest, and gold‑medal‑level performance on the IMO‑ProofBench dataset, surpassing GPT‑5‑Thinking and Gemini Deep Think.

DeepSeek‑V3.2: Scalable GRPO Framework and Sparse Attention

Released on 2025‑12‑01, DeepSeek‑V3.2 integrates the self‑verification innovations of Math‑V2 and adds a scalable GRPO training framework, unbiased KL estimation, off‑policy sequence masking, and a synthetic agent dataset of approximately 1,800 agents and 850,000 prompt‑response pairs. The resulting reinforcement‑learning‑after‑pre‑training corpus is ten times larger than the original pre‑training data.

The model incorporates DeepSeek Sparse Attention (DSA), which reduces inference cost by 30‑70 % while preserving accuracy.
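DSA‑style sparse attention saves cost by letting each query attend to only a small subset of keys rather than the full sequence. The toy below uses a hard top‑k selection as an illustration only; it is not DeepSeek's actual indexing mechanism, and the function names are my own:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def topk_sparse_attention(q, keys, values, k):
    """Toy sparse attention for a single query: score every key, keep
    only the k highest-scoring positions, and attend over that subset.
    With k fixed, per-query cost stops growing with sequence length
    once the cheap scoring pass is done."""
    scores = [sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    dim = len(values[0])
    out = [0.0] * dim
    for w, i in zip(weights, top):
        for d in range(dim):
            out[d] += w * values[i][d]
    return out, sorted(top)

out, selected = topk_sparse_attention(
    q=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]],
    values=[[1.0], [2.0], [3.0], [4.0]],
    k=2,
)
```

Here only keys 0 and 2 (the two most similar to the query) participate in the weighted sum; keys 1 and 3 are skipped entirely, which is where the inference savings come from.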

Key technical breakthroughs

Extensible GRPO framework for stable, efficient RL after pre‑training.

Unbiased KL estimation and off‑policy sequence masking to address stability issues in long training runs.

Large‑scale synthetic agent data pipeline (≈1,800 agents, 850,000 prompt‑response pairs) for diverse, high‑quality training data.

DSA sparse‑attention mechanism cutting model‑call cost by up to 70%.
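Two of the ingredients above can be sketched compactly. GRPO's core idea is a group‑relative advantage: sample several responses per prompt and normalize each reward against the group. The article does not specify which unbiased KL estimator V3.2 uses, so the sketch below illustrates with the well‑known "k3" form (r − 1 − log r), which is unbiased and non‑negative per sample:

```python
import math

def group_advantages(rewards):
    """GRPO-style group-relative advantages: normalize each sampled
    response's reward by the group's mean and standard deviation.
    No learned value network is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def kl_k3(logp_ref, logp_policy):
    """Per-token KL estimator in the k3 form: r - 1 - log r with
    r = p_ref / p_policy, for tokens sampled from the policy. Unbiased
    for KL(policy || ref) and always non-negative, unlike the naive
    -log r estimator. Shown here as a generic illustration."""
    log_r = logp_ref - logp_policy
    return math.exp(log_r) - 1.0 - log_r
```

Because advantages are centered within each group, they sum to zero per prompt, which keeps updates stable without a critic.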

Model variants

DeepSeek‑V3.2: general‑purpose flagship model.

DeepSeek‑V3.2‑Speciale: experimental model focused on long‑chain reasoning, trained on pure reasoning data with the self‑verification method from Math‑V2. Speciale achieves Gemini 3.0 Pro‑level results on major evaluation sets and is scheduled for API retirement on 2025‑12‑15.

Performance

Benchmarks place DeepSeek‑V3.2 on par with GPT‑5 in programming, mathematics, and agent tasks, and within 5% of Gemini 3.0 and Claude 4.5 on agent performance. On the IMO‑ProofBench dataset, DeepSeek‑Math‑V2 attains gold‑medal‑level scores, outperforming Gemini Deep Think.

Evolution roadmap

Late August 2025 – V3.1 released with hybrid reasoning architecture.

End of September 2025 – V3.2‑EXP experimental version introduced DSA, halving inference cost.

End of November 2025 – DeepSeek‑Math‑V2 launched, pioneering self‑verification training.

December 1, 2025 – DeepSeek‑V3.2 official release, integrating all prior innovations.

DeepSeek‑V3.2 architecture diagram

Code example

deepseek-reasoner
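A minimal sketch of calling the deepseek-reasoner model through DeepSeek's OpenAI‑compatible chat endpoint. Treat the URL, the payload field names, and the DEEPSEEK_API_KEY environment variable as assumptions to verify against the official API documentation; this snippet only builds the request and does not send it:

```python
import json
import os
import urllib.request

# Assumed OpenAI-compatible endpoint; confirm against DeepSeek's API docs.
API_URL = "https://api.deepseek.com/chat/completions"

def build_chat_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Construct (but do not send) an OpenAI-style chat-completions
    request targeting the deepseek-reasoner model."""
    body = {
        "model": "deepseek-reasoner",
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )

req = build_chat_request(
    "Prove that sqrt(2) is irrational.",
    os.environ.get("DEEPSEEK_API_KEY", "YOUR_API_KEY"),
)
payload = json.loads(req.data)
```

To actually send it, pass req to urllib.request.urlopen (or swap in the openai SDK with base_url pointed at the same host) and parse the JSON response.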
Written by

Fun with Large Models

Master's graduate from Beijing Institute of Technology, published four top‑journal papers, previously worked as a developer at ByteDance and Alibaba. Currently researching large models at a major state‑owned enterprise. Committed to sharing concise, practical AI large‑model development experience, believing that AI large models will become as essential as PCs in the future. Let's start experimenting now!
