How Klear-Reasoner Achieves SOTA Math & Code Reasoning with GPPO Optimization
The Klear‑Reasoner model, built on Qwen3‑8B‑Base and powered by the novel Gradient‑Preserving Clipping Policy Optimization (GPPO) algorithm, surpasses same‑size open‑source baselines on challenging math (AIME) and code (LiveCodeBench) benchmarks, while revealing key insights on data quality, reward design, and clipping strategies for large‑language‑model reasoning.
In the race for large language model (LLM) superiority, mathematical and code reasoning have become the areas where performance gaps are most decisive. Early efforts such as OpenAI's RLHF and DeepSeek's GRPO highlighted the promise of reinforcement learning for reasoning models, yet reproducing top results remains difficult for many open‑source projects.
The Kuaishou Klear team recently released Klear‑Reasoner, a model built on Qwen3‑8B‑Base. It reaches state‑of‑the‑art (SOTA) performance on several authoritative benchmarks, including AIME2024 (90.5% accuracy), AIME2025 (83.2% accuracy), and LiveCodeBench V5/V6, overtaking strong open‑source competitors such as DeepSeek‑R1‑0528‑8B.
1. The Hidden Costs of Traditional Clipping
Standard clipping in PPO/GRPO stabilizes training by limiting policy updates, but it introduces two problems: (1) high‑entropy tokens that are crucial for exploration have their gradients discarded once their importance ratio exceeds the clip bound, suppressing exploration; (2) tokens on sub‑optimal trajectories with low importance ratios are also clipped, so the model must repeat the same mistakes several times before enough corrective signal accumulates, delaying convergence.
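To make the first failure mode concrete, here is a minimal sketch of the standard token‑level clipped surrogate (assuming a PyTorch‑style implementation; the clip values are illustrative, not the paper's settings). When a positive‑advantage token's importance ratio exceeds the upper bound, the min() selects the clipped branch, which is constant with respect to the policy, so the token contributes no gradient:

```python
import torch

def clipped_surrogate(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.2):
    """Standard PPO/GRPO token-level clipped objective (to be maximized)."""
    ratio = torch.exp(logp_new - logp_old)                        # importance ratio r_t
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    return torch.min(unclipped, clipped)

# A high-entropy exploratory token with positive advantage whose ratio (~1.65)
# exceeds 1 + eps_high: min() picks the clipped branch, which is constant in
# logp_new, so the gradient -- and the exploration signal -- is discarded.
logp_old = torch.tensor(-3.0)
logp_new = torch.tensor(-2.5, requires_grad=True)
adv = torch.tensor(1.0)
clipped_surrogate(logp_new, logp_old, adv).backward()
print(logp_new.grad)                                              # tensor(0.)
```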
2. GPPO: Gradient‑Preserving Clipping
The Klear team proposes GPPO (Gradient‑Preserving Clipping Policy Optimization), which decouples clipping from gradient back‑propagation. Instead of discarding gradients, GPPO applies a stop_gradient to the clipping operation: the forward value stays identical to standard clipping, while clipped tokens still contribute to the backward pass. This preserves exploration signals and accelerates error correction.
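A minimal sketch of the gradient‑preserving idea (again assuming PyTorch; the detach‑based rescaling below is one common way to express the stop_gradient trick and is not claimed to be the paper's exact objective). The forward value matches standard clipping, but clipped tokens still pass a scaled gradient backward:

```python
import torch

def gppo_surrogate(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.2):
    """Gradient-preserving variant: same forward value as clipping, non-zero backward."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Detach the rescaling coefficient: forward value equals the clipped ratio,
    # but the backward pass still flows through `ratio`, so clipped tokens keep
    # contributing (scaled) gradients instead of being silently dropped.
    grad_preserving = (clipped / ratio).detach() * ratio
    return torch.min(ratio * adv, grad_preserving * adv)

# The same clipped token as in the previous sketch: the objective value matches
# standard clipping (~1.2), but the gradient is no longer zero.
logp_old = torch.tensor(-3.0)
logp_new = torch.tensor(-2.5, requires_grad=True)
adv = torch.tensor(1.0)
obj = gppo_surrogate(logp_new, logp_old, adv)
obj.backward()
print(obj.item(), logp_new.grad)      # ~1.2, non-zero gradient
```

In this sketch the preserved gradient is rescaled by the detached clipped‑to‑raw ratio, so updates from clipped tokens stay bounded rather than reverting to fully unclipped PPO.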
3. Experimental Validation
GPPO consistently outperforms baselines such as DAPO‑Clip‑Higher and CISPO on both math and code tasks, achieving the highest scores across multiple runs in the reported ablations. DAPO merely raises the clip upper bound without solving the underlying token‑clipping issue, while CISPO retains PPO’s pessimistic updates but lacks GPPO’s gradient preservation.
4. Additional Insights
SFT stage: Data quality outweighs sheer quantity. Experiments on top‑K high‑quality math and code datasets demonstrate that a small amount of clean data yields better supervision than large noisy corpora.
SFT stage – error tolerance: For hard tasks, retaining a portion of flawed reasoning paths can improve performance, as error examples provide valuable learning signals in low‑signal regimes.
RL stage – soft vs. hard reward: A soft reward proportional to the test‑case pass rate dramatically outperforms a binary hard reward, reducing reward sparsity and gradient variance (see the first sketch after this list).
RL stage – code data filtering: Filtering out code samples with failing test cases (keeping only those with pass@16 > 0.5) yields a noticeable boost on LiveCodeBench V5 (see the second sketch after this list).
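To illustrate the soft‑reward point, here is a tiny sketch (the function names and test‑result format are hypothetical, not from the Klear‑Reasoner codebase): the hard reward is binary on a full pass, while the soft reward equals the test‑case pass rate, giving partial credit and a denser learning signal:

```python
def hard_reward(test_results: list[bool]) -> float:
    # Binary: 1.0 only if every test case passes -- sparse signal on hard problems.
    return 1.0 if test_results and all(test_results) else 0.0

def soft_reward(test_results: list[bool]) -> float:
    # Proportional to the test-case pass rate -- denser, lower-variance signal.
    return sum(test_results) / len(test_results) if test_results else 0.0

results = [True, True, True, False]                # 3 of 4 hidden test cases pass
print(hard_reward(results), soft_reward(results))  # 0.0 0.75
```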
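And a sketch of the code‑data filtering step, under one plausible reading of the pass@16 > 0.5 criterion (the sample_and_judge callable and its interface are hypothetical): sample 16 solutions per prompt with the current model and keep the prompt only if more than half of them pass its test suite:

```python
def keep_code_prompt(prompt, sample_and_judge, n_rollouts=16, min_pass_rate=0.5):
    """Keep a code prompt only if sampled solutions pass its tests often enough."""
    # sample_and_judge(prompt) -> bool: generate one solution, run its test cases.
    passes = sum(sample_and_judge(prompt) for _ in range(n_rollouts))
    return passes / n_rollouts > min_pass_rate

# filtered_prompts = [p for p in code_prompts if keep_code_prompt(p, sample_and_judge)]
```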
5. Future Outlook
Klear‑Reasoner provides not only high‑performing open‑source model weights but also a reproducible training pipeline that balances stability and exploration via GPPO. The approach is expected to benefit future research on mathematical, coding, and broader RL‑augmented reasoning tasks.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.