How Klear-Reasoner Achieves SOTA Math & Code Reasoning with GPPO Optimization
The Klear‑Reasoner model, built on Qwen3‑8B‑Base and powered by the novel Gradient‑Preserving Clipping Policy Optimization (GPPO) algorithm, surpasses same‑size open‑source baselines on challenging math (AIME) and code (LiveCodeBench) benchmarks, while revealing key insights on data quality, reward design, and clipping strategies for large‑language‑model reasoning.
In the race for large language model (LLM) superiority, mathematical and code reasoning have become the areas where performance gaps are most decisive. Early efforts such as OpenAI's RLHF and DeepSeek's GRPO highlighted the promise of reinforcement learning for reasoning models, yet reproducing top results remains difficult for many open‑source projects.
The Kuaishou Klear team recently released Klear‑Reasoner, a model built on Qwen3‑8B‑Base. It reaches state‑of‑the‑art (SOTA) performance on several authoritative benchmarks, including AIME2024 (90.5% accuracy), AIME2025 (83.2% accuracy), and LiveCodeBench V5/V6, overtaking strong open‑source competitors such as DeepSeek‑R1‑0528‑8B.
1. The Hidden Costs of Traditional Clipping
Standard clipping in PPO/GRPO stabilizes training by limiting policy updates, but it introduces two problems: (1) high‑entropy tokens that are crucial for exploration have their gradients discarded once their importance ratio exceeds the clip bound, suppressing exploration; (2) tokens on sub‑optimal trajectories with low importance ratios are also clipped, so the model must repeat the same mistakes several times before enough corrective signal accumulates, delaying convergence.
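To make the first failure mode concrete, here is a minimal sketch of the standard token‑level clipped surrogate (assuming a PyTorch‑style implementation; the clip values are illustrative, not the paper's settings). When a positive‑advantage token's importance ratio exceeds the upper bound, the min() selects the clipped branch, which is constant with respect to the policy, so the token contributes no gradient:

```python
import torch

def clipped_surrogate(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.2):
    """Standard PPO/GRPO token-level clipped objective (to be maximized)."""
    ratio = torch.exp(logp_new - logp_old)                        # importance ratio r_t
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    return torch.min(unclipped, clipped)

# A high-entropy exploratory token with positive advantage whose ratio (~1.65)
# exceeds 1 + eps_high: min() picks the clipped branch, which is constant in
# logp_new, so the gradient -- and the exploration signal -- is discarded.
logp_old = torch.tensor(-3.0)
logp_new = torch.tensor(-2.5, requires_grad=True)
adv = torch.tensor(1.0)
clipped_surrogate(logp_new, logp_old, adv).backward()
print(logp_new.grad)                                              # tensor(0.)
```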
2. GPPO: Gradient‑Preserving Clipping
The Klear team proposes GPPO (Gradient‑Preserving Clipping Policy Optimization), which decouples clipping from gradient back‑propagation. Instead of discarding gradients, GPPO applies a stop_gradient to the clipping operation: the forward value stays identical to standard clipping, while clipped tokens still contribute to the backward pass. This preserves exploration signals and accelerates error correction.
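A minimal sketch of the gradient‑preserving idea (again assuming PyTorch; the detach‑based rescaling below is one common way to express the stop_gradient trick and is not claimed to be the paper's exact objective). The forward value matches standard clipping, but clipped tokens still pass a scaled gradient backward:

```python
import torch

def gppo_surrogate(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.2):
    """Gradient-preserving variant: same forward value as clipping, non-zero backward."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Detach the rescaling coefficient: forward value equals the clipped ratio,
    # but the backward pass still flows through `ratio`, so clipped tokens keep
    # contributing (scaled) gradients instead of being silently dropped.
    grad_preserving = (clipped / ratio).detach() * ratio
    return torch.min(ratio * adv, grad_preserving * adv)

# The same clipped token as in the previous sketch: the objective value matches
# standard clipping (~1.2), but the gradient is no longer zero.
logp_old = torch.tensor(-3.0)
logp_new = torch.tensor(-2.5, requires_grad=True)
adv = torch.tensor(1.0)
obj = gppo_surrogate(logp_new, logp_old, adv)
obj.backward()
print(obj.item(), logp_new.grad)      # ~1.2, non-zero gradient
```

In this sketch the preserved gradient is rescaled by the detached clipped‑to‑raw ratio, so updates from clipped tokens stay bounded rather than reverting to fully unclipped PPO.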
3. Experimental Validation
GPPO consistently outperforms baselines such as DAPO‑Clip‑Higher and CISPO on both math and code tasks, achieving the highest scores across multiple runs in the reported ablations. DAPO merely raises the clip upper bound without solving the underlying token‑clipping issue, while CISPO retains PPO’s pessimistic updates but lacks GPPO’s gradient preservation.
4. Additional Insights
SFT stage: Data quality outweighs sheer quantity. Experiments on top‑K high‑quality math and code datasets demonstrate that a small amount of clean data yields better supervision than large noisy corpora.
SFT stage – error tolerance: For hard tasks, retaining a portion of flawed reasoning paths can improve performance, as error examples provide valuable learning signals in low‑signal regimes.
RL stage – soft vs. hard reward: A soft reward proportional to the test‑case pass rate dramatically outperforms a binary hard reward, reducing reward sparsity and gradient variance (see the first sketch after this list).
RL stage – code data filtering: Filtering out code samples with failing test cases (keeping only those with pass@16 > 0.5) yields a noticeable boost on LiveCodeBench V5 (see the second sketch after this list).
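To illustrate the soft‑reward point, here is a tiny sketch (the function names and test‑result format are hypothetical, not from the Klear‑Reasoner codebase): the hard reward is binary on a full pass, while the soft reward equals the test‑case pass rate, giving partial credit and a denser learning signal:

```python
def hard_reward(test_results: list[bool]) -> float:
    # Binary: 1.0 only if every test case passes -- sparse signal on hard problems.
    return 1.0 if test_results and all(test_results) else 0.0

def soft_reward(test_results: list[bool]) -> float:
    # Proportional to the test-case pass rate -- denser, lower-variance signal.
    return sum(test_results) / len(test_results) if test_results else 0.0

results = [True, True, True, False]                # 3 of 4 hidden test cases pass
print(hard_reward(results), soft_reward(results))  # 0.0 0.75
```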
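And a sketch of the code‑data filtering step, under one plausible reading of the pass@16 > 0.5 criterion (the sample_and_judge callable and its interface are hypothetical): sample 16 solutions per prompt with the current model and keep the prompt only if more than half of them pass its test suite:

```python
def keep_code_prompt(prompt, sample_and_judge, n_rollouts=16, min_pass_rate=0.5):
    """Keep a code prompt only if sampled solutions pass its tests often enough."""
    # sample_and_judge(prompt) -> bool: generate one solution, run its test cases.
    passes = sum(sample_and_judge(prompt) for _ in range(n_rollouts))
    return passes / n_rollouts > min_pass_rate

# filtered_prompts = [p for p in code_prompts if keep_code_prompt(p, sample_and_judge)]
```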
5. Future Outlook
Klear‑Reasoner provides not only high‑performing open‑source model weights but also a reproducible training pipeline that balances stability and exploration via GPPO. The approach is expected to benefit future research on mathematical, coding, and broader RL‑augmented reasoning tasks.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.