How Klear-Reasoner Achieves SOTA Math & Code Reasoning with GPPO
Klear-Reasoner, built on Qwen3‑8B‑Base, introduces the Gradient‑Preserving Clipping Policy Optimization (GPPO) algorithm to overcome the limitations of traditional clipping, achieving state‑of‑the‑art performance on AIME2024/2025 and LiveCodeBench, along with detailed experimental analysis and data‑quality insights.
In the race for large language model (LLM) capabilities, mathematical and code reasoning have become decisive benchmarks. Building on Qwen3‑8B‑Base, the Klear team released Klear‑Reasoner, an open‑source model that reaches state‑of‑the‑art performance among models of the same scale on several authoritative benchmarks.
Model Overview
Klear‑Reasoner’s architecture and training details are fully disclosed, and the complete training pipeline has been made public.
Paper title: Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
Paper link: https://arxiv.org/pdf/2508.07629
Hugging Face: https://huggingface.co/Suu/Klear-Reasoner-8B
GitHub: https://github.com/suu990901/KlearReasoner/tree/main
Benchmark Performance
On AIME2024 the model achieves 90.5% accuracy, and on AIME2025 it reaches 83.2%, surpassing other open‑source 8B models such as DeepSeek‑R1‑0528‑8B and topping the 8B leaderboard.
GPPO: Gradient‑Preserving Clipping Policy Optimization
Traditional clipping (used in PPO, GRPO, etc.) stabilizes training by limiting policy updates, but it has two hidden drawbacks: (1) high‑entropy tokens—often crucial for exploration—have their gradients discarded when their importance sampling ratio exceeds the clip limit, making the model overly conservative; (2) sub‑optimal trajectories with low importance sampling ratios also lose gradients, slowing convergence because the model must repeat mistakes before receiving corrective signals.
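To make this failure mode concrete, below is a minimal PyTorch-style sketch of the standard clipped surrogate used in PPO/GRPO-type methods; the tensor names and the epsilon value are illustrative, not taken from the paper.

```python
import torch

def clipped_surrogate_loss(logp_new, logp_old, advantage, eps=0.2):
    """Standard PPO/GRPO-style clipped token loss (illustrative sketch).

    When the importance ratio leaves [1 - eps, 1 + eps] on the side that
    min() selects, the clipped branch is constant w.r.t. the policy, so
    that token contributes zero gradient -- the behaviour described above
    for high-entropy and low-ratio tokens.
    """
    ratio = torch.exp(logp_new - logp_old)               # importance ratio r_t
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)   # clipped ratio
    surrogate = torch.minimum(ratio * advantage, clipped * advantage)
    return -surrogate.mean()
```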
GPPO addresses these issues by decoupling the clipping constraint from back‑propagation. A stop‑gradient operation keeps the forward value identical to the standard clipped objective (the gradient‑carrying factor evaluates to 1 in the forward pass), while the backward pass uses a modified gradient that still flows through clipped tokens. Exploration signal from high‑entropy tokens and corrective signal from negative‑advantage tokens are therefore preserved without sacrificing the stability that clipping provides.
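The sketch below shows one way to realize gradient‑preserving clipping with a stop‑gradient (`detach` in PyTorch). It is an assumption‑laden illustration, not the paper's exact objective: the epsilon bounds are placeholders and the scaling of the preserved gradient may differ from the released implementation.

```python
import torch

def gppo_surrogate_loss(logp_new, logp_old, advantage,
                        eps_low=0.2, eps_high=0.2):
    """Gradient-preserving clipped token loss (sketch, not the exact paper form).

    `ratio / ratio.detach()` equals 1 in the forward pass, so the forward
    value matches the standard clipped objective; in the backward pass the
    factor still differentiates through `ratio`, so clipped tokens receive
    a (rescaled) gradient instead of none.
    """
    ratio = torch.exp(logp_new - logp_old)
    ratio_sg = ratio.detach()                                 # stop-gradient copy
    clipped_value = torch.clamp(ratio_sg, 1.0 - eps_low, 1.0 + eps_high)
    ratio_gp = (ratio / ratio_sg) * clipped_value             # forward == clipped value

    surrogate = torch.minimum(ratio * advantage, ratio_gp * advantage)
    return -surrogate.mean()
```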
Experimental Validation
Compared with DAPO’s clip‑higher strategy and the CISPO method, GPPO consistently outperforms on both mathematical and code tasks. DAPO adjusts the clip upper bound but still suffers from high‑entropy token clipping; CISPO lacks the pessimistic update of PPO, leading to noisier training signals. GPPO inherits PPO’s stable updates while preserving richer gradients.
Data Quality Insights
During supervised fine‑tuning (SFT), data quality proves more effective than sheer quantity. Experiments on top‑K high‑quality math and code subsets show that using only the best 1‑2 sources yields the highest performance, while adding low‑quality data introduces noise that harms learning. Additionally, retaining a portion of flawed reasoning paths can benefit training on hard tasks, as error‑containing samples provide valuable exploration signals.
Reinforcement Learning Phase
Soft rewards based on test‑case pass rates outperform hard binary rewards. Using pass‑rate as a dense reward reduces sparsity, lowers gradient variance, and leads to more stable and efficient learning, as demonstrated on LiveCodeBench V5.
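As a toy illustration of the difference between the two reward schemes (the function and argument names are hypothetical, not from the released code):

```python
def code_reward(num_passed: int, num_tests: int, soft: bool = True) -> float:
    """Toy reward for a generated program judged against unit tests.

    Hard reward: 1.0 only when every test passes, otherwise 0.0 (sparse).
    Soft reward: the test-case pass rate, giving partial credit and a
    denser, lower-variance learning signal.
    """
    if num_tests == 0:
        return 0.0
    pass_rate = num_passed / num_tests
    return pass_rate if soft else float(num_passed == num_tests)
```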
Future Outlook
Klear‑Reasoner not only delivers high‑performing open‑source weights but also a reproducible training pipeline that balances stability and exploration via GPPO. This approach is expected to benefit future math, code, and broader RL‑augmented reasoning tasks.