How Klear-Reasoner Achieves SOTA Math & Code Reasoning with GPPO
Klear-Reasoner, built on Qwen3‑8B‑Base, introduces the Gradient‑Preserving Clipping Policy Optimization (GPPO) algorithm to overcome the limitations of traditional clipping, achieving state‑of‑the‑art performance on AIME2024/2025 and LiveCodeBench, along with detailed experimental analysis and data‑quality insights.
In the race for large language model (LLM) capabilities, mathematical and code reasoning have become decisive benchmarks. Building on Qwen3‑8B‑Base, the Klear team released Klear‑Reasoner, an open‑source model that reaches state‑of‑the‑art performance among models of the same scale on several authoritative benchmarks.
Model Overview
Klear‑Reasoner’s architecture and training details are fully disclosed, and the complete training pipeline has been made public.
Paper title: Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
Paper link: https://arxiv.org/pdf/2508.07629
Hugging Face: https://huggingface.co/Suu/Klear-Reasoner-8B
GitHub: https://github.com/suu990901/KlearReasoner/tree/main
Benchmark Performance
On AIME2024 the model achieves 90.5% accuracy, and on AIME2025 it reaches 83.2%, surpassing other open‑source 8B models such as DeepSeek‑R1‑0528‑8B and topping the 8B leaderboard.
GPPO: Gradient‑Preserving Clipping Policy Optimization
Traditional clipping (used in PPO, GRPO, etc.) stabilizes training by limiting policy updates, but it has two hidden drawbacks: (1) high‑entropy tokens—often crucial for exploration—have their gradients discarded when their importance sampling ratio exceeds the clip limit, making the model overly conservative; (2) sub‑optimal trajectories with low importance sampling ratios also lose gradients, slowing convergence because the model must repeat mistakes before receiving corrective signals.
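To make this failure mode concrete, below is a minimal PyTorch-style sketch of the standard clipped surrogate used in PPO/GRPO-type methods; the tensor names and the epsilon value are illustrative, not taken from the paper.

```python
import torch

def clipped_surrogate_loss(logp_new, logp_old, advantage, eps=0.2):
    """Standard PPO/GRPO-style clipped token loss (illustrative sketch).

    When the importance ratio leaves [1 - eps, 1 + eps] on the side that
    min() selects, the clipped branch is constant w.r.t. the policy, so
    that token contributes zero gradient -- the behaviour described above
    for high-entropy and low-ratio tokens.
    """
    ratio = torch.exp(logp_new - logp_old)               # importance ratio r_t
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)   # clipped ratio
    surrogate = torch.minimum(ratio * advantage, clipped * advantage)
    return -surrogate.mean()
```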
GPPO addresses these issues by decoupling the clipping constraint from back‑propagation. A stop‑gradient operation keeps the forward value identical to the standard clipped objective (the gradient‑carrying factor evaluates to 1 in the forward pass), while the backward pass uses a modified gradient that still flows through clipped tokens. Exploration signal from high‑entropy tokens and corrective signal from negative‑advantage tokens are therefore preserved without sacrificing the stability that clipping provides.
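The sketch below shows one way to realize gradient‑preserving clipping with a stop‑gradient (`detach` in PyTorch). It is an assumption‑laden illustration, not the paper's exact objective: the epsilon bounds are placeholders and the scaling of the preserved gradient may differ from the released implementation.

```python
import torch

def gppo_surrogate_loss(logp_new, logp_old, advantage,
                        eps_low=0.2, eps_high=0.2):
    """Gradient-preserving clipped token loss (sketch, not the exact paper form).

    `ratio / ratio.detach()` equals 1 in the forward pass, so the forward
    value matches the standard clipped objective; in the backward pass the
    factor still differentiates through `ratio`, so clipped tokens receive
    a (rescaled) gradient instead of none.
    """
    ratio = torch.exp(logp_new - logp_old)
    ratio_sg = ratio.detach()                                 # stop-gradient copy
    clipped_value = torch.clamp(ratio_sg, 1.0 - eps_low, 1.0 + eps_high)
    ratio_gp = (ratio / ratio_sg) * clipped_value             # forward == clipped value

    surrogate = torch.minimum(ratio * advantage, ratio_gp * advantage)
    return -surrogate.mean()
```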
Experimental Validation
Compared with DAPO’s clip‑higher strategy and the CISPO method, GPPO consistently outperforms on both mathematical and code tasks. DAPO adjusts the clip upper bound but still suffers from high‑entropy token clipping; CISPO lacks the pessimistic update of PPO, leading to noisier training signals. GPPO inherits PPO’s stable updates while preserving richer gradients.
Data Quality Insights
During supervised fine‑tuning (SFT), data quality proves more effective than sheer quantity. Experiments on top‑K high‑quality math and code subsets show that using only the best 1‑2 sources yields the highest performance, while adding low‑quality data introduces noise that harms learning. Additionally, retaining a portion of flawed reasoning paths can benefit training on hard tasks, as error‑containing samples provide valuable exploration signals.
Reinforcement Learning Phase
Soft rewards based on test‑case pass rates outperform hard binary rewards. Using pass‑rate as a dense reward reduces sparsity, lowers gradient variance, and leads to more stable and efficient learning, as demonstrated on LiveCodeBench V5.
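As a toy illustration of the difference between the two reward schemes (the function and argument names are hypothetical, not from the released code):

```python
def code_reward(num_passed: int, num_tests: int, soft: bool = True) -> float:
    """Toy reward for a generated program judged against unit tests.

    Hard reward: 1.0 only when every test passes, otherwise 0.0 (sparse).
    Soft reward: the test-case pass rate, giving partial credit and a
    denser, lower-variance learning signal.
    """
    if num_tests == 0:
        return 0.0
    pass_rate = num_passed / num_tests
    return pass_rate if soft else float(num_passed == num_tests)
```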
Future Outlook
Klear‑Reasoner not only delivers high‑performing open‑source weights but also a reproducible training pipeline that balances stability and exploration via GPPO. This approach is expected to benefit future math, code, and broader RL‑augmented reasoning tasks.