
Is GRPO Obsolete? Why GLM‑5.2 Dropped It and What It Means for RL

GLM‑5.2 replaces Group Relative Policy Optimization (GRPO) with critic‑based PPO for long‑horizon tasks, arguing that GRPO’s group comparison breaks down on variable‑length trajectories. The shift has sparked vigorous debate across the reinforcement‑learning community.

Machine Heart

Jun 21, 2026


On June 13, Zhipu announced the open‑source release of GLM‑5.2, a 744‑billion‑parameter MoE model with 40 billion active parameters, an MIT license, and a 1M‑token context window. On the FrontierSWE benchmark it scored 74.4%, close to Claude Opus 4.8 (75.1%) and ahead of GPT‑5.5 (72.6%).

The most technically significant change is the abandonment of GRPO (Group Relative Policy Optimization) during the model’s long‑horizon reinforcement‑learning stage.

GRPO, introduced by DeepSeek in the 2024 DeepSeekMath paper and validated by DeepSeek‑R1, removes the need for a value network (critic). Instead, the model generates a group of responses (typically dozens) to the same prompt and uses the group’s average reward as a baseline: any answer above that baseline gets a positive advantage, and any answer below it gets a negative one. This works well for short, verifiable tasks such as math problems or unit tests, where all samples are comparable and the memory cost is low.
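The group baseline described above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's or Zhipu's implementation: the function name `group_relative_advantages`, the `normalize` flag, and the `eps` constant are assumptions made for the example. Dividing by the group's standard deviation is a common GRPO normalization step, but it goes beyond what this article describes.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize=True, eps=1e-8):
    """Score each sampled response against its own group (hypothetical helper).

    rewards: one scalar reward per response sampled for the SAME prompt.
    Returns one advantage per response; no value network is involved.
    """
    baseline = mean(rewards)                  # the group average is the baseline
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    scale = pstdev(rewards) + eps             # assumed normalization; eps avoids /0
    return [c / scale for c in centered]


# Four responses to one prompt, with a binary "answer is correct" reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards, normalize=False)
print(advantages)  # [0.5, -0.5, -0.5, 0.5]
```

Every response in the group shares one scalar baseline, which is why the approach is cheap and also why it assumes the responses are directly comparable.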

Long‑horizon agent tasks, however, produce sub‑trajectories of highly variable length after compaction, from a few tokens to dozens of steps. GRPO requires that all outputs in a group be comparable, and that assumption breaks down here: the sub‑trajectories cannot be fairly aligned, so large portions of the data become unusable.

Zhipu’s solution is to “bring the value network back.” GLM‑5.2’s long‑horizon RL stage now uses critic‑based PPO, which computes token‑level advantages from the critic’s value estimates. Because each token is scored against the critic rather than against sibling samples, sub‑trajectories of any length can be used and no group‑wise comparison is needed.
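To make the per-token idea concrete, here is a minimal sketch of Generalized Advantage Estimation (GAE), the estimator most commonly paired with PPO. The article does not say which estimator GLM‑5.2 uses, so GAE is an assumption, as are the function name `gae_advantages`, the `gamma` and `lam` defaults, and the toy reward and value numbers. In a real system the values would come from a trained critic network.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95, bootstrap_value=0.0):
    """Per-step advantages for ONE trajectory of any length (hypothetical helper).

    rewards[t]: reward received at step t.
    values[t]:  critic's value estimate at step t (toy numbers in this sketch).
    bootstrap_value: critic estimate after the last step (0.0 if terminal).
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    advantages = [0.0] * len(rewards)
    next_value = bootstrap_value
    running = 0.0
    # Walk backwards so each step can reuse the discounted sum from later steps.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error at step t
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Two sub-trajectories of different lengths, processed independently of each other.
long_adv = gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], lam=1.0)
short_adv = gae_advantages([1.0], [0.2], lam=1.0)
print(long_adv)  # [0.5, 0.5, 0.5]
```

Each trajectory is scored only against the critic's own predictions, so the one-step and three-step examples need no alignment with each other. The trade-off is the extra memory and training cost of the value network that GRPO was designed to avoid.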

The community reacted quickly. Some developers labeled the move a “critic comeback,” noting that variance‑reduction via group comparison fails after a certain task length and that OpenAI and Anthropic likely already rely on value networks. Others reported that actor‑critic outperformed GRPO in small‑scale experiments, while researchers highlighted that PPO’s generality makes it more robust for extended tasks.

Academic work supports this view. The arXiv paper “Learning Without Critics? Revisiting GRPO in Classical Reinforcement Learning Environments” (2025) found that in long‑horizon tasks without early‑termination mechanisms, GRPO consistently underperforms critic‑based PPO, whereas in short tasks like CartPole the two methods are comparable.

The broader implication is that reinforcement‑learning algorithm selection is becoming task‑specific rather than following a universal default. GRPO and its variants remain effective and cheap for short, verifiable tasks, but value‑network‑based PPO regains importance for multi‑round, sparse‑reward, long‑duration agent tasks.

Overall, GLM‑5.2’s shift highlights a maturation of open‑source LLM training: as models move from answering questions to acting as autonomous agents, the post‑training algorithmic choices must evolve accordingly.
