
DeepSeek R1 Introduces Group Relative Policy Optimization for Advanced Reasoning in Large Language Models

DeepSeek AI’s new open‑source model DeepSeek‑R1 leverages the Group Relative Policy Optimization (GRPO) reinforcement‑learning algorithm and a multi‑stage training pipeline to dramatically boost complex reasoning performance, achieving AIME 2024 Pass@1 scores comparable to OpenAI’s o1 model.

Cognitive Technology Team

DeepSeek AI announced DeepSeek‑R1, an open‑source large language model (LLM) designed to rival OpenAI’s o1 on complex reasoning tasks. The model employs a reinforcement‑learning algorithm called Group Relative Policy Optimization (GRPO) and a multi‑stage training pipeline to enhance reasoning, especially in mathematics.

GRPO simplifies training by removing the dependence on a separate value‑function (critic) model, cutting memory and compute costs. Where traditional Proximal Policy Optimization (PPO) learns a value function to estimate the baseline, GRPO samples a group of outputs for each prompt and uses their average reward as the baseline, handling multi‑output sampling naturally.
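As a minimal sketch of the group‑relative baseline (illustrative only, not DeepSeek's implementation), the advantage of each sampled output can be computed from its group's reward statistics rather than from a learned critic:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each sampled output is scored against
    the mean reward of its own group, normalized by the group's standard
    deviation, so no separate value-function (critic) model is needed."""
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()
    std = rewards.std()
    # Guard against a zero-variance group where every output got the
    # same reward (no learning signal in that case).
    if std == 0:
        return np.zeros_like(rewards)
    return (rewards - baseline) / std

# Example: four sampled completions for one prompt, two judged correct.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline comes from the group itself, the advantages always sum to zero within a group: correct outputs are pushed up exactly as much as incorrect ones are pushed down.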

During development, the DeepSeek team built on DeepSeek‑V3, applying GRPO directly to the base model and using rule‑based reward models to evaluate output format, mathematical correctness, and programming ability. This approach raised the AIME 2024 Pass@1 score from 15.6% to 71.0%, approaching the performance of OpenAI o1‑0912.
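A rule‑based reward of this kind can be sketched as follows (a toy illustration; the tag format, regexes, and weights here are assumptions, not DeepSeek's actual reward code): a format component checks that the reasoning is wrapped in designated tags, and an accuracy component checks the final answer against a reference.

```python
import re

def rule_based_reward(completion, reference_answer):
    """Toy rule-based reward combining a format check and an accuracy
    check. Weights (0.5 / 1.0) and the <think>/\\boxed{} conventions are
    illustrative assumptions."""
    reward = 0.0
    # Format reward: chain of thought enclosed in <think>...</think> tags.
    if re.search(r"<think>.*?</think>", completion, re.DOTALL):
        reward += 0.5
    # Accuracy reward: final answer in \boxed{...} matches the reference.
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    if m and m.group(1).strip() == str(reference_answer).strip():
        reward += 1.0
    return reward
```

Because such checks are deterministic string rules rather than a learned model, they are cheap to run at scale and hard for the policy to exploit through reward hacking.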

The training process consisted of four key stages:

1. Supervised fine‑tuning (SFT) on extensive chain‑of‑thought (CoT) data to stabilize early reinforcement learning.
2. GRPO on code and math tasks, with accuracy and format rewards to enforce language consistency.
3. Rejection sampling (RS) to generate large synthetic datasets for writing and role‑play tasks.
4. A final GRPO pass combining rule‑based and outcome‑based rewards to improve usefulness and safety.
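The rejection‑sampling stage can be sketched in a few lines (a hedged illustration, assuming hypothetical `generate` and `score` callables rather than DeepSeek's actual pipeline): sample many completions per prompt, keep only those a grader scores above a threshold, and reuse the survivors as fine‑tuning data.

```python
def rejection_sample(prompt, generate, score, n=16, threshold=0.9):
    """Sketch of a rejection-sampling (RS) stage: draw n candidate
    completions for a prompt, keep only those whose grader score clears
    the threshold. Survivors become synthetic SFT data."""
    candidates = [generate(prompt) for _ in range(n)]
    return [c for c in candidates if score(prompt, c) >= threshold]
```

The key design choice is that quality control happens at data‑creation time: low‑scoring generations are simply discarded, so the later fine‑tuning step never sees them.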

Notably, DeepSeek avoided Monte‑Carlo Tree Search (MCTS) and complex process‑reward models, finding that simple rule‑based rewards for accuracy and format often outperformed more elaborate schemes. The result is a model that not only shows significant gains in reasoning ability but also exhibits higher practicality and consistency across diverse tasks.

Tags: AI, large language models, DeepSeek, reasoning, reinforcement learning, GRPO
Written by

Cognitive Technology Team

Cognitive Technology Team regularly delivers the latest IT news, original content, programming tutorials, and experience sharing, with new content every day.
