
DeepSeek R1 Introduces Group Relative Policy Optimization for Advanced Reasoning in Large Language Models

DeepSeek AI’s new open‑source model DeepSeek‑R1 leverages the Group Relative Policy Optimization (GRPO) reinforcement‑learning algorithm and a multi‑stage training pipeline to dramatically boost complex reasoning performance, achieving AIME 2024 Pass@1 scores comparable to OpenAI’s o1 model.

Cognitive Technology Team

DeepSeek AI announced DeepSeek‑R1, an open‑source large language model (LLM) designed to rival OpenAI’s o1 on complex reasoning tasks. The model employs a reinforcement‑learning algorithm called Group Relative Policy Optimization (GRPO) and a multi‑stage training pipeline to enhance reasoning, especially in mathematics.

GRPO simplifies training by removing the dependence on a separate value‑function (critic) model, cutting memory and compute costs. Where traditional Proximal Policy Optimization (PPO) learns a value function to estimate the baseline, GRPO samples a group of outputs for each prompt and uses their average reward as the baseline, handling multi‑output sampling naturally.
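As a minimal sketch of the group‑relative baseline (illustrative only, not DeepSeek's implementation), the advantage of each sampled output can be computed from its group's reward statistics rather than from a learned critic:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each sampled output is scored against
    the mean reward of its own group, normalized by the group's standard
    deviation, so no separate value-function (critic) model is needed."""
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()
    std = rewards.std()
    # Guard against a zero-variance group where every output got the
    # same reward (no learning signal in that case).
    if std == 0:
        return np.zeros_like(rewards)
    return (rewards - baseline) / std

# Example: four sampled completions for one prompt, two judged correct.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline comes from the group itself, the advantages always sum to zero within a group: correct outputs are pushed up exactly as much as incorrect ones are pushed down.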

During development, the DeepSeek team built on DeepSeek‑V3, applying GRPO directly to the base model and using rule‑based reward models to evaluate output format, mathematical correctness, and programming ability. This approach raised the AIME 2024 Pass@1 score from 15.6% to 71.0%, approaching the performance of OpenAI o1‑0912.
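A rule‑based reward of this kind can be sketched as follows (a toy illustration; the tag format, regexes, and weights here are assumptions, not DeepSeek's actual reward code): a format component checks that the reasoning is wrapped in designated tags, and an accuracy component checks the final answer against a reference.

```python
import re

def rule_based_reward(completion, reference_answer):
    """Toy rule-based reward combining a format check and an accuracy
    check. Weights (0.5 / 1.0) and the <think>/\\boxed{} conventions are
    illustrative assumptions."""
    reward = 0.0
    # Format reward: chain of thought enclosed in <think>...</think> tags.
    if re.search(r"<think>.*?</think>", completion, re.DOTALL):
        reward += 0.5
    # Accuracy reward: final answer in \boxed{...} matches the reference.
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    if m and m.group(1).strip() == str(reference_answer).strip():
        reward += 1.0
    return reward
```

Because such checks are deterministic string rules rather than a learned model, they are cheap to run at scale and hard for the policy to exploit through reward hacking.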

The training process consisted of four key stages:

1. Supervised fine‑tuning (SFT) on extensive chain‑of‑thought (CoT) data to stabilize early reinforcement learning.
2. GRPO on code and math tasks, with accuracy and format rewards to enforce language consistency.
3. Rejection sampling (RS) to generate large synthetic datasets for writing and role‑play tasks.
4. A final GRPO pass combining rule‑based and outcome‑based rewards to improve usefulness and safety.
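The rejection‑sampling stage can be sketched in a few lines (a hedged illustration, assuming hypothetical `generate` and `score` callables rather than DeepSeek's actual pipeline): sample many completions per prompt, keep only those a grader scores above a threshold, and reuse the survivors as fine‑tuning data.

```python
def rejection_sample(prompt, generate, score, n=16, threshold=0.9):
    """Sketch of a rejection-sampling (RS) stage: draw n candidate
    completions for a prompt, keep only those whose grader score clears
    the threshold. Survivors become synthetic SFT data."""
    candidates = [generate(prompt) for _ in range(n)]
    return [c for c in candidates if score(prompt, c) >= threshold]
```

The key design choice is that quality control happens at data‑creation time: low‑scoring generations are simply discarded, so the later fine‑tuning step never sees them.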

Notably, DeepSeek avoided Monte‑Carlo Tree Search (MCTS) and complex process‑reward models, finding that simple rule‑based rewards for accuracy and format often outperformed more elaborate schemes. The result is a model that not only shows significant gains in reasoning ability but also exhibits higher practicality and consistency across diverse tasks.

Tags: AI, large language models, DeepSeek, reasoning, reinforcement learning, GRPO
Written by

Cognitive Technology Team

Cognitive Technology Team regularly delivers the latest IT news, original content, programming tutorials, and experience sharing, with new content every day.
