DeepSeek-R1: Enhancing Reasoning Capabilities in LLMs via Reinforcement Learning

DeepSeek‑R1 shows that large‑scale reinforcement learning, built on Group Relative Policy Optimization (GRPO) and a rule‑based reward scheme, can markedly improve reasoning in LLMs without heavy supervised fine‑tuning. A brief cold‑start SFT phase, two‑stage alignment, and knowledge distillation further improve performance and efficiency, although challenges such as language mixing remain.

Tencent Technical Engineering

Feb 21, 2025

This article covers the DeepSeek‑R1 series of large language models, focusing on how large‑scale reinforcement learning (RL) can enhance reasoning without relying on extensive supervised fine‑tuning (SFT). It introduces DeepSeek‑R1‑Zero, a model trained purely via RL using Group Relative Policy Optimization (GRPO). GRPO eliminates the need for a critic model, which reduces training cost.
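To make the "no critic model" point concrete, here is a minimal sketch of the group-relative normalization commonly associated with GRPO: each completion's reward is scored against the other completions sampled for the same prompt, so no learned value function is required. The function name, the `eps` term, and the use of a population standard deviation are assumptions for illustration and are not taken from the article.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each reward relative to its own sampling group.

    `rewards` holds one scalar reward per completion sampled for a single
    prompt. The group mean acts as the baseline a critic would otherwise
    provide. `eps` (an assumption) guards against division by zero when
    every completion in the group receives the same reward.
    """
    if not rewards:
        return []
    baseline = mean(rewards)
    spread = pstdev(rewards)  # population standard deviation of the group
    return [(r - baseline) / (spread + eps) for r in rewards]


# Four completions for one prompt: two rewarded, two not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group average get a positive advantage and the rest a negative one. When every completion in a group scores the same, all advantages are zero and that group contributes no learning signal.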

The work shows that even a small amount of SFT for cold‑start can further improve performance. DeepSeek‑R1 builds on this by adding a cold‑start phase with high‑quality chain‑of‑thought data before large‑scale RL, yielding better readability and reasoning performance.

Key technical components include GRPO for efficient RL, a rule‑based reward system combining accuracy and format rewards, language‑consistency rewards to mitigate multilingual mixing, rejection sampling for generating SFT data, and a two‑stage RL alignment process that optimizes helpfulness and harmlessness.
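The rule-based rewards listed above can be sketched as plain functions with no learned reward model. The sketch below is illustrative only: the `<think>`/`<answer>` template, the exact-match accuracy check, the ASCII-token proxy for language consistency, and the unweighted sum are all assumptions, not details given in the article.

```python
import re

# Assumed template (not from the article): reasoning inside <think>...</think>,
# followed by the final answer inside <answer>...</answer>.
TEMPLATE = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL
)


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed template, else 0.0."""
    return 1.0 if TEMPLATE.fullmatch(completion) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the extracted answer exactly matches the reference, else 0.0."""
    match = TEMPLATE.fullmatch(completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0


def language_consistency_reward(completion: str) -> float:
    """Fraction of whitespace-separated reasoning tokens that are ASCII.

    A crude stand-in for "share of the reasoning written in the target
    language", assuming the target language is English.
    """
    match = TEMPLATE.fullmatch(completion)
    if match is None:
        return 0.0
    tokens = match.group(1).split()
    if not tokens:
        return 0.0
    return sum(token.isascii() for token in tokens) / len(tokens)


def total_reward(completion: str, reference: str) -> float:
    """Unweighted sum of the three signals (the weighting is an assumption)."""
    return (
        accuracy_reward(completion, reference)
        + format_reward(completion)
        + language_consistency_reward(completion)
    )


sample = "<think>2 + 2 = 4</think>\n<answer>4</answer>"
score = total_reward(sample, "4")
```

Because every signal is a deterministic check, the reward is cheap to compute and harder to game than a learned reward model, at the cost of only working on tasks whose answers can be verified by rule.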

The article also covers model distillation, showing that transferring DeepSeek‑R1’s knowledge to smaller models (e.g., Qwen, Llama) via SFT yields significant gains, whereas pure RL on small models requires massive compute and still falls short of distilled performance.
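Distillation via SFT, like the rejection-sampling step mentioned earlier, comes down to building a supervised dataset from model samples and keeping only those that pass a check. A minimal sketch of that data-building loop follows; `build_sft_dataset`, `toy_teacher`, and `toy_checker` are hypothetical names, and the toy arithmetic teacher merely stands in for a large reasoning model. The actual fine-tuning of the student is not shown.

```python
import random


def build_sft_dataset(prompts, generate, is_acceptable, samples_per_prompt=4):
    """Sample several completions per prompt and keep only accepted ones.

    `generate(prompt)` plays the teacher; `is_acceptable(prompt, completion)`
    is the filter. Both are caller-supplied callables (hypothetical here).
    The returned prompt/completion pairs would then be used to fine-tune a
    smaller student model with ordinary supervised learning.
    """
    dataset = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = generate(prompt)
            if is_acceptable(prompt, completion):
                dataset.append({"prompt": prompt, "completion": completion})
    return dataset


def toy_teacher(prompt):
    """Stand-in teacher: adds two integers but is sometimes off by one."""
    a, b = map(int, prompt.split("+"))
    return str(a + b + random.choice([0, 0, 1]))


def toy_checker(prompt, completion):
    """Stand-in filter: accept only exactly correct sums."""
    a, b = map(int, prompt.split("+"))
    return completion == str(a + b)


random.seed(0)
sft_data = build_sft_dataset(["1+2", "3+4"], toy_teacher, toy_checker)
```

The student never sees the rejected samples, so the quality of the filter largely determines the quality of the distilled data.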

Finally, it discusses limitations such as language mixing, unsuccessful attempts with process reward models and Monte Carlo tree search, and outlines future directions like exploring ensemble learning and further RL alignment.
