Can RL‑Only Training Make LLMs Beat OpenAI‑o1? Inside DeepSeek‑R1’s Architecture and Results

DeepSeek‑R1’s open‑source series demonstrates that reinforcement‑learning‑only training can match top‑tier models like OpenAI‑o1, while a small amount of SFT further improves readability; the article dissects its technical report, training pipeline, reward design, distillation strategy, benchmark outcomes, and remaining challenges.


Background and Motivation

As the Chinese New Year approaches, DeepSeek released the DeepSeek‑R1 series as a major open‑source effort. The authors claim that the model’s performance on difficult benchmarks rivals OpenAI‑o1‑1217, placing it in the first tier of reasoning models. This article provides a concise interpretation of the technical report.

Main Contributions

DeepSeek‑R1‑Zero achieves strong results using only reinforcement learning (RL) without any supervised fine‑tuning (SFT).

DeepSeek‑R1 adds a small amount of chain‑of‑thought (CoT) data for a cold‑start SFT, then applies RL to further improve performance and alignment with human preferences.

Distilling the DeepSeek‑R1 model into smaller models yields surprisingly good results.

DeepSeek‑R1‑Zero: RL‑Only Training

The model starts from DeepSeek‑V3‑Base and uses DeepSeek’s GRPO (Group Relative Policy Optimization) algorithm with a straightforward prompt template.
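
For intuition, here is a minimal sketch of the group‑relative idea behind GRPO (my simplification, not DeepSeek’s implementation): rewards for a group of responses sampled from the same prompt are normalized against the group mean and standard deviation, and the resulting advantages then weight a PPO‑style clipped policy update.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each reward against the mean/std
    of the group of responses sampled for the same prompt.
    `rewards` is a list of scalar rewards, one per sampled response."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon avoids division by zero

# Example: 4 responses to one prompt, only the last two pass the checks.
print(group_relative_advantages([0.0, 0.0, 1.0, 1.0]))  # ≈ [-1, -1, 1, 1]
```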

Reward modeling for reasoning tasks combines two rule‑based reward signals (a minimal code sketch follows this list):

Accuracy reward: exact match for math problems; compilation and unit‑test verification for code problems.

Format reward: checks whether the CoT process is wrapped in <think> </think> tags.
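
A minimal sketch of what these rule‑based rewards could look like (the helper names and exact checks are assumptions; the report does not publish its reward code): the format reward looks for the <think> tags with a regular expression, and the accuracy reward compares the final answer against a reference.

```python
import re

THINK_PATTERN = re.compile(r"<think>.*?</think>", re.DOTALL)

def format_reward(output: str) -> float:
    """1.0 if the chain of thought is wrapped in <think>...</think> tags."""
    return 1.0 if THINK_PATTERN.search(output) else 0.0

def accuracy_reward(output: str, reference_answer: str) -> float:
    """1.0 if the final answer (text after </think>) exactly matches the
    reference for a math problem; code problems would instead be compiled
    and run against unit tests."""
    answer = output.split("</think>")[-1].strip()
    return 1.0 if answer == reference_answer.strip() else 0.0

def total_reward(output: str, reference_answer: str) -> float:
    """Combined rule-based reward used to train the policy with RL."""
    return accuracy_reward(output, reference_answer) + format_reward(output)
```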

Despite the simplicity of this approach, the model shows impressive gains, steadily improving with more training steps and approaching the performance of OpenAI‑o1‑0912.

Training also reveals an “evolution” phenomenon: as steps increase, the average output length grows, indicating that the LLM learns to think and reason more deeply.

However, the RL‑only model suffers from reduced readability and occasional mixed‑language outputs, suggesting that a modest amount of SFT is still beneficial.

DeepSeek‑R1: Adding a Cold‑Start SFT Stage

Building on the findings of DeepSeek‑R1‑Zero, DeepSeek‑R1 introduces four stages to further enhance the model:

1. Small‑Scale Data Cold‑Start

A few thousand high‑quality CoT samples are collected using few‑shot prompting of DeepSeek‑R1‑Zero, followed by human post‑processing. This modest SFT improves readability and, when combined with RL, boosts reasoning ability.

2. RL on Reasoning Scenarios

RL is applied to math, code, and logical reasoning tasks using the same reward scheme as DeepSeek‑R1‑Zero. An additional language‑consistency reward measures the proportion of the target language in the output, addressing mixed‑language issues.
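
One plausible way to compute such a language‑consistency reward (a guess at the mechanics; the report only describes it as the proportion of target‑language words in the CoT) is to measure the fraction of characters that belong to the target script:

```python
def language_consistency_reward(text: str, target: str = "zh") -> float:
    """Fraction of alphabetic characters in the target language.
    'zh' counts CJK ideographs, anything else falls back to ASCII letters;
    the real reward likely operates on words/tokens rather than characters."""
    def is_target(ch: str) -> bool:
        if target == "zh":
            return "\u4e00" <= ch <= "\u9fff"     # CJK Unified Ideographs
        return ch.isascii() and ch.isalpha()      # crude 'English' check
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_target(ch) for ch in letters) / len(letters)

print(language_consistency_reward("思考 therefore 答案是 42", target="zh"))
```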

3. Rejection Sampling and SFT

Two data streams are prepared:

Reasoning data: generated via rejection sampling from the previous stage, plus samples that cannot be scored by rule‑based rewards (these are judged by an LLM‑as‑judge instead). CoT samples with mixed language, long paragraphs, or code blocks are filtered out (see the filtering sketch after this list), yielding ~600k samples.

Non‑reasoning data: 200k samples generated by DeepSeek‑V3 and its SFT data.

These 800k samples undergo two epochs of SFT on DeepSeek‑V3‑Base.
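
A hedged sketch of the rejection‑sampling filter described above (the thresholds and concrete checks are assumptions; the report only names the categories it removes):

```python
import re

def keep_sample(output: str, reference_answer: str, max_chars: int = 8000) -> bool:
    """Rejection-sampling filter (sketch): keep a generated CoT sample only if
    it is correct, well formatted, not mixed-language, code-free, and not too
    long. max_chars and the 95% language threshold are assumptions."""
    has_think = re.search(r"<think>.*?</think>", output, re.DOTALL) is not None
    answer = output.split("</think>")[-1].strip()
    if not has_think or answer != reference_answer.strip():
        return False                            # badly formatted or wrong answer
    if "```" in output:
        return False                            # drop samples containing code blocks
    if len(output) > max_chars:
        return False                            # drop overly long outputs
    letters = [ch for ch in output if ch.isalpha()]
    zh = sum("\u4e00" <= ch <= "\u9fff" for ch in letters)
    ratio = zh / len(letters) if letters else 0.0
    return ratio > 0.95 or ratio < 0.05         # mostly one language, not mixed
```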

4. Unified RL for All Scenarios

A final RL pass balances reasoning and general capabilities. Different prompts and rewards are used for reasoning versus general data, reusing DeepSeek‑V3’s reward model for the latter. The combined reward signal and diverse data distribution preserve strong reasoning ability while improving helpfulness and safety.
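
As a rough picture of the mixed reward signal in this final stage (the dispatch below is an assumption, and preference_reward_model is a hypothetical stand‑in for DeepSeek‑V3’s reward model, which is not public): reasoning prompts keep the rule‑based rewards, while general prompts are scored by a learned preference model.

```python
def preference_reward_model(prompt: str, output: str) -> float:
    """Placeholder for DeepSeek-V3's learned reward model (not released)."""
    raise NotImplementedError("stand-in for a trained preference reward model")

def unified_reward(sample: dict) -> float:
    """Stage-4 reward dispatch (sketch): rule-based rewards for reasoning
    prompts, a learned preference reward for general chat prompts."""
    if sample["domain"] in {"math", "code", "logic"}:
        answer = sample["output"].split("</think>")[-1].strip()
        accuracy = 1.0 if answer == sample["reference"].strip() else 0.0
        fmt = 1.0 if "<think>" in sample["output"] else 0.0
        return accuracy + fmt
    return preference_reward_model(sample["prompt"], sample["output"])
```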

Experimental results show that DeepSeek‑R1 matches or exceeds OpenAI‑o1‑1217 across benchmarks.

Distilling Smaller Models

Using the 800k samples from the “Rejection Sampling and SFT” stage, smaller models are fine‑tuned directly (SFT only, without an RL phase) and achieve surprisingly good performance.
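
Note that “distillation” here means plain sequence‑level SFT on the teacher‑generated samples, not logit matching. A minimal sketch of one such SFT step with Hugging Face transformers (the student checkpoint and data handling are illustrative assumptions, and a real pipeline would mask prompt tokens and use a chat template):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical student checkpoint; R1 is distilled into smaller dense models.
model_name = "Qwen/Qwen2.5-7B"  # assumption, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sft_step(prompt: str, teacher_response: str) -> float:
    """One SFT step: next-token cross-entropy on the R1-generated response,
    i.e. sequence-level distillation (no teacher logits involved)."""
    text = prompt + teacher_response + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    labels = batch["input_ids"].clone()   # simplification: prompt tokens not masked
    out = model(**batch, labels=labels)   # HF computes the shifted CE loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```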

Discussion

Distillation vs. RL

Distillation proves cheaper and more effective: a small model trained with its own SFT + large‑scale RL still lags behind the same small model distilled from a stronger teacher (i.e., SFT on the teacher’s outputs).

Unsuccessful Attempts

PRM (process reward model): difficulty defining fine‑grained reasoning steps, annotation that is hard to scale, and reward‑hacking issues.

MCTS (Monte Carlo tree search): a large search space that leads to local optima, and the difficulty of training a sufficiently fine‑grained value model.

Future Directions

General Capability: DeepSeek‑R1 still lags behind DeepSeek‑V3; future work will explore leveraging long CoT for broader tasks.

Language Mixing: Current optimizations focus on Chinese and English; handling other languages remains a challenge.

Prompt Sensitivity: The model is highly sensitive to prompts; few‑shot examples often degrade performance, so zero‑shot prompting with an explicit output format is recommended (see the example after this list).

Software‑Engineering Tasks: Large‑scale RL is inefficient for these tasks; future versions will explore rejection sampling or asynchronous evaluation to improve efficiency.
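
Following the prompt‑sensitivity note above, here is an illustrative zero‑shot prompt with an explicit output format (the wording is my own, not taken from the report):

```python
# Zero-shot prompt: state the problem directly and spell out the expected
# output format, rather than prepending few-shot examples.
question = "If 3x + 7 = 22, what is x?"
prompt = (
    f"{question}\n"
    "Please reason step by step, and put your final answer within \\boxed{}."
)
print(prompt)
```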

References

DeepSeek‑AI. DeepSeek‑R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf