How DeepSeek R1 Replicates OpenAI o1 Using Large‑Scale Reinforcement Learning

The article provides an in‑depth technical analysis of DeepSeek R1, explaining how it reproduces OpenAI o1's reasoning abilities through rule‑based large‑scale reinforcement learning, mixed SFT data, and efficient scaling, while discussing its broader impact on AI model development and capability density trends.

Architects' Tech Alliance

Feb 9, 2025

DeepSeek R1 Overview

DeepSeek R1 is an open‑source large language model whose reported performance on reasoning benchmarks is comparable to that of OpenAI o1. It is built by applying large‑scale reinforcement learning (RL) on top of the DeepSeek V3 base model.

Training Pipeline

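Cold start: collect a small set of long chain‑of‑thought examples that embed step‑by‑step reasoning and use them to fine‑tune the DeepSeek V3 base model. (DeepSeek‑R1‑Zero is a separate checkpoint: it is obtained by applying RL directly to the base model, with no SFT stage.)

Apply large‑scale RL with rule‑based rewards to the fine‑tuned model. DeepSeek R1 uses Group Relative Policy Optimization (GRPO) rather than PPO; GRPO estimates its baseline from a group of sampled outputs, so it needs no separate value model. The rewards are computed by rules, such as checking the final answer against a reference or checking that the output follows the required reasoning format, so tasks with verifiable answers need no learned reward model.

Alternate between RL and SFT: reasoning samples generated by the RL checkpoint are filtered by rejection sampling, mixed with conventional SFT data, and used to fine‑tune the base model again, followed by a final RL stage. This improves cross‑task reasoning generalisation.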
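To make the idea of a rule‑based reward concrete, here is a minimal sketch in Python. It assumes completions wrap their reasoning in `<think>...</think>` tags followed by a final answer, and that correctness can be checked by exact string match against a reference. The function names, the regular expression and the `format_weight` value are illustrative assumptions, not DeepSeek's actual implementation.

```python
import re

# Assumed template: reasoning inside <think>...</think>, then a non-empty final answer.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the think-then-answer template."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the text after </think> exactly equals the reference answer."""
    match = THINK_PATTERN.match(completion.strip())
    if match is None:
        return 0.0
    answer = match.group(2).strip()
    return 1.0 if answer == reference.strip() else 0.0


def rule_based_reward(completion: str, reference: str, format_weight: float = 0.5) -> float:
    """Sum both signals; the weighting is a hypothetical choice for this sketch."""
    return accuracy_reward(completion, reference) + format_weight * format_reward(completion)
```

Because both checks are deterministic string operations, the reward can be computed for very large batches of samples without running a second neural network. Real tasks would likely need a more tolerant answer comparison (for example, normalising numeric formats) or, for code, execution against test cases.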

Key Technical Contributions

Rule‑driven large‑scale RL: deterministic rules make reward computation cheap and reproducible, which allows RL to be applied to a model the size of DeepSeek V3 (671 billion total parameters, of which about 37 billion are active per token).

Mixed SFT data for reasoning generalisation: injecting detailed reasoning annotations into the SFT set teaches the model to produce interpretable reasoning chains, which are then reinforced by RL, yielding strong performance on unseen tasks.
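One reason rule‑based rewards scale well is that they pair naturally with group‑relative policy optimisation, the approach described in the DeepSeek R1 report: several completions are sampled for the same prompt, and each reward is normalised against its own group instead of against a learned value model. The sketch below shows only that normalisation step. The function name, the use of the population standard deviation and the `eps` guard are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalise each reward by the mean and standard deviation of its group.

    `rewards` holds the rule-based rewards of all completions sampled for
    one prompt. `eps` avoids division by zero when every reward is equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group with rewards `[1.0, 0.0, 0.0, 1.0]`, the correct completions receive a positive advantage and the incorrect ones a negative advantage of the same magnitude. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.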

Capability Density (Densing Law)

Capability density is defined as the ratio of a model’s effective parameter size to its actual parameter size, where the effective parameter size is the number of parameters a reference model would need to reach the same benchmark performance. Empirically, the maximum capability density of open models doubles roughly every 100 days (about 3.3 months), analogous to Moore’s law for chips. The observed trend is attributed to three factors:

Higher data quality through rigorous data governance.

Sparse‑activation architectures (e.g., Mixture‑of‑Experts) that reduce the number of active parameters per inference step.

Advanced learning methods, including scaling predictions and extensive “wind‑tunnel” experiments that optimise data‑to‑parameter ratios before training.
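As a worked illustration of the doubling claim, the trend can be written as an exponential. The symbols are assumptions introduced for this sketch: \rho(t) is the best capability density observed at time t (in days), \rho_0 is its value at a reference date, and T is the doubling period of roughly 100 days quoted above.

```latex
\rho(t) = \rho_0 \cdot 2^{t/T}, \qquad T \approx 100 \text{ days}
\quad\Longrightarrow\quad
\frac{\rho(365)}{\rho_0} = 2^{365/100} \approx 12.6
```

If the trend held exactly for a full year, density would grow by roughly an order of magnitude. The fit is empirical, so this is an extrapolation and not a guarantee.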

Reference:

https://arxiv.org/pdf/2412.04315v2

Architectural Considerations

DeepSeek V3 uses a Mixture‑of‑Experts (MoE) backbone, providing sparse activation. While MoE offers efficiency gains, the authors argue that it is not a guaranteed path to AGI; diverse architectures should continue to be explored.
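To show what sparse activation means in practice, the following sketch implements generic top‑k softmax routing in plain Python: a router scores every expert, but only the k best are evaluated for a given input. This is a textbook simplification and not DeepSeek V3's actual router; the function names, the scalar "experts" and the default `k = 2` are all assumptions for illustration.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of router scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits: list[float], k: int = 2) -> dict[int, float]:
    """Select the k highest-scoring experts and renormalise their gate weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(x: float, experts, router_logits: list[float], k: int = 2) -> float:
    """Evaluate only the selected experts and mix their outputs by gate weight."""
    gates = top_k_route(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())
```

With four experts and `k = 2`, only half of the expert computations run for each input, even though all four experts count towards the total parameter count. That gap between total and active parameters is what makes MoE models cheaper to serve than dense models of the same size.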

Efficiency Implications

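Rule‑based rewards remove the cost of training and serving a separate reward model during RL, and mixed SFT data lets a single model cover both reasoning and general tasks. If capability density keeps doubling every 100 days, the same level of performance can be reached with roughly half the parameters, and correspondingly less inference compute, after each 100‑day interval.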
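The halving arithmetic can be checked with a few lines of Python. The sketch assumes the 100‑day doubling trend holds exactly and that doubling density halves the parameters needed for a fixed level of performance. The function name and the 70‑billion‑parameter starting point in the usage note are hypothetical.

```python
def params_for_same_performance(
    params_now: float, days: float, doubling_days: float = 100.0
) -> float:
    """Parameters needed after `days` if density doubles every `doubling_days`."""
    if doubling_days <= 0:
        raise ValueError("doubling_days must be positive")
    return params_now / (2.0 ** (days / doubling_days))
```

Under these assumptions, a capability that needs 70 billion parameters today would need about 35 billion after 100 days and about 17.5 billion after 200 days. Actual savings depend on whether the empirical trend continues.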

Practical Resources

Model weights, the technical report and usage documentation are released publicly. The repository can be cloned from:

git clone https://github.com/DeepSeek-AI/DeepSeek-R1.git

Release assets and documentation are available at the same URL.
