How Open‑Source Projects Reproduced DeepSeek‑R1 and Pushed LLM Limits

This article reviews the most notable open‑source reproductions of DeepSeek‑R1 (Open R1, OpenThoughts, LIMO, and DeepScaleR), covering their data pipelines, training steps, reinforcement‑learning strategies, dataset construction, and benchmark results, which together show how small, high‑quality datasets can rival models trained on far more data.

Architect

Soon after its release, DeepSeek‑R1 became a household name and sparked a wave of open‑source reproductions. This article gathers the most notable implementations and analyzes their pipelines, datasets, training methods, and performance.

1. Open R1: Full‑process reproduction on HuggingFace

Open R1, initiated by HuggingFace, aims to fill the missing parts of the R1 pipeline so anyone can reproduce and build on it. The project follows three steps:

Step 1: Distill high‑quality data from DeepSeek‑R1 to create the Bespoke‑Stratos‑17k dataset.

Step 2: Reproduce the pure‑reinforcement‑learning training of R1‑Zero, including inference‑data generation.

Step 3: Reproduce the full R1 training pipeline, including two‑stage SFT and two‑stage RL.

Step 1 – Data creation: Using the Bespoke Curator project, Bespoke‑Stratos‑17k was built in 1.5 hours for $800. It consists of 5k coding problems from APPs and TACO, 10k math problems from NuminaMATH (AIME, MATH, Olympiads), and 1k science and puzzle problems from STILL‑2. The construction pipeline includes:

Generating synthetic data with Bespoke Curator.

Reject‑sampling to filter out incorrect solution trajectories, accelerated with a Ray cluster.

Applying GPT‑4o‑mini to filter out incorrect math solutions, which raised the share of correct solutions retained from 25 % to 73 %.
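The reject‑sampling step in the pipeline above can be sketched as follows. This is a minimal illustration, not the project's code: the `Candidate` record, the `exact_match` verifier, and the `REFERENCE` table are hypothetical stand‑ins, and a real pipeline would plug in code execution or a model‑based judge as the verifier.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Candidate:
    """One teacher-generated solution trajectory (hypothetical record layout)."""
    problem_id: str
    reasoning: str
    final_answer: str


def rejection_sample(
    candidates: Iterable[Candidate],
    is_correct: Callable[[Candidate], bool],
) -> List[Candidate]:
    """Keep only the trajectories the verifier accepts; reject the rest."""
    return [c for c in candidates if is_correct(c)]


# Toy verifier (assumption): exact match against a reference answer table.
REFERENCE = {"p1": "4", "p2": "9"}


def exact_match(c: Candidate) -> bool:
    return REFERENCE.get(c.problem_id) == c.final_answer.strip()


pool = [
    Candidate("p1", "2 + 2 = 4", "4"),
    Candidate("p1", "2 + 2 = 5", "5"),    # wrong answer, rejected
    Candidate("p2", "3 * 3 = 9", " 9 "),  # whitespace is stripped before comparing
]
kept = rejection_sample(pool, exact_match)
print(f"kept {len(kept)} of {len(pool)} trajectories")
```

Because the verifier is passed in as a function, the same filter works whether correctness is checked by unit tests, by symbolic comparison, or by a judge model.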

Training on this data produced the Bespoke‑Stratos‑32B and Bespoke‑Stratos‑7B models. Bespoke‑Stratos‑32B performs very close to DeepSeek‑R1‑Distill‑Qwen‑32B (see Figure 2).

DeepSeek‑R1‑Distill‑Qwen‑32B comparison

2.1 Step 1 – Reproducing DeepSeek‑R1‑Distill

The distilled dataset Bespoke‑Stratos‑17k was used to train Bespoke‑Stratos‑32B and Bespoke‑Stratos‑7B. On benchmark tests, the 32B model comes close to the original DeepSeek‑R1‑Distill‑Qwen‑32B.

2.2 Step 2 – Reproducing DeepSeek‑R1‑Zero

Applying reinforcement learning directly to Qwen2.5‑0.5B reached ~51 % accuracy on GSM8k, a 10‑point gain over the Qwen2.5‑0.5B‑Instruct model. The project is still at an early stage, and no stable version has been released yet.

Reinforcement on Qwen2.5‑0.5B

3. OpenThoughts: UC‑Berkeley reproduction of DeepSeek‑R1‑Distill‑Qwen‑32B

UC‑Berkeley, Stanford and other institutions released OpenThinker‑32B, whose performance rivals DeepSeek‑R1‑Distill‑Qwen‑32B while using only 114k training examples (about one‑eighth of the data used for the original).

The key ideas are expanding data volume, strict verification of reasoning chains, and scaling model size. The project also open‑sourced code, data, and evaluation scripts.

OpenThinker‑32B evaluation results

4. LIMO: Less‑Is‑More for Reasoning

LIMO demonstrates that a tiny, carefully curated dataset of 817 high‑quality samples can outperform models trained on far larger datasets on competition‑level math benchmarks. With those 817 examples, LIMO reached 57.1 % accuracy on AIME‑24, and its authors report a 40.5 % absolute gain over models trained on 100× more data.

The hypothesis is that large language models already contain latent reasoning ability; the challenge is to awaken it with minimal, high‑quality examples.

Dataset construction:

Collected candidate problems from NuminaMath‑CoT (AIME, MATH, etc.).

Weak‑model filtering with Qwen2.5‑Math‑7B‑Instruct to discard trivially solved questions.

Strong‑model filtering with DeepSeek‑R1‑Distill‑Qwen‑32B to keep only hard‑to‑solve items.

Diversity sampling to balance difficulty, generalization, and knowledge coverage.

The final 817‑question set spans a wide range of math domains and includes multiple correct solutions per problem.
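The two filtering stages described above can be sketched roughly as follows. This is an illustration under stated assumptions, not LIMO's actual code: the solver callables, the number of attempts, and both thresholds (`weak_max_rate`, `strong_max_rate`) are hypothetical, and the stub solvers stand in for real model calls.

```python
from typing import Callable, Dict, List

Solver = Callable[[str], str]   # takes a question, returns a final answer
Problem = Dict[str, str]        # {"question": ..., "answer": ...}


def success_rate(solver: Solver, problem: Problem, attempts: int) -> float:
    """Fraction of attempts on which the solver's answer matches the reference."""
    hits = sum(solver(problem["question"]) == problem["answer"] for _ in range(attempts))
    return hits / attempts


def select_hard_problems(
    problems: List[Problem],
    weak: Solver,
    strong: Solver,
    attempts: int = 4,             # assumption: samples drawn per problem
    weak_max_rate: float = 0.0,    # assumption: drop anything the weak model ever solves
    strong_max_rate: float = 0.5,  # assumption: keep items the strong model often misses
) -> List[Problem]:
    selected = []
    for problem in problems:
        # Stage 1: discard questions the weak model solves.
        if success_rate(weak, problem, attempts) > weak_max_rate:
            continue
        # Stage 2: keep only questions that remain hard for the strong model.
        if success_rate(strong, problem, attempts) <= strong_max_rate:
            selected.append(problem)
    return selected


# Deterministic stubs standing in for the weak and strong models.
def weak_stub(question: str) -> str:
    return {"easy": "1"}.get(question, "?")


def strong_stub(question: str) -> str:
    return {"easy": "1", "medium": "2"}.get(question, "?")


problems = [
    {"question": "easy", "answer": "1"},
    {"question": "medium", "answer": "2"},
    {"question": "hard", "answer": "3"},
]
hard_set = select_hard_problems(problems, weak_stub, strong_stub)
print([p["question"] for p in hard_set])
```

The diversity‑sampling stage is left out here; it would run over `hard_set` to balance topics and difficulty before the final selection.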

Answer generation combined official solutions, human‑expert edits, and model‑generated chains from DeepSeek‑R1, DeepSeek‑R1‑Distill‑Qwen‑32B, and Qwen2.5‑32B‑Instruct. Quality criteria emphasized clear structure, progressive understanding, and strict verification at each reasoning step.

LIMO vs. DeepSeek‑R1 vs. Qwen2.5‑32B response comparison

5. DeepScaleR: Perfect reproduction of DeepSeek‑R1 RL effect

A UC‑Berkeley team fine‑tuned DeepSeek‑R1‑Distill‑Qwen‑1.5B with simple reinforcement learning (RL) for about 4,500 USD of compute (3,800 A100 GPU‑hours). The resulting DeepScaleR‑1.5B‑Preview outperformed OpenAI’s o1‑preview on several competition‑level math benchmarks.

Training strategy – short‑to‑long context:

Start with 8 K context RL training to balance efficiency and reasoning depth.

After ~1,000 steps, switch to 16 K context to avoid response truncation.

Finally extend to 24 K context; over the full run, AIME pass@1 accuracy rises from ~28 % to >43 %.
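A staged schedule like the one above can be expressed as a simple step‑to‑cap lookup. The sketch below is illustrative only: the switch to 16 K near step 1,000 comes from the text, while the step at which 24 K begins (`1_500` here) is a placeholder assumption, and `max_context_for_step` is a hypothetical helper name.

```python
from bisect import bisect_right

# (first_step, max_context_tokens) for each stage.
# The 8K -> 16K switch near step 1,000 follows the text; the step at which the
# 24K stage begins is NOT given there and is a placeholder assumption.
CONTEXT_SCHEDULE = [(0, 8 * 1024), (1_000, 16 * 1024), (1_500, 24 * 1024)]


def max_context_for_step(step: int) -> int:
    """Return the context cap (in tokens) in force at a given RL step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    starts = [start for start, _ in CONTEXT_SCHEDULE]
    stage = bisect_right(starts, step) - 1  # index of the last stage already started
    return CONTEXT_SCHEDULE[stage][1]


for step in (0, 999, 1_000, 2_000):
    print(step, max_context_for_step(step))
```

In practice the switch points would more likely be chosen by watching training metrics than fixed in advance; a static table just keeps the idea easy to read.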

Performance boost with 817 samples

The reward function follows DeepSeek‑R1’s outcome‑reward design: it returns 1 if the answer passes LaTeX syntax and Sympy verification, and 0 otherwise.
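A binary outcome reward of this kind can be sketched as below. To keep the example dependency‑free, it does not call Sympy: it extracts the last `\boxed{...}` answer and compares simple numbers and `\frac{a}{b}` forms with Python's `fractions` module, which is a much narrower check than real LaTeX/Sympy verification. All function names are hypothetical.

```python
import re
from fractions import Fraction
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # unbalanced braces: treat the answer as malformed


def to_number(expr: str) -> Optional[Fraction]:
    """Parse plain numbers and \\frac{a}{b}; return None for anything else."""
    expr = expr.replace(" ", "")
    m = re.fullmatch(r"\\d?frac\{(-?\d+)\}\{(-?\d+)\}", expr)
    if m:
        den = int(m.group(2))
        return Fraction(int(m.group(1)), den) if den != 0 else None
    try:
        return Fraction(expr)
    except (ValueError, ZeroDivisionError):
        return None


def outcome_reward(response: str, reference: str) -> int:
    """1 if the boxed answer matches the reference, else 0 (no partial credit)."""
    predicted = extract_boxed(response)
    if predicted is None:
        return 0
    p, r = to_number(predicted), to_number(reference)
    if p is not None and r is not None:
        return int(p == r)
    # Fall back to a string comparison for non-numeric answers.
    return int(predicted.strip() == reference.strip())


print(outcome_reward("So the answer is \\boxed{\\frac{1}{2}}.", "0.5"))
print(outcome_reward("The answer is 3.", "3"))
```

The second call returns 0 even though the text contains the right number, because an unboxed answer counts as a formatting failure in this sketch.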

Iterative length extension proved crucial: training first at 8 K kept response length around 3 k tokens and doubled training speed; later stages at 16 K and 24 K further improved accuracy while keeping token waste low.
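One way to reason about when to extend the context is to track how many responses hit the cap and how much of the budget goes unused. The sketch below is an assumption about how such monitoring could look, not the project's tooling; `truncation_stats` and the sample lengths are hypothetical.

```python
from typing import Dict, Sequence


def truncation_stats(response_lengths: Sequence[int], max_context: int) -> Dict[str, float]:
    """Summarize how a batch of response lengths relates to the context cap."""
    n = len(response_lengths)
    if n == 0:
        raise ValueError("need at least one response")
    # A response that reaches the cap is assumed to have been cut off.
    clipped = sum(length >= max_context for length in response_lengths)
    mean_len = sum(min(length, max_context) for length in response_lengths) / n
    return {
        "clip_ratio": clipped / n,                    # share of truncated responses
        "mean_length": mean_len,                      # average tokens actually generated
        "unused_budget": 1 - mean_len / max_context,  # share of the cap left unused
    }


stats = truncation_stats([2000, 4000, 8192, 8192], max_context=8 * 1024)
print(stats)
```

A rising `clip_ratio` suggests the cap is cutting off reasoning and a longer context may help, while a large `unused_budget` suggests a longer context would mostly add cost.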

Training reward and response length over steps

6. Key Findings

Reinforcement learning can benefit small models when combined with high‑quality distilled SFT data.

Iterative context‑length expansion (8 K → 16 K → 24 K) yields more efficient RL training than starting with the longest context.

High‑quality, minimal datasets (e.g., 817 samples) can unlock reasoning potential comparable to models trained on orders of magnitude more data.

Strict verification pipelines (code execution, GPT‑4o‑mini filtering) dramatically improve data quality.

References

[1] Bespoke‑Stratos‑17k: https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k
[2] Bespoke‑Stratos‑32B: https://huggingface.co/bespokelabs/Bespoke-Stratos-32B
[3] Bespoke‑Stratos‑7B: https://huggingface.co/bespokelabs/Bespoke-Stratos-7B
[4] BAAI/TACO: https://huggingface.co/datasets/BAAI/TACO
[5] codeparrot/apps: https://huggingface.co/datasets/codeparrot/apps
[6] deepmind/code_contests: https://huggingface.co/datasets/deepmind/code_contests
[7] MatrixStudio/Codeforces-Python-Submissions: https://huggingface.co/datasets/MatrixStudio/Codeforces-Python-Submissions
[8] AI-MO/NuminaMath-CoT: https://huggingface.co/datasets/AI-MO/NuminaMath-CoT
[9] camel-ai/chemistry: https://huggingface.co/datasets/camel-ai/chemistry
[10] camel-ai/biology: https://huggingface.co/datasets/camel-ai/biology
[11] camel-ai/physics: https://huggingface.co/datasets/camel-ai/physics
[12] INK-USC/riddle_sense: https://huggingface.co/datasets/INK-USC/riddle_sense