How Open‑Source Projects Reproduced DeepSeek‑R1 and Pushed LLM Limits
This article reviews the most notable open‑source reproductions of DeepSeek‑R1—including Open R1, OpenThoughts, LIMO and DeepScaleR—detailing their data pipelines, training steps, reinforcement‑learning strategies, dataset constructions, and benchmark results that demonstrate how small, high‑quality data can rival massive‑scale models.
Since its release, DeepSeek‑R1 has drawn nationwide attention and sparked a wave of open‑source reproductions. This article collects the most notable implementations and analyzes their pipelines, datasets, training methods, and performance.
1. Open R1: Full‑process reproduction by Hugging Face
Open R1, initiated by Hugging Face, aims to fill in the missing parts of the R1 pipeline so that anyone can reproduce it and build on it. The project follows three steps:
Step 1: Distill high‑quality data from DeepSeek‑R1 to create the Bespoke‑Stratos‑17k dataset.
Step 2: Reproduce the pure‑reinforcement‑learning training of R1‑Zero, including generation of the reasoning data it needs.
Step 3: Reproduce the full R1 training pipeline, including two‑stage SFT and two‑stage RL.
Step 1 – Data creation: Bespoke‑Stratos‑17k was built with the Bespoke Curator project in 1.5 hours for $800. It consists of 5k coding problems from APPS and TACO, 10k math problems from NuminaMATH (AIME, MATH, Olympiads), and 1k science and puzzle problems from STILL‑2. The construction pipeline includes:
Generating synthetic data with Bespoke Curator.
Rejection sampling to filter out incorrect solution trajectories, accelerated with a Ray cluster.
Using GPT‑4o‑mini to filter out wrong math solutions, which raised the share of solutions retained as correct from 25% to 73%.
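The rejection‑sampling step above can be sketched as a filter over candidate trajectories. This is a minimal illustration, not the Bespoke Curator implementation: the `Trajectory` type, the `REFERENCE` answers, and the exact‑match verifier are all hypothetical stand‑ins for the real checks (code execution for programming problems, an LLM judge for math).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Trajectory:
    problem_id: str
    reasoning: str
    final_answer: str


def rejection_sample(
    trajectories: Iterable[Trajectory],
    is_correct: Callable[[Trajectory], bool],
) -> Tuple[List[Trajectory], float]:
    """Keep only trajectories the verifier accepts; also return the retention rate."""
    kept: List[Trajectory] = []
    total = 0
    for traj in trajectories:
        total += 1
        if is_correct(traj):
            kept.append(traj)
    rate = len(kept) / total if total else 0.0
    return kept, rate


# Hypothetical ground-truth answers; a real pipeline would read them from the dataset.
REFERENCE = {"p1": "42", "p2": "7"}


def exact_match_verifier(traj: Trajectory) -> bool:
    # Stand-in for the real check (unit tests for code, an LLM judge for math).
    return REFERENCE.get(traj.problem_id) == traj.final_answer.strip()


samples = [
    Trajectory("p1", "chain A", "42"),
    Trajectory("p1", "chain B", "41"),
    Trajectory("p2", "chain C", " 7 "),
    Trajectory("p2", "chain D", "8"),
]
kept, rate = rejection_sample(samples, exact_match_verifier)
print(len(kept), rate)  # 2 0.5
```

Passing the verifier in as a callable keeps the filter itself unchanged when a rule‑based parser is replaced by a model‑based judge, which is the kind of swap the 25% to 73% improvement describes.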
Training on this data produced the Bespoke‑Stratos‑32B and Bespoke‑Stratos‑7B models; the 32B model performs very close to DeepSeek‑R1‑Distill‑Qwen‑32B (see Figure 2).
2.1 Step 1 – Reproducing DeepSeek‑R1‑Distill
The distilled dataset Bespoke‑Stratos‑17k enabled training of Bespoke‑Stratos‑32B and Bespoke‑Stratos‑7B. On benchmark tests, the 32B model comes close to the original DeepSeek‑R1‑Distill‑Qwen‑32B.
2.2 Step 2 – Reproducing DeepSeek‑R1‑Zero
Direct reinforcement learning on Qwen2.5‑0.5B achieved ~51% accuracy on GSM8K, a 10‑point gain over the Qwen2.5‑0.5B‑Instruct baseline. The project is still at an early stage, and no stable version has been released yet.
3. Open‑Thoughts: UC‑Berkeley reproduction of DeepSeek‑R1‑Distill‑Qwen‑32B
UC‑Berkeley, Stanford and other institutions released OpenThinker‑32B, whose performance rivals DeepSeek‑R1‑Distill‑Qwen‑32B while using only 114k training examples (roughly one‑eighth of the data used for the original model).
The key ideas are expanding data volume, strict verification of reasoning chains, and scaling model size. The project also open‑sourced code, data, and evaluation scripts.
4. LIMO: Less‑Is‑More for Reasoning
LIMO demonstrates that a tiny, carefully curated dataset can outperform many large‑scale models on competition‑level math benchmarks. Using only 817 high‑quality examples, LIMO achieved 57.1% accuracy on AIME‑24. It also reports a 40.5% absolute improvement averaged across a broader set of benchmarks, outperforming models trained on 100× more data.
The hypothesis is that large language models already contain latent reasoning ability; the challenge is to awaken it with minimal, high‑quality examples.
Dataset construction:
Collected candidate problems from NuminaMath‑CoT (AIME, MATH, etc.).
Weak‑model filtering with Qwen2.5‑Math‑7B‑Instruct to discard trivially solved questions.
Strong‑model filtering with DeepSeek‑R1‑Distill‑Qwen‑32B to keep only hard‑to‑solve items.
Diversity sampling to balance difficulty, generalization, and knowledge coverage.
The final 817‑question set spans a wide range of math domains and includes multiple correct solutions per problem.
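The weak‑model and strong‑model filtering stages can be sketched as two threshold checks on per‑problem pass rates. This is a minimal sketch under stated assumptions: the function name, the pass‑rate dictionaries, and both thresholds (`weak_max`, `strong_max`) are hypothetical and are not taken from the LIMO release.

```python
from typing import Dict, List


def filter_by_difficulty(
    weak_pass_rate: Dict[str, float],
    strong_pass_rate: Dict[str, float],
    weak_max: float = 0.0,
    strong_max: float = 0.25,
) -> List[str]:
    """Return IDs of problems that survive both filtering stages.

    Stage 1 drops problems the weak model solves more often than `weak_max`.
    Stage 2 keeps only problems the strong model solves at most `strong_max`
    of the time. Both thresholds are illustrative assumptions.
    """
    survivors: List[str] = []
    for pid, weak in weak_pass_rate.items():
        if weak > weak_max:
            continue  # too easy: the weak model already solves it
        strong = strong_pass_rate.get(pid)
        if strong is not None and strong <= strong_max:
            survivors.append(pid)  # hard even for the strong model
    return survivors


# Hypothetical pass rates measured over repeated sampling.
weak = {"a": 0.9, "b": 0.0, "c": 0.0}
strong = {"a": 1.0, "b": 0.8, "c": 0.1}
print(filter_by_difficulty(weak, strong))  # ['c']
```

Diversity sampling would then run over the surviving IDs; it is omitted here because the source does not describe how it balances the domains.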
Answer generation combined official solutions, human‑expert edits, and model‑generated chains from DeepSeek‑R1, DeepSeek‑R1‑Distill‑Qwen‑32B, and Qwen2.5‑32B‑Instruct. Quality criteria emphasized clear structure, progressive understanding, and strict verification at each reasoning step.
5. DeepScaleR: Perfect reproduction of DeepSeek‑R1 RL effect
A UC‑Berkeley team fine‑tuned DeepSeek‑R1‑Distill‑Qwen‑1.5B with simple reinforcement learning (RL), using about 4,500 USD of compute (3,800 A100 GPU‑hours). The resulting DeepScaleR‑1.5B‑Preview outperformed OpenAI’s o1‑preview on several competition‑level math benchmarks.
Training strategy – short‑to‑long context:
Start with 8 K context RL training to balance efficiency and reasoning depth.
After ~1,000 steps, switch to 16 K context to avoid response truncation.
Finally extend to 24 K context, achieving a jump in AIME pass@1 accuracy from ~28 % to >43 %.
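The short‑to‑long schedule can be sketched as a lookup from training step to maximum response length, plus a helper that measures how many responses hit the limit. This is only an illustration: the source gives the ~1,000‑step switch to 16K, while the step at which training moves to 24K (`1500` below) and both function names are assumptions.

```python
from typing import List, Tuple

# (start_step, max_response_tokens). The 1,000-step boundary comes from the text;
# the 1,500-step boundary is a hypothetical placeholder.
SCHEDULE: List[Tuple[int, int]] = [(0, 8192), (1000, 16384), (1500, 24576)]


def max_context_for_step(step: int, schedule: List[Tuple[int, int]]) -> int:
    """Return the response-length limit in effect at `step`.

    `schedule` must be sorted by start_step and begin at step 0.
    """
    current = schedule[0][1]
    for start, max_tokens in schedule:
        if step >= start:
            current = max_tokens
    return current


def truncation_rate(response_lengths: List[int], limit: int) -> float:
    """Fraction of responses that reached the limit and were cut off."""
    if not response_lengths:
        return 0.0
    clipped = sum(1 for n in response_lengths if n >= limit)
    return clipped / len(response_lengths)


print(max_context_for_step(999, SCHEDULE))   # 8192
print(max_context_for_step(1000, SCHEDULE))  # 16384
print(truncation_rate([3000, 8192, 8192, 5000], 8192))  # 0.5
```

In practice a rising truncation rate is a natural signal for moving to the next stage, since clipped responses cannot earn the outcome reward.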
The reward function follows DeepSeek‑R1’s outcome‑reward design: it returns 1 if the answer passes the LaTeX and SymPy checks, and 0 if the answer is wrong or badly formatted.
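A binary outcome reward of this kind can be sketched as follows. To stay dependency‑free, this sketch swaps SymPy for Python's `fractions.Fraction`, so it only compares plain numbers and simple `\frac{a}{b}` answers; the function names and the `\boxed{...}` answer convention are assumptions, not the DeepScaleR code.

```python
import re
from fractions import Fraction
from typing import Optional

_FRAC = re.compile(r"\\d?frac\{(-?\d+)\}\{(-?\d+)\}")


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # unbalanced braces: treat as malformed


def to_fraction(expr: str) -> Optional[Fraction]:
    """Parse an integer, decimal, a/b, or \\frac{a}{b}; None if unparseable."""
    expr = expr.strip().replace(" ", "")
    match = _FRAC.fullmatch(expr)
    try:
        if match:
            return Fraction(int(match.group(1)), int(match.group(2)))
        return Fraction(expr)
    except (ValueError, ZeroDivisionError):
        return None


def outcome_reward(response: str, reference: str) -> float:
    """1.0 if the boxed answer parses and equals the reference, else 0.0."""
    answer = extract_boxed(response)
    if answer is None:
        return 0.0  # formatting failure
    predicted, expected = to_fraction(answer), to_fraction(reference)
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if predicted == expected else 0.0


print(outcome_reward("So the answer is \\boxed{\\frac{1}{2}}", "0.5"))  # 1.0
print(outcome_reward("The answer is 3", "3"))  # 0.0
```

Because the reward depends only on the final answer, a response gets no partial credit for correct intermediate steps; a symbolic checker such as SymPy would extend the same structure to algebraic expressions.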
Iterative length extension proved crucial. Training first at 8K kept response length around 3k tokens and doubled training speed; the later stages at 16K and 24K then improved accuracy further while keeping token waste low.
6. Key Findings
Reinforcement learning can benefit small models when combined with high‑quality distilled SFT data.
Iterative context‑length expansion (8 K → 16 K → 24 K) yields more efficient RL training than starting with the longest context.
High‑quality, minimal datasets (e.g., 817 samples) can unlock reasoning potential comparable to models trained on orders of magnitude more data.
Strict verification pipelines (code execution, GPT‑4o‑mini filtering) dramatically improve data quality.
References
[1] Bespoke‑Stratos‑17k: https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k
[2] Bespoke‑Stratos‑32B: https://huggingface.co/bespokelabs/Bespoke-Stratos-32B
[3] Bespoke‑Stratos‑7B: https://huggingface.co/bespokelabs/Bespoke-Stratos-7B
[4] BAAI/TACO: https://huggingface.co/datasets/BAAI/TACO
[5] codeparrot/apps: https://huggingface.co/datasets/codeparrot/apps
[6] deepmind/code_contests: https://huggingface.co/datasets/deepmind/code_contests
[7] MatrixStudio/Codeforces-Python-Submissions: https://huggingface.co/datasets/MatrixStudio/Codeforces-Python-Submissions
[8] AI-MO/NuminaMath-CoT: https://huggingface.co/datasets/AI-MO/NuminaMath-CoT
[9] camel-ai/chemistry: https://huggingface.co/datasets/camel-ai/chemistry
[10] camel-ai/biology: https://huggingface.co/datasets/camel-ai/biology
[11] camel-ai/physics: https://huggingface.co/datasets/camel-ai/physics
[12] INK-USC/riddle_sense: https://huggingface.co/datasets/INK-USC/riddle_sense