Unlocking DeepSeek R1: Concepts, Training Secrets, and Real-World Experiments

This article demystifies DeepSeek R1 by explaining key concepts such as online search integration and the R1 model, detailing its two‑phase training pipeline, core techniques like iterative data enhancement, and showcases practical reproductions, benchmark tests, and deployment examples for AI developers.

JD Cloud Developers
JD Cloud Developers
JD Cloud Developers
Unlocking DeepSeek R1: Concepts, Training Secrets, and Real-World Experiments

1. Concept Explanation (Beginner)

We open the DeepSeek website and notice two buttons below the chat box; this article explains their meanings, clarifies which model the popular "DeepSeek" refers to, and introduces the role of online search and the DeepSeek R1 reasoning model.

1. Online Search

Large language models (LLMs) are trained on offline data and therefore have a knowledge lag of six months to a year. Online search supplies up‑to‑date information, acting as an AI search engine that understands natural‑language queries rather than just keywords.

2. Deep Thinking (R1)

deepseek : a generic term for any model in the DeepSeek series. deepseek V3 : the latest base dialogue model (no deep reasoning), 671 B parameters, requires ~1300 GB VRAM for full deployment. deepseek R1 : a reasoning‑focused model praised for strong inference under limited resources; higher accuracy than V3 but slower reasoning. deepseek R1‑zero : an experimental predecessor of R1, used to explore RL‑driven reasoning. DeepSeek‑R1‑Distill‑Qwen‑xxxB : a knowledge‑distilled version trained on 800 k intermediate data from R1, fine‑tuned on Qwen 2.5.

2. Training Principle Analysis

1. Training Process

DeepSeek R1 training follows a two‑stage iterative optimization: generating high‑quality reasoning data and applying RL policies to improve logical inference.

Phase 1 (COT Data Quality Improvement)

Base model : DeepSeek V3 Base (pre‑trained). Training steps : • SFT – supervised fine‑tuning on initial reasoning data (e.g., CoT trajectories). • RL reinforcement – further optimization to produce Model RL‑1 , enhancing the quality of generated reasoning trajectories. Core purpose : Use Model RL‑1 to generate higher‑quality new CoT data, then discard Model RL‑1 and keep only the new data.

Phase 2 (Clean Base Retraining)

Base model reset : Return to the original DeepSeek V3 Base to avoid contaminating the base with low‑quality data. Data mixing : • New CoT data – high‑quality logical reasoning data selected via rejection sampling. • Post‑training data – non‑logic tasks from DeepSeek V3 to prevent catastrophic forgetting. Training flow : • Two epochs of SFT on the 800 k new data. • Two RL stages: 1) Enhance reasoning ability using rule‑based RL similar to R1‑zero. 2) Improve helpfulness and harmlessness following pipelines like DPSK‑v3.

Core Training Tricks

Iterative data augmentation – generate better data with the previous model for the next stage.

Base model reset each iteration – start from a clean base to avoid error accumulation.

Forgetting mitigation – mix logical and non‑logical data to maintain multi‑task balance.

2. Technical Value of DeepSeek R1

R1‑zero demonstrates that strong reasoning can emerge without SFT, using only RL prompts that ask the model to think before answering; repeated RL rounds produce longer responses and self‑reflection. Increasing training steps further lengthens responses and triggers reflective behavior. For small models, knowledge distillation yields larger reasoning gains than RL alone.

3. DeepSeek Reproduction – Practical Projects

1. High‑School Math Test

Dataset: 2024 Gaokao new‑curriculum math paper (19 questions, 150 points). Tested models: • Claude Sonnet 3.5 (direct input) • Claude Sonnet 3.5 + CoT • Claude Sonnet 3.5 + MCTS + CoT (Agent mode) • O1‑preview (direct input) • Qwen2.5‑Math‑72B (direct input) • DeepSeek‑R1 (direct input) • DeepSeek‑R1‑Distill‑Qwen‑32B (direct input)

2. deepscaler

UC Berkeley researchers fine‑tuned DeepSeek‑R1‑Distilled‑Qwen‑1.5B with simple RL to create DeepScaleR‑1.5B‑Preview. On the AIME2024 benchmark, Pass@1 reached 43.1 %, a 14.3 % improvement over the base model and surpassing OpenAI o1‑preview despite only 1.5 B parameters.

3. Logic‑RL

Reproduced by a USTC senior research group on the Logic Puzzle Dataset. After three‑stage rule‑based RL (without long CoT), the model learns to: • Hesitate (mark uncertain steps) • Explore multiple paths • Backtrack previous analysis • Provide staged summaries • Verify the final answer before responding.

4. Open R1

The HuggingFace team released a fully open‑source reproduction of DeepSeek‑R1, filling in previously undisclosed technical details.

4. Local Practice

1. Local Deployment and Product Use

Integrated DeepSeek‑R1‑Distill‑Qwen‑32B‑4bit into the 5starAI RAG assistant using vLLM. Deployment consumes ~1300 GB VRAM and runs at ~50 tokens/s.

RAG scenario built on the LlamaIndex framework; a code snippet illustrating the integration is shown below.

Application screenshots display the model’s reasoning process.

2. Reinforcement Learning Training Practice

Leveraging prior text2SQL experience, we plan to fine‑tune DeepSeek‑R1‑Distill‑Qwen‑1.5B on a text2SQL task using RL, following the deepscaler project workflow. Ongoing experiments aim to reproduce and improve the reported results.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

DeepSeeklarge language modelModel Trainingreinforcement learningknowledge distillation
JD Cloud Developers
Written by

JD Cloud Developers

JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.