Build DeepSeek‑R1 from Scratch: Complete Training Process with Code Walkthrough
This article provides a step‑by‑step, code‑first guide to reproducing DeepSeek‑R1 from the ground up, covering model selection, dataset preparation, custom reward functions, GRPO reinforcement‑learning training, supervised fine‑tuning, reasoning‑oriented RL, rejection sampling, and model distillation.
The tutorial starts by stating that DeepSeek‑R1 is trained on top of the DeepSeek‑V3 base model using reinforcement learning (RL). For reproducibility, the author replaces the original 685 GB base with the lightweight Qwen/Qwen2.5-0.5B‑Instruct model (or the larger 7B variant if GPU memory permits).
Environment setup
git clone https://github.com/FareedKhan-dev/train-deepseek-r1.git
cd train-deepseek-r1
pip install -r requirements.txt
Essential libraries are then imported, including torch, transformers, datasets, and the TRL (Transformers Reinforcement Learning) package.
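The excerpt doesn't reproduce the import block; a representative one covering the code below might look like this (the math_verify and latex2sympy2_extended entry points follow open-r1 conventions and are an assumption here):
import os
import re
import math

import torch
from datasets import load_dataset
# Assumed parser/verifier imports, following open-r1 conventions
from latex2sympy2_extended import NormalizationConfig
from math_verify import LatexExtractionConfig, parse, verify
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import GRPOConfig, GRPOTrainer, SFTTrainer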
Dataset preparation
Two open‑source math datasets are used: AI‑MO/NuminaMath‑TIR (≈70 K problems with chain‑of‑thought annotations) and bespokelabs/Bespoke‑Stratos‑17k (≈17 K math/code problems). Example loading code:
# Load NuminaMath‑TIR
MATH_le = load_dataset("AI-MO/NuminaMath-TIR", "default")
print(MATH_le['train'][0])
# Load Bespoke‑Stratos‑17k
bespoke_rl = load_dataset("bespokelabs/Bespoke-Stratos-17k", "default")
print(bespoke_rl['train'][0])
A helper make_conversation converts each example into a list of messages with a system prompt, the user's problem, and an optional answer; the dataset is then validated for the required fields.
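The SYSTEM_PROMPT constant used below isn't shown in the excerpt; a plausible definition, matching the <think>/<answer> structure the format reward later enforces, is:
# Assumed system prompt enforcing the <think>/<answer> output template
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The Assistant first thinks "
    "through the reasoning process, then provides the answer. The reasoning "
    "is enclosed in <think>...</think> and the final answer in <answer>...</answer>."
)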
def make_conversation(example):
    return {"prompt": [{"role": "system", "content": SYSTEM_PROMPT},
                       {"role": "user", "content": example["problem"]}]}

def validate_dataset(dataset):
    for split in ["train", "test"]:
        fields = dataset[split].column_names
        assert "problem" in fields and "prompt" in fields
        sample = dataset[split][0]
        msgs = sample["prompt"]
        assert msgs[0]["role"] == "system" and msgs[1]["role"] == "user"
Model inspection
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
OUTPUT_DIR = "data/Qwen-GRPO-training"
os.makedirs(OUTPUT_DIR, exist_ok=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True, padding_side="right")
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
print(f"Vocabulary size: {len(tokenizer)}")
print(f"Model max length: {tokenizer.model_max_length}")
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True, torch_dtype=torch.bfloat16)
print(f"Model parameters: {model.num_parameters():,}")A quick inference test confirms the model runs correctly.
# Assumed device setup (not shown in the excerpt)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def test_model_inference(user_input):
    messages = [{"role": "system", "content": "You are Qwen, a helpful assistant."},
                {"role": "user", "content": user_input}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(test_model_inference("how are you?"))
Reward functions for GRPO
Five reward components are implemented:
Accuracy – checks mathematical equivalence using latex2sympy2 and math_verify.
Format – enforces the <think>…</think><answer>…</answer> structure via regex.
Reasoning steps – counts step‑indicators like "Step 1:" or bullet points.
Cosine scaling – adjusts the accuracy reward based on output length.
Repetition penalty – penalises repeated n‑grams.
# Accuracy reward
def accuracy_reward(completions, solution, **kwargs):
    rewards = []
    for content, sol in zip([c[0]["content"] for c in completions], solution):
        gold = parse(sol, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
        if gold:
            ans = parse(content, extraction_config=[LatexExtractionConfig(normalization_config=NormalizationConfig(...))], extraction_mode="first_match")
            rewards.append(float(verify(ans, gold)))
        else:
            rewards.append(0.5)  # neutral reward when the gold solution cannot be parsed
    return rewards
# Format reward
def format_reward(completions, **kwargs):
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return [1.0 if re.match(pattern, c[0]["content"], re.DOTALL | re.MULTILINE) else 0.0 for c in completions]
# Reasoning steps reward
def reasoning_steps_reward(completions, **kwargs):
    # Matches "Step 1:", numbered lists, bullet points, and transition words
    pattern = r"(Step \d+:|^\d+\.|\n-|\n\*|First,|Second,|Next,|Finally,)"
    counts = [len(re.findall(pattern, c[0]["content"], re.MULTILINE)) for c in completions]
    # Full reward once at least three step indicators appear
    return [min(1.0, cnt / 3) for cnt in counts]
# Cosine scaled reward factory
def get_cosine_scaled_reward(min_value_wrong=-0.5, max_value_wrong=-0.1,
                             min_value_correct=0.8, max_value_correct=1.0, max_len=1000):
    def cosine_scaled_reward(completions, solution, accuracy_rewards, **kwargs):
        rewards = []
        for content, acc in zip([c[0]["content"] for c in completions], accuracy_rewards):
            gen_len = len(content)
            progress = gen_len / max_len
            cosine = math.cos(progress * math.pi)
            if acc > 0.5:
                min_v, max_v = min_value_correct, max_value_correct
            else:
                # Swapped on purpose: longer wrong answers are penalised less harshly
                min_v, max_v = max_value_wrong, min_value_wrong
            reward = min_v + 0.5 * (max_v - min_v) * (1.0 + cosine)
            rewards.append(float(reward))
        return rewards
    return cosine_scaled_reward
# Repetition penalty reward factory
def get_repetition_penalty_reward(ngram_size=3, max_penalty=-0.1):
    def repetition_penalty_reward(completions, **kwargs):
        rewards = []
        for comp in [c[0]["content"] for c in completions]:
            if not comp or len(comp.split()) < ngram_size:
                rewards.append(0.0)
                continue
            ngrams = set()
            total = 0
            for ng in zip(*[comp.lower().split()[i:] for i in range(ngram_size)]):
                ngrams.add(ng)
                total += 1
            # Higher fraction of duplicate n-grams -> stronger penalty
            scaling = 1 - len(ngrams) / total
            rewards.append(scaling * max_penalty)
        return rewards
    return repetition_penalty_reward
GRPO training configuration
# Training arguments (shared with SFT later)
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=5e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    logging_steps=10,
    evaluation_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    save_total_limit=2,
    dataloader_num_workers=2,
    seed=42,
    bf16=True,
    push_to_hub=False,
    gradient_checkpointing=True,
    report_to="none",
)
# Assemble reward functions according to script arguments
script_args = GRPOScriptArguments()
reward_functions = get_reward_functions(script_args)
# GRPO trainer initialization
grpo_config = GRPOConfig(**training_args.to_dict())
grpo_trainer = GRPOTrainer(
    model=model,
    reward_funcs=reward_functions,
    args=grpo_config,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    callbacks=[LoggingCallback()],
)

# Start training
train_result = grpo_trainer.train()
print(train_result)
Training logs show the loss and learning rate at each logging step.
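GRPOScriptArguments, get_reward_functions, and LoggingCallback are repository helpers not reproduced above; a minimal sketch consistent with the reward functions defined earlier might be:
from dataclasses import dataclass, field
from transformers import TrainerCallback

@dataclass
class GRPOScriptArguments:
    # Names of the reward components to combine during GRPO training
    reward_funcs: list = field(default_factory=lambda: ["accuracy", "format"])

def get_reward_functions(script_args):
    registry = {
        "accuracy": accuracy_reward,
        "format": format_reward,
        "reasoning_steps": reasoning_steps_reward,
        "cosine": get_cosine_scaled_reward(),
        "repetition_penalty": get_repetition_penalty_reward(),
    }
    return [registry[name] for name in script_args.reward_funcs]

class LoggingCallback(TrainerCallback):
    # Print metrics at every logging step
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs:
            print(f"Step {state.global_step}: {logs}")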
R1‑Zero issues
The author notes two main problems with the original R1-Zero model: (1) the <think> block is often hard for humans to read, and (2) when a prompt mixes languages, the model's output mixes them as well.
Cold‑start data creation
To address the issues, the tutorial demonstrates two prompting strategies:
Few‑shot Long Chain‑of‑Thought prompting – a few solved examples are supplied with detailed reasoning, using a custom delimiter <|special_token|> to separate steps.
Direct prompting – the user explicitly asks the model to solve, show reasoning, and verify the answer.
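Both strategies call a generate_response helper that isn't shown in the excerpt; a minimal sketch reusing the model and tokenizer loaded earlier might be:
def generate_response(prompt_text):
    # Assumed helper: sample a completion from the already-loaded model
    inputs = tokenizer(prompt_text, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
    # Return only the newly generated tokens
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)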
# Few‑shot example
few_shot_prompt = """
Problem: What's the square root of 9 plus 5?
Solution: <|special_token|> First, find the square root of 9, which is 3. Then, add 5 to 3. 3 + 5 equals 8. <|special_token|> Summary: The answer is 8.
Problem: Train travels at 60 mph for 2 hours, how far?
Solution: <|special_token|> Distance = Speed * Time = 60 * 2 = 120 miles. <|special_token|> Summary: 120 miles.
Problem: What is 2 + 3 * 4?
Solution:
"""
model_response_few_shot = generate_response(few_shot_prompt)  # the prompt already ends with the target problem
print(model_response_few_shot)
# Direct prompt
direct_prompt = """Problem: Solve this, show reasoning step‑by‑step, and verify:
What is 2 + 3 * 4?"""
model_response_direct = generate_response(direct_prompt)
print(model_response_direct)
Both approaches produce structured outputs with reasoning steps and a final summary.
Post‑processing refinement
A simple refine_output function demonstrates how a messy R1‑Zero output can be cleaned into the desired <|special_token|> format.
def refine_output(messy_text):
    # Extract the raw reasoning and answer from the <think>/<answer> tags
    think = messy_text.split("<think>")[1].split("</think>")[0].strip()
    answer = messy_text.split("<answer>")[1].split("</answer>")[0].strip()
    return (f"<|special_token|> Reasoning: {think.capitalize()}.\n"
            f"<|special_token|> Summary: The answer is {answer}.")

messy = "<think> ummm... multiply 3 and 4... get 12... then add 2...</think>\n<answer> 14 </answer>"
print(refine_output(messy))
Supervised Fine-Tuning (SFT)
Using the same Bespoke‑Stratos‑17k dataset, the author fine‑tunes the base model to produce clean reasoning outputs.
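The dataset_sft variable passed to the trainer isn't constructed in the excerpt; presumably it is the Bespoke-Stratos-17k train split loaded earlier:
# Assumed: SFT reuses the Bespoke-Stratos-17k train split loaded earlier
dataset_sft = load_dataset("bespokelabs/Bespoke-Stratos-17k", "default")["train"]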
# SFT training arguments (lower learning rate; packing and max_seq_length are
# configured on the SFTTrainer below, since TrainingArguments does not accept them)
training_args_sft = TrainingArguments(
    output_dir="data/Qwen-SFT-training",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    logging_steps=10,
    evaluation_strategy="no",
    save_strategy="steps",
    save_steps=50,
    save_total_limit=2,
    dataloader_num_workers=2,
    seed=42,
    bf16=True,
    push_to_hub=False,
    gradient_checkpointing=True,
    report_to="none",
)
model_sft = AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True, torch_dtype=torch.bfloat16)
sft_trainer = SFTTrainer(
    model=model_sft,
    train_dataset=dataset_sft,
    tokenizer=tokenizer,
    args=training_args_sft,
    dataset_text_field="conversations",
    packing=True,
    max_seq_length=4096,
)
sft_train_result = sft_trainer.train()
print(sft_train_result)
After SFT, the model (now called R1) generates clearer, step-by-step solutions without language mixing.
Reasoning‑oriented RL
The next RL stage re‑introduces GRPO but adds a language‑consistency reward to ensure the output language matches the query language, further improving the model’s reasoning quality.
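The article doesn't show the language-consistency reward itself; as an illustration only, a crude heuristic version might compare the character profile of the completion against the prompt:
def language_consistency_reward(completions, prompts, **kwargs):
    # Illustrative heuristic (not from the article): reward completions whose
    # share of ASCII characters matches the prompt's, as a cheap proxy for
    # "answer in the same language as the query".
    def ascii_ratio(text):
        return sum(ch.isascii() for ch in text) / max(len(text), 1)
    return [1.0 - abs(ascii_ratio(c[0]["content"]) - ascii_ratio(p))
            for c, p in zip(completions, prompts)]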
Rejection sampling
High‑quality reasoning examples are filtered using a rejection‑sampling pipeline: many candidates are generated, evaluated by a reward model and optionally by human annotators, and only the best are kept for a second round of SFT.
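As a sketch of that pipeline, under the assumption that the reward functions above stand in for a full reward model:
def rejection_sample(problem, n_candidates=16, keep_top=2):
    # Generate many candidates, score them, keep only the best for a second SFT round
    candidates = [generate_response(problem) for _ in range(n_candidates)]
    completions = [[{"content": c}] for c in candidates]
    scores = format_reward(completions)  # stand-in for reward-model / human scoring
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [cand for _, cand in ranked[:keep_top]]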
Distillation
Finally, the large R1 model serves as a teacher for knowledge distillation into a smaller student model, preserving most of the reasoning ability while reducing inference cost.
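The article doesn't include distillation code; the core objective is the standard soft-label knowledge-distillation loss, sketched here:
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between temperature-softened teacher and student token
    # distributions; the T^2 factor keeps the gradient scale comparable
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2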
The article ends with a reminder that the code is for academic learning only and provides the original source link.