Reproduction and Analysis of DeepSeek R1/R1‑zero Reinforcement Learning Experiments
This note surveys four open‑source reproductions of DeepSeek R1/R1‑zero reinforcement‑learning pipelines, re‑implements their training on math and logic datasets using Qwen‑based models, shows that format‑plus‑accuracy rewards improve long‑chain reasoning though stability and scaling remain challenges, and outlines future directions for large‑scale RL and business deployment.
Introduction
Since the release of the DeepSeek R1 technical report, the open‑source community has produced many reproduction works. This note collects several open‑source projects, re‑implements the R1/R1‑zero reinforcement‑learning (RL) pipeline, evaluates the impact of the RL step on model performance, and discusses future prospects for R1 in large‑scale training and business deployment.
1. Open‑source R1 Projects Overview
The main open‑source R1 reproductions are summarized in Table 1. Four projects—SimpleRL, OpenR1, LogicRL, and TinyZero—are selected for further experiments based on their data domains (math, logic) and supported RL frameworks.
2. Experimental Setup
2.1 Training Data
Math datasets
SimpleRL: MATH‑8K (levels 3‑5), 8.5 K samples.
OpenR1: DigitalLearningGmbH/MATH‑lighteval (7.5 K) and AI‑MO/NuminaMath‑TIR (72.4 K) with step‑by‑step solutions generated by GPT‑4o.
TinyZero: Countdown (490 K) – an arithmetic game in which three or four given numbers must be combined with basic operations to reach a target value.
Logic datasets
LogicRL: Knights‑and‑Knaves (≈2 K) – classic truth‑telling puzzles.
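To make the Countdown task concrete, a minimal checker might verify that a proposed equation uses exactly the given numbers and evaluates to the target. This is an illustrative sketch, not TinyZero's actual code; `check_countdown` and its helpers are hypothetical names:

```python
import ast
import operator
import re

# Hedged sketch of a Countdown-style checker (illustrative, not TinyZero's code):
# verify that a proposed arithmetic expression uses exactly the given numbers
# and evaluates to the target value.

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    """Safely evaluate a +,-,*,/ expression tree; reject anything else."""
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    raise ValueError("disallowed expression")

def check_countdown(equation: str, numbers: list[int], target: int) -> bool:
    """True iff `equation` uses exactly the given numbers and hits the target."""
    used = sorted(int(n) for n in re.findall(r"\d+", equation))
    if used != sorted(numbers):
        return False
    try:
        return abs(_eval(ast.parse(equation, mode="eval").body) - target) < 1e-5
    except (ValueError, ZeroDivisionError, SyntaxError):
        return False
```

For example, with numbers [4, 5, 6] and target 50, the equation `(6 + 4) * 5` passes, while `6 + 4 + 5` fails on value and `6 * 5` fails for not using every number.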
2.2 Base Models
To keep the experiments reproducible, the following base models are used:
Qwen2.5‑7B‑Math (Base) – SimpleRL, OpenR1
Qwen2.5‑1.5B‑Instruct – OpenR1
DeepSeek‑R1‑Distill‑Qwen‑7B (Instruct) – OpenR1
Qwen2.5‑3B (Base) – TinyZero
Qwen2.5‑7B (Base) – LogicRL, TinyZero
Qwen2.5‑7B‑Instruct – LogicRL
The models are in the 1.5 B–7 B range, which balances training speed and reasoning capability.
2.3 RL Basic Settings
2.3.1 Reward Function Definitions
SimpleRL
System prompt:
Please reason step by step, and put your final answer within \boxed{}.
Reward snippet (format check):
if "boxed" not in model_output:
    box_match = -1.0
OpenR1
System prompt:
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags.
Reward snippet (format check):
def format_reward(completions, **kwargs):
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE) for content in completion_contents]
    return [1.0 if match else 0.0 for match in matches]
LogicRL
Base system prompt:
The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the final answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags. Now the user asks you to solve a logical reasoning problem. After thinking, when you finally reach a conclusion, clearly state the identity of each character within <answer> </answer> tags.
Instruct system prompt:
You are a helpful assistant. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags.
Reward snippet (answer validation):
answer_score = 0
if format_correct and answer_text:
    pred_status = parse_model_answer(answer_text, expected_names)
    if pred_status:
        if pred_status == gt_status:
            answer_score = 2
            print("  Content validation: FULL MATCH")
        else:
            answer_score = -1.5
            print("  Content validation: MISMATCH")
    else:
        answer_score = -2
        print("Fail to parse answer")
else:
    answer_score = -2
    print("\n[Content Validation] Skipped due to format errors or missing answer")
TinyZero
System prompt snippet:
Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.
Reward snippet (extract answer):
answer_pattern = r'<answer>(.*?)</answer>'
match = re.finditer(answer_pattern, solution_str)
matches = list(match)
if matches:
    final_answer = matches[-1].group(1).strip()
else:
    final_answer = None
return final_answer
2.3.2 Accuracy Reward Definitions
SimpleRL accuracy reward:
if qwen_math_equal_subprocess(prediction=extract_answer, reference=answer):
    box_match = 1.0
else:
    box_match = -0.5
OpenR1 accuracy reward:
# Reward 1 if the content is the same as the ground truth, 0 otherwise
reward = float(verify(answer_parsed, gold_parsed))
LogicRL's accuracy reward is the answer-validation snippet shown above; TinyZero's accuracy reward evaluates the proposed equation:
try:
    result = evaluate_equation(equation)
    if result is None:
        if do_print:
            print(f"Could not evaluate equation")
        return format_score
    if abs(result - target) < 1e-5:
        if do_print:
            print(f"Correct equation: {equation} = {result}")
        return score
    else:
        if do_print:
            print(f"Wrong result: equation = {result}, target = {target}")
        return format_score
2.3.3 Penalty Function (Optional)
Repetition penalty based on n‑gram coverage:
def zipngram(text, ngram_size):
    # Sliding n-gram tuples over whitespace-split tokens.
    words = text.split()
    return zip(*[words[i:] for i in range(ngram_size)])

def repetition_penalty(generation, ngram_size, max_penalty):
    ngrams = set()
    total = 0
    for ng in zipngram(generation, ngram_size):
        ngrams.add(ng)
        total += 1
    scaling = 1 - len(ngrams) / total
    return scaling * max_penalty
2.3.4 Optimization Methods
Most projects support PPO: SimpleRL uses OpenRLHF (PPO), OpenR1 uses TRL (GRPO), and LogicRL and TinyZero use VeRL (GRPO or PPO). Multi-node training is limited to OpenRLHF.
2.3.5 Training Platform
All reproductions run on the TIONE platform. Only OpenRLHF supports multi-node training; the other frameworks run single-node multi-GPU, with typical configurations of 4, 8, or 32 GPUs.
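The GRPO variant used by TRL and VeRL differs from PPO mainly in how advantages are computed: instead of a learned value function, each response's reward is normalised against the other responses sampled for the same prompt. A minimal sketch of this group-relative advantage (a simplification; the full algorithm also includes a KL penalty and clipped policy updates):

```python
def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: normalise each sampled response's reward
    by the mean and standard deviation of its prompt's sample group."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = (sum((r - mean) ** 2 for r in group_rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]
```

For a group of rewards such as [2.0, 1.0, 0.0, 1.0], the best response receives a positive advantage and the worst a negative one, so no critic network is needed; this is what makes GRPO cheaper than PPO in memory, at the cost of the stability issues observed in the experiments below.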
3. Results and Analysis
3.1 SimpleRL
Training on 32 GPUs (160 steps) and a comparison 8-GPU run show similar convergence, with the multi-node run ~3.2× faster. Reward and response-length curves indicate a steady increase in test-set performance, while output length stabilises around 580 tokens (shorter than the original report's ~700). Step-wise results improve across GSM8K, MATH-500, Minerva-Math, OlympiadBench, AIME24, and AMC23, with the average score rising from ~33.7 (base) to ~55.7 at step 100.
3.2 OpenR1
Format reward improves quickly, but DeepSeek‑Distill‑Qwen‑7B‑Instruct struggles with the required output format. Qwen2.5‑7B‑Base matches SimpleRL trends, while GRPO training on the DeepSeek‑Distill model shows fluctuations, suggesting sub‑optimal hyper‑parameters.
3.3 LogicRL (Three Stages)
Stage 1 (3-person puzzles, "3ppl") reduces format errors dramatically, but answer errors stay high (~60-70%). Stage 2 (5-person puzzles, "5ppl") is sensitive to temperature and rollout size; inappropriate settings cause training collapse (output length diverges, metrics drop). Proper hyper-parameter tuning stabilises training, though GRPO still shows larger variance than PPO. Stage 3 adds a long learning-rate annealing phase; both Base and Instruct models converge to similar accuracy, with the Instruct model achieving longer reasoning chains (≈1200 tokens) after "break-then-build". Across stages, both models exhibit "aha moments" where step-by-step reasoning and self-reflection emerge, but the Instruct model tends to produce more concise chains.
3.4 TinyZero
PPO training on Qwen2.5-7B-Base and Qwen2.5-3B-Base shows that the 3 B model needs more steps to reach comparable performance and generates longer CoTs to compensate for its lower reasoning capacity. Output-length variation is driven mainly by CoT length; the answer itself stays around 8-16 tokens. GRPO training is noticeably less stable, with larger swings in response length and higher metric volatility.
4. Summary and Future Work
The reproductions demonstrate that RL fine‑tuning (format + accuracy rewards) can improve long‑chain reasoning on small LLMs, but current experiments remain toy‑scale. Future directions include:
Supporting large‑scale RL frameworks (PPO, GRPO) for multi‑node training.
Automated hyper‑parameter optimisation for stable GRPO.
Balanced RL data mixing across difficulty, domain, and task.
Acquiring high‑quality long‑CoT data for Instruct models.
Designing penalty functions that preserve CoT quality.
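The balanced-data-mixing direction could start from something as simple as weighted sampling over (domain, difficulty) buckets. The sketch below is purely illustrative; the bucket names, weights, and `mix_batch` helper are hypothetical and not drawn from any of the four projects:

```python
import random

# Hypothetical sketch of balanced RL data mixing: draw each training prompt
# from a (domain, difficulty) bucket according to target mixing weights.

def mix_batch(buckets: dict[str, list[str]], weights: dict[str, float],
              batch_size: int, seed: int = 0) -> list[str]:
    """Sample a batch whose bucket proportions follow the target weights."""
    rng = random.Random(seed)
    names = list(buckets)
    probs = [weights[n] for n in names]
    return [rng.choice(buckets[rng.choices(names, weights=probs, k=1)[0]])
            for _ in range(batch_size)]
```

For instance, weights of 0.5 / 0.3 / 0.2 over easy-math, hard-math, and logic buckets would keep hard problems present without letting them dominate early training.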
For business deployment, prompts must explicitly encode domain‑specific rules to avoid conflicts with pre‑training knowledge, and a mixture of general, domain‑specific, and math/code long‑CoT data should be used to solidify the reasoning foundation before RL.
5. References
simpleRL‑reason
open‑r1
Logic‑RL
TinyZero
demystify‑long‑CoT
deepscaleR (Notion)
Tencent Technical Engineering