Reproduction and Analysis of DeepSeek R1/R1‑zero Reinforcement Learning Experiments
This note surveys four open‑source reproductions of DeepSeek R1/R1‑zero reinforcement‑learning pipelines, re‑implements their training on math and logic datasets using Qwen‑based models, shows that format‑plus‑accuracy rewards improve long‑chain reasoning though stability and scaling remain challenges, and outlines future directions for large‑scale RL and business deployment.
Introduction
Since the release of the DeepSeek R1 technical report, the open‑source community has produced many reproduction works. This note collects several open‑source projects, re‑implements the R1/R1‑zero reinforcement‑learning (RL) pipeline, evaluates the impact of the RL step on model performance, and discusses future prospects for R1 in large‑scale training and business deployment.
1. Open‑source R1 Projects Overview
The main open‑source R1 reproductions are summarized in Table 1. Four projects—SimpleRL, OpenR1, LogicRL, and TinyZero—are selected for further experiments based on their data domains (math, logic) and supported RL frameworks.
2. Experimental Setup
2.1 Training Data
Math datasets
SimpleRL: MATH‑8K (levels 3‑5), 8.5 K samples.
OpenR1: DigitalLearningGmbH/MATH‑lighteval (7.5 K) and AI‑MO/NuminaMath‑TIR (72.4 K) with step‑by‑step solutions generated by GPT‑4o.
TinyZero: Countdown (490 K) – an arithmetic game in which three or four given numbers must be combined with basic operations to reach a target value.
Logic datasets
LogicRL: Knights‑and‑Knaves (≈2 K) – classic truth‑telling puzzles.
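To make the Countdown task concrete, a minimal checker might verify that a proposed equation uses exactly the given numbers and evaluates to the target. This is an illustrative sketch, not TinyZero's actual code; `check_countdown` and its helpers are hypothetical names:

```python
import ast
import operator
import re

# Hedged sketch of a Countdown-style checker (illustrative, not TinyZero's code):
# verify that a proposed arithmetic expression uses exactly the given numbers
# and evaluates to the target value.

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    """Safely evaluate a +,-,*,/ expression tree; reject anything else."""
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    raise ValueError("disallowed expression")

def check_countdown(equation: str, numbers: list[int], target: int) -> bool:
    """True iff `equation` uses exactly the given numbers and hits the target."""
    used = sorted(int(n) for n in re.findall(r"\d+", equation))
    if used != sorted(numbers):
        return False
    try:
        return abs(_eval(ast.parse(equation, mode="eval").body) - target) < 1e-5
    except (ValueError, ZeroDivisionError, SyntaxError):
        return False
```

For example, with numbers [4, 5, 6] and target 50, the equation `(6 + 4) * 5` passes, while `6 + 4 + 5` fails on value and `6 * 5` fails for not using every number.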
2.2 Base Models
To keep the experiments reproducible, the following base models are used:
Qwen2.5‑7B‑Math (Base) – SimpleRL, OpenR1
Qwen2.5‑1.5B‑Instruct – OpenR1
DeepSeek‑R1‑Distill‑Qwen‑7B (Instruct) – OpenR1
Qwen2.5‑3B (Base) – TinyZero
Qwen2.5‑7B (Base) – LogicRL, TinyZero
Qwen2.5‑7B‑Instruct – LogicRL
The models are in the 1.5 B–7 B range, which balances training speed and reasoning capability.
2.3 RL Basic Settings
2.3.1 Reward Function Definitions
SimpleRL
System prompt:
Please reason step by step, and put your final answer within \boxed{}.
Reward snippet (format check):
if "boxed" not in model_output:
    box_match = -1.0
OpenR1
System prompt:
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags.
Reward snippet (format check):
def format_reward(completions, **kwargs):
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE) for content in completion_contents]
    return [1.0 if match else 0.0 for match in matches]
LogicRL
Base system prompt:
The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the final answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags. Now the user asks you to solve a logical reasoning problem. After thinking, when you finally reach a conclusion, clearly state the identity of each character within <answer> </answer> tags.
Instruct system prompt:
You are a helpful assistant. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags.
Reward snippet (answer validation):
answer_score = 0
if format_correct and answer_text:
    pred_status = parse_model_answer(answer_text, expected_names)
    if pred_status:
        if pred_status == gt_status:
            answer_score = 2
            print("  Content validation: FULL MATCH")
        else:
            answer_score = -1.5
            print("  Content validation: MISMATCH")
    else:
        answer_score = -2
        print("Fail to parse answer")
else:
    answer_score = -2
    print("\n[Content Validation] Skipped due to format errors or missing answer")
TinyZero
System prompt snippet:
Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.
Reward snippet (extract answer):
answer_pattern = r'<answer>(.*?)</answer>'
match = re.finditer(answer_pattern, solution_str)
matches = list(match)
if matches:
    final_answer = matches[-1].group(1).strip()
else:
    final_answer = None
return final_answer
2.3.2 Accuracy Reward Definitions
SimpleRL accuracy reward:
if qwen_math_equal_subprocess(prediction=extract_answer, reference=answer):
    box_match = 1.0
else:
    box_match = -0.5
OpenR1 accuracy reward:
# Reward 1 if the content is the same as the ground truth, 0 otherwise
reward = float(verify(answer_parsed, gold_parsed))
LogicRL's accuracy reward is the answer-validation snippet shown above; TinyZero's accuracy reward evaluates the proposed equation:
try:
    result = evaluate_equation(equation)
    if result is None:
        if do_print:
            print(f"Could not evaluate equation")
        return format_score
    if abs(result - target) < 1e-5:
        if do_print:
            print(f"Correct equation: {equation} = {result}")
        return score
    else:
        if do_print:
            print(f"Wrong result: equation = {result}, target = {target}")
        return format_score
2.3.3 Penalty Function (Optional)
Repetition penalty based on n‑gram coverage:
def zipngram(text, ngram_size):
    # Sliding n-gram tuples over whitespace-split tokens.
    words = text.split()
    return zip(*[words[i:] for i in range(ngram_size)])

def repetition_penalty(generation, ngram_size, max_penalty):
    ngrams = set()
    total = 0
    for ng in zipngram(generation, ngram_size):
        ngrams.add(ng)
        total += 1
    scaling = 1 - len(ngrams) / total
    return scaling * max_penalty
2.3.4 Optimization Methods
Most projects support PPO: SimpleRL uses OpenRLHF (PPO), OpenR1 uses TRL (GRPO), and LogicRL and TinyZero use VeRL (GRPO or PPO). Multi-node training is limited to OpenRLHF.
2.3.5 Training Platform
All reproductions run on the TIONE platform. Only OpenRLHF supports multi-node training; the other frameworks run single-node multi-GPU, with typical configurations of 4, 8, or 32 GPUs.
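The GRPO variant used by TRL and VeRL differs from PPO mainly in how advantages are computed: instead of a learned value function, each response's reward is normalised against the other responses sampled for the same prompt. A minimal sketch of this group-relative advantage (a simplification; the full algorithm also includes a KL penalty and clipped policy updates):

```python
def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: normalise each sampled response's reward
    by the mean and standard deviation of its prompt's sample group."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = (sum((r - mean) ** 2 for r in group_rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]
```

For a group of rewards such as [2.0, 1.0, 0.0, 1.0], the best response receives a positive advantage and the worst a negative one, so no critic network is needed; this is what makes GRPO cheaper than PPO in memory, at the cost of the stability issues observed in the experiments below.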
3. Results and Analysis
3.1 SimpleRL
Training on 32 GPUs (160 steps) and a comparison 8-GPU run show similar convergence, with the multi-node run ~3.2× faster. Reward and response-length curves indicate a steady increase in test-set performance, while output length stabilises around 580 tokens (shorter than the original report's ~700). Step-wise results improve across GSM8K, MATH-500, Minerva-Math, OlympiadBench, AIME24, and AMC23, with the average score rising from ~33.7 (base) to ~55.7 at step 100.
3.2 OpenR1
Format reward improves quickly, but DeepSeek‑Distill‑Qwen‑7B‑Instruct struggles with the required output format. Qwen2.5‑7B‑Base matches SimpleRL trends, while GRPO training on the DeepSeek‑Distill model shows fluctuations, suggesting sub‑optimal hyper‑parameters.
3.3 LogicRL (Three Stages)
Stage 1 (3-person puzzles, "3ppl") reduces format errors dramatically, but answer errors stay high (~60-70%). Stage 2 (5-person puzzles, "5ppl") is sensitive to temperature and rollout size; inappropriate settings cause training collapse (output length diverges, metrics drop). Proper hyper-parameter tuning stabilises training, though GRPO still shows larger variance than PPO. Stage 3 adds a long learning-rate annealing phase; both Base and Instruct models converge to similar accuracy, with the Instruct model achieving longer reasoning chains (≈1200 tokens) after "break-then-build". Across stages, both models exhibit "aha moments" where step-by-step reasoning and self-reflection emerge, but the Instruct model tends to produce more concise chains.
3.4 TinyZero
PPO training on Qwen2.5-7B-Base and Qwen2.5-3B-Base shows that the 3 B model needs more steps to reach comparable performance and generates longer CoTs to compensate for its lower reasoning capacity. Output-length variation is driven mainly by CoT length; the answer itself stays around 8-16 tokens. GRPO training is noticeably less stable, with larger swings in response length and higher metric volatility.
4. Summary and Future Work
The reproductions demonstrate that RL fine‑tuning (format + accuracy rewards) can improve long‑chain reasoning on small LLMs, but current experiments remain toy‑scale. Future directions include:
Supporting large‑scale RL frameworks (PPO, GRPO) for multi‑node training.
Automated hyper‑parameter optimisation for stable GRPO.
Balanced RL data mixing across difficulty, domain, and task.
Acquiring high‑quality long‑CoT data for Instruct models.
Designing penalty functions that preserve CoT quality.
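The balanced-data-mixing direction could start from something as simple as weighted sampling over (domain, difficulty) buckets. The sketch below is purely illustrative; the bucket names, weights, and `mix_batch` helper are hypothetical and not drawn from any of the four projects:

```python
import random

# Hypothetical sketch of balanced RL data mixing: draw each training prompt
# from a (domain, difficulty) bucket according to target mixing weights.

def mix_batch(buckets: dict[str, list[str]], weights: dict[str, float],
              batch_size: int, seed: int = 0) -> list[str]:
    """Sample a batch whose bucket proportions follow the target weights."""
    rng = random.Random(seed)
    names = list(buckets)
    probs = [weights[n] for n in names]
    return [rng.choice(buckets[rng.choices(names, weights=probs, k=1)[0]])
            for _ in range(batch_size)]
```

For instance, weights of 0.5 / 0.3 / 0.2 over easy-math, hard-math, and logic buckets would keep hard problems present without letting them dominate early training.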
For business deployment, prompts must explicitly encode domain‑specific rules to avoid conflicts with pre‑training knowledge, and a mixture of general, domain‑specific, and math/code long‑CoT data should be used to solidify the reasoning foundation before RL.
5. References
simpleRL‑reason
open‑r1
Logic‑RL
TinyZero
demystify‑long‑CoT
deepscaleR (Notion)
Tencent Technical Engineering