
Training a 0.5B LLM with Chain‑of‑Thought Reasoning: From Pre‑training to GRPO Fine‑tuning

This article walks through the complete lifecycle of building a small large language model (LLM): token-level inference, pre-training, and post-training steps such as supervised fine-tuning, reward-model creation, and preference-optimization and reinforcement-learning methods such as DPO, PPO, and GRPO. It culminates in a practical 0.5B model fine-tuned for chain-of-thought reasoning.

Architect

Mar 16, 2025


Overview

This article gives a high-level description of a full large language model (LLM) training pipeline, then reproduces DeepSeek-R1's chain-of-thought (CoT) behavior at demo scale with a 0.5B-parameter model.

LLM Training Pipeline

Inference Mechanism

An LLM tokenizes input text, feeds the token sequence to a transformer, predicts the next‑token distribution, selects or samples a token, appends it to the context, and repeats until an end‑of‑sequence token is generated. The token stream is then detokenized back to natural language.
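The loop described above can be sketched in a few lines. This is a toy illustration, not a real transformer: the vocabulary, the `NEXT_TOKEN_PROBS` table, and the helper names are all made up for the example, and the "model" only looks at the last token.

```python
import random

# Toy stand-in for a transformer: maps the last token id to a distribution
# over the next token id. Vocabulary and probabilities are invented.
VOCAB = ["<eos>", "the", "cat", "sat"]
EOS_ID = 0
NEXT_TOKEN_PROBS = {
    1: [0.0, 0.0, 0.9, 0.1],  # after "the": mostly "cat"
    2: [0.1, 0.0, 0.0, 0.9],  # after "cat": mostly "sat"
    3: [0.8, 0.2, 0.0, 0.0],  # after "sat": mostly <eos>
}

def tokenize(text):
    """Whitespace tokenizer over the toy vocabulary."""
    return [VOCAB.index(word) for word in text.split()]

def detokenize(ids):
    """Map ids back to text, dropping the end-of-sequence token."""
    return " ".join(VOCAB[i] for i in ids if i != EOS_ID)

def generate(prompt, max_new_tokens=10, greedy=True, seed=0):
    rng = random.Random(seed)
    ids = tokenize(prompt)
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS[ids[-1]]  # predict the next-token distribution
        if greedy:
            next_id = max(range(len(probs)), key=probs.__getitem__)  # select
        else:
            next_id = rng.choices(range(len(probs)), weights=probs)[0]  # sample
        ids.append(next_id)  # append to the context
        if next_id == EOS_ID:  # stop at end-of-sequence
            break
    return detokenize(ids)

print(generate("the"))  # the cat sat
```

Greedy selection always takes the most likely token, while sampling draws from the distribution and can produce different continuations for the same prompt.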

Pre‑training

Standard next-token prediction on large text corpora (e.g., web crawls, Wikipedia, Baidu Baike) teaches the model to continue text. This is why a raw pre-trained model tends to extend the prompt rather than answer it conversationally.
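As a rough sketch of the pre-training objective, the loss is the average negative log-likelihood of each token given the tokens before it. The function name and the toy probabilities below are assumptions for illustration; a real implementation would compute this from model logits over batches.

```python
import math

def next_token_loss(predicted_probs, token_ids):
    """Mean negative log-likelihood of each token given its prefix.

    predicted_probs[t] is the distribution the model outputs after reading
    token_ids[:t + 1]; the target for that position is token_ids[t + 1].
    """
    targets = token_ids[1:]
    nll = [-math.log(predicted_probs[t][target]) for t, target in enumerate(targets)]
    return sum(nll) / len(nll)

# Toy sequence of three token ids over a 4-token vocabulary.
token_ids = [1, 2, 3]
predicted_probs = [
    [0.1, 0.1, 0.7, 0.1],  # after token 1, the target is token 2
    [0.1, 0.1, 0.1, 0.7],  # after token 2, the target is token 3
]
loss = next_token_loss(predicted_probs, token_ids)
print(round(loss, 4))  # 0.3567
```

The loss drops as the model puts more probability on the token that actually comes next.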

Post‑Training (Fine‑tuning)

Three typical stages follow pre‑training:

Supervised Fine‑Tuning (SFT) – instruction tuning with high‑quality question‑answer pairs to align the model with human intent.

Reward Model (RM) – a model that assigns a scalar score to each output, trained to score “chosen” (good) responses higher than “rejected” (bad) ones.

Reinforcement Learning (RL) – uses the reward model to further improve the policy.

Supervised Fine‑Tuning (SFT)

SFT requires high‑quality data; better data yields behavior that matches the desired instruction. Early work relied on massive human annotation, while later projects often distill data from stronger models (e.g., GPT, DeepSeek) by prompting them to generate question‑answer pairs.
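One common, though not universal, way to turn a question-answer pair into an SFT training example is to compute the loss only on the answer tokens. The sketch below assumes that choice; `build_sft_labels` is a hypothetical helper, and the token ids are placeholders rather than real tokenizer output.

```python
# -100 is the default ignore_index of PyTorch's cross-entropy loss,
# so positions labeled with it contribute nothing to the loss.
IGNORE_INDEX = -100

def build_sft_labels(prompt_ids, answer_ids):
    """Concatenate prompt and answer; mask the prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels

input_ids, labels = build_sft_labels([11, 12, 13], [21, 22])
print(input_ids)  # [11, 12, 13, 21, 22]
print(labels)     # [-100, -100, -100, 21, 22]
```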

Reward Model (RM)

The RM is trained on triples {"prompt":..., "chosen":..., "rejected":...}. Its objective is to assign higher scores to the chosen response. A typical training sample looks like:

{
    "prompt": "Who are you?",
    "chosen": "Hello! I am DeepSeek‑V3, an intelligent assistant developed by the Chinese company DeepSeek.",
    "rejected": "How can I help you?"
}
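A widely used way to express "score the chosen response higher" is a pairwise (Bradley-Terry style) loss on the difference of the two scores. The sketch below assumes that formulation; the article does not state which loss its reward model uses, and the function name is illustrative.

```python
import math

def pairwise_rm_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)), written as log(1 + exp(-margin))."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))

# The loss is small when the chosen response already scores higher...
print(pairwise_rm_loss(2.0, 0.0) < pairwise_rm_loss(0.0, 0.0))  # True
# ...and large when the ranking is wrong.
print(pairwise_rm_loss(0.0, 2.0) > pairwise_rm_loss(0.0, 0.0))  # True
```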

Reinforcement Learning Models

Policy Model (Actor) – predicts the next‑token distribution.

Value Model (Critic) – estimates the expected cumulative future reward given the tokens generated so far.

Reward Model – provides an immediate reward for a generated answer.

Reference Model – a frozen copy of the policy used to compute KL‑divergence constraints.

Direct Preference Optimization (DPO)

DPO removes the explicit reward model and directly optimizes the probability gap between chosen and rejected samples while keeping KL‑divergence to the reference model small.
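For a single preference pair, the DPO loss can be sketched as follows. The inputs are summed log-probabilities of each full response under the policy and under the frozen reference model; the argument names and the `beta=0.1` default are assumptions for the example.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # beta controls how strongly the policy is tied to the reference model.
    logits = beta * (chosen_logratio - rejected_logratio)
    return math.log1p(math.exp(-logits))  # equals -log(sigmoid(logits))

# Policy identical to the reference: the loss is log(2).
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
# Raising the chosen response's log-probability lowers the loss.
print(dpo_loss(-8.0, -12.0, -10.0, -12.0) < math.log(2.0))  # True
```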

Proximal Policy Optimization (PPO)

PPO follows the classic actor‑critic loop: sample actions, compute immediate reward with the RM, compute advantage with the value model, clip policy updates, and repeat for several epochs.
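The "clip policy updates" step refers to PPO's clipped surrogate objective. A minimal per-token sketch, assuming the advantage has already been computed and using an illustrative `clip_eps=0.2`:

```python
import math

def ppo_clipped_objective(new_logp, old_logp, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one action (to be maximized)."""
    # Probability ratio between the updated policy and the policy that sampled the action.
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes the incentive to move the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)

# Ratio 1.5 with a positive advantage is capped at 1.2 * advantage.
print(round(ppo_clipped_objective(math.log(1.5), 0.0, 1.0), 4))   # 1.2
# Ratio 0.5 with a negative advantage is capped at 0.8 * advantage.
print(round(ppo_clipped_objective(math.log(0.5), 0.0, -1.0), 4))  # -0.8
```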

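Group Relative Policy Optimization (GRPO)

GRPO was introduced in the DeepSeekMath paper and is the RL algorithm used to train DeepSeek-R1. It removes the value model to reduce memory usage and instead estimates advantages from a group of samples drawn for the same prompt. The workflow: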

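Generate a group of G samples for each prompt with the policy.

Score each sample with a reward model or a simple rule-based function (for example, a correctness check).

Normalize each reward against the mean and standard deviation of its group to obtain a relative advantage, which takes the place of the value model's estimate.

Compute KL-divergence against a reference model.

Combine the advantage-weighted policy term and the KL penalty into the loss used to update the policy.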
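What lets GRPO drop the value model is that each sample is judged relative to the other samples drawn for the same prompt. The sketch below shows that normalization together with a per-token KL estimator of the form exp(d) - d - 1. The function names, the use of the population standard deviation, and the `eps` value are assumptions for illustration, not the article's exact code.

```python
import math
import statistics

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one group of G samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mean) / (std + eps) for r in rewards]

def kl_estimate(policy_logp, ref_logp):
    """Per-token KL penalty estimate: exp(d) - d - 1 with d = ref_logp - policy_logp."""
    d = ref_logp - policy_logp
    return math.exp(d) - d - 1.0  # zero when the two models agree, positive otherwise

# Four samples for one prompt: two correct (reward 2.0), two wrong (reward 0.0).
advantages = group_relative_advantages([2.0, 0.0, 0.0, 2.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

If every sample in a group receives the same reward, all advantages are zero and that group contributes no policy gradient, which is one reason the reward functions need to separate good samples from bad ones.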

R1 Fine‑tuning Practice

Goal

Fine-tune the open-source Qwen2.5-0.5B-Instruct model so that it emits a chain-of-thought block (<think>…</think>) followed by a concise answer block (<answer>…</answer>).

<think>
reasoning process...
</think>
<answer>
answer
</answer>

Dataset

A small elementary‑math dataset (similar to GSM8K) where each example already contains the desired XML‑like format, providing both reasoning and the final numeric answer.

Training Procedure

Run SFT on the first half of the dataset to teach the model the XML format.

Apply GRPO on the second half, using reward functions that (a) verify correct XML structure, (b) check numeric correctness, and (c) optionally reward well‑formedness.

Reward functions (Python) extract the <answer> field and compare it with the ground‑truth:

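def extract_xml_answer(text: str) -> str:
    # Take the text after the last <answer> tag (the whole text if the tag
    # is missing) and cut it at the next </answer> tag.
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    # Each completion is a list of chat messages; the first one holds the generated text.
    responses = [c[0]['content'] for c in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    # Exact string match against the ground truth: 2.0 if correct, 0.0 otherwise.
    return [2.0 if r == a else 0.0 for r, a in zip(extracted, answer)]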
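A reward for correct XML structure could look like the sketch below. It is an assumption about how such a check might be written, not the repository's actual function: the name `format_reward_func`, the regular expression, and the 0.5 reward value are all illustrative. The completion layout (a list of chat messages per sample) matches the correctness reward above.

```python
import re

# Expect a <think> block followed by an <answer> block, each tag on its own line.
FORMAT_PATTERN = re.compile(
    r"^\s*<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>\s*$",
    re.DOTALL,
)

def format_reward_func(completions, **kwargs) -> list[float]:
    """Return 0.5 for completions that follow the expected layout, else 0.0."""
    responses = [c[0]["content"] for c in completions]
    return [0.5 if FORMAT_PATTERN.match(r) else 0.0 for r in responses]

good = "<think>\n22 / 2 = 11, 22 + 11 = 33\n</think>\n<answer>\n33\n</answer>"
bad = "The answer is 33."
completions = [[{"role": "assistant", "content": good}],
               [{"role": "assistant", "content": bad}]]
print(format_reward_func(completions=completions))  # [0.5, 0.0]
```

Keeping the format reward smaller than the 2.0 correctness reward means a well-formed but wrong answer still scores below a correct one.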

Trainer examples (using the Hugging Face TRL library):

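from trl import GRPOTrainer, SFTTrainer

# Stage 1: SFT on the first half of the data.
# training_args is assumed to be an SFTConfig at this point.
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    args=training_args,
    train_dataset=get_gsm8k_dataset(sft=True, first_half=True),
)
trainer.train()

# Stage 2: GRPO on the second half, scored by the reward functions.
# training_args is assumed to be a GRPOConfig at this point.
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=reward_funcs,
    args=training_args,
    train_dataset=get_gsm8k_dataset(first_half=False),
)
trainer.train()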

Inference Example

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
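generated_ids = model.generate(**model_inputs, max_new_tokens=256)
# Drop the prompt tokens so that only the newly generated text is decoded.
generated_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Assistant:\n{response}")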

Running the model on the question “Natalia sold clips to 22 of her friends …” yields:

<think>
In April, Natalia sold clips to 22 friends.
In May, she sold half as many clips as in April, which is 22/2 = 11 clips.
Altogether, Natalia sold 22+11 = 33 clips in April and May.
</think>
<answer>
33
</answer>

Findings

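SFT alone can produce the CoT format, but answer accuracy may be limited: the SFT loss is a token-level imitation objective that rewards matching the reference text overall and gives no special weight to whether the final answer is correct.

Skipping SFT and training directly with GRPO often fails: the model rarely generates the required XML structure, so the answer cannot be extracted reliably and the rewards carry little training signal.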

For conversational data, SFT is usually sufficient; for tasks requiring exact answers (math, code), combining SFT with reinforcement learning yields the best results.

Resources

https://github.com/QunBB/DeepLearning/tree/main/llms/train/deepseek-train


