Training a 0.5B LLM with Chain‑of‑Thought Reasoning: From Pre‑training to GRPO Fine‑tuning

This article walks through the complete lifecycle of building a small LLM: token‑level inference, pre‑training, and post‑training steps such as supervised fine‑tuning, reward‑model training, and preference‑ and reinforcement‑learning methods like DPO, PPO and GRPO. It culminates in a practical 0.5B model fine‑tuned for chain‑of‑thought reasoning.

Architect

Overview

This article gives a high‑level description of a full large‑language‑model (LLM) training pipeline, then reproduces DeepSeek‑R1’s chain‑of‑thought (CoT) behavior at demo scale using a 0.5B‑parameter model.

LLM Training Pipeline

Inference Mechanism

An LLM tokenizes input text, feeds the token sequence to a transformer, predicts the next‑token distribution, selects or samples a token, appends it to the context, and repeats until an end‑of‑sequence token is generated. The token stream is then detokenized back to natural language.
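The loop described above can be sketched in a few lines. This is a toy illustration, not a real model: `VOCAB`, `toy_next_token_probs`, `generate` and `detokenize` are hypothetical names, and the "transformer" is replaced by a hard-coded rule so the example runs with the standard library alone.

```python
import random

# Toy vocabulary; a real tokenizer maps subword strings to integer ids.
VOCAB = ["<eos>", "the", "cat", "sat"]
EOS_ID = 0


def toy_next_token_probs(context: list[int]) -> list[float]:
    """Hypothetical stand-in for a transformer forward pass.

    Returns one probability per vocabulary id. The toy rule favors the id
    that follows the last context id, wrapping around to <eos>.
    """
    favored = (context[-1] + 1) % len(VOCAB)
    return [0.85 if i == favored else 0.05 for i in range(len(VOCAB))]


def generate(prompt_ids: list[int], max_new_tokens: int = 8,
             greedy: bool = True, seed: int = 0) -> list[int]:
    """Autoregressive decoding: predict, pick a token, append, repeat."""
    rng = random.Random(seed)
    context = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(context)
        if greedy:
            # Select the most likely token.
            next_id = max(range(len(probs)), key=probs.__getitem__)
        else:
            # Sample a token in proportion to its probability.
            next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        context.append(next_id)
        if next_id == EOS_ID:  # stop at end-of-sequence
            break
    return context


def detokenize(ids: list[int]) -> str:
    """Map ids back to text."""
    return " ".join(VOCAB[i] for i in ids)
```

With greedy selection, `detokenize(generate([1]))` walks from "the" to the end-of-sequence token; passing `greedy=False` samples from the distribution instead.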

Pre‑training

Standard next‑token prediction on large web‑crawled corpora (e.g., Wikipedia, Baidu Baike) teaches the model to continue text, which is why a raw pre‑trained model extends its input rather than answering it conversationally.
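As a rough illustration of the next‑token objective, the loss is the average negative log‑likelihood of each token given the tokens before it. The function name and the toy numbers below are assumptions made for the example; real training computes the same quantity from transformer logits over large batches.

```python
import math


def next_token_loss(predicted_probs: list[list[float]], token_ids: list[int]) -> float:
    """Average cross-entropy of next-token prediction.

    predicted_probs[t] is the model's distribution over the vocabulary for the
    token at position t + 1, computed from token_ids[0..t].
    """
    targets = token_ids[1:]  # every token except the first is a prediction target
    if len(predicted_probs) != len(targets):
        raise ValueError("need exactly one distribution per target token")
    nll = [-math.log(probs[target]) for probs, target in zip(predicted_probs, targets)]
    return sum(nll) / len(nll)


# Toy sequence over a 3-token vocabulary (illustrative numbers only).
tokens = [0, 2, 1]
probs = [
    [0.1, 0.1, 0.8],    # after token 0, the model puts 0.8 on the true next token 2
    [0.25, 0.5, 0.25],  # after tokens 0, 2 it puts 0.5 on the true next token 1
]
loss = next_token_loss(probs, tokens)
```

The loss shrinks as the model puts more probability on the token that actually comes next, which is all that pre‑training optimizes.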

Post‑Training (Fine‑tuning)

Three typical stages follow pre‑training:

Supervised Fine‑Tuning (SFT) – instruction tuning with high‑quality question‑answer pairs to align the model with human intent.

Reward Model (RM) – a classifier that scores model outputs, assigning higher scores to “chosen” (good) responses than to “rejected” (bad) ones.

Reinforcement Learning (RL) – uses the reward model to further improve the policy.

Figure: Post‑training diagram

Supervised Fine‑Tuning (SFT)

SFT requires high‑quality data: the more closely the data matches the desired behavior, the more reliably the model follows instructions. Early work relied on massive human annotation, while later projects often distill data from stronger models (e.g., GPT, DeepSeek) by prompting them to generate question‑answer pairs.

Reward Model (RM)

The RM is trained on triples {"prompt":..., "chosen":..., "rejected":...}. Its objective is to assign higher scores to the chosen response. A typical training sample looks like:

{
    "prompt": "Who are you?",
    "chosen": "Hello! I am DeepSeek‑V3, an intelligent assistant developed by DeepSeek (深度求索), a Chinese company.",
    "rejected": "How can I help you?"
}

Figure: Reward modeling
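The article does not state which loss the reward model uses. A common choice for chosen/rejected pairs is a pairwise (Bradley‑Terry style) ranking loss, sketched below under that assumption; `pairwise_rm_loss` is a hypothetical name and the scores are plain floats standing in for the model's scalar outputs.

```python
import math


def pairwise_rm_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise ranking loss: -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the chosen response scores well above the rejected
    one and large when the ordering is reversed.
    """
    margin = score_chosen - score_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

For example, `pairwise_rm_loss(2.0, -1.0)` is much smaller than `pairwise_rm_loss(-1.0, 2.0)`, so minimizing it pushes the chosen score above the rejected one.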

Reinforcement Learning Models

Policy Model (Actor) – predicts the next‑token distribution.

Value Model (Critic) – estimates the expected cumulative future reward from the current position in a token sequence.

Reward Model – provides an immediate reward for a generated answer.

Reference Model – a frozen copy of the policy used to compute KL‑divergence constraints.

Direct Preference Optimization (DPO)

DPO removes the explicit reward model and directly optimizes the probability gap between chosen and rejected samples while keeping KL‑divergence to the reference model small.

Figure: DPO loss
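For reference, the standard DPO objective for one preference pair can be written as a small function. This is a sketch under stated assumptions: the inputs are summed sequence log‑probabilities of the chosen and rejected responses under the policy and the frozen reference model, and `beta=0.1` is only an illustrative default, not a value taken from the article.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x))."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single (chosen, rejected) pair.

    Each log-ratio measures how far the policy has moved from the reference on
    one response; the loss rewards moving further toward the chosen response
    than toward the rejected one. beta scales how strongly deviation from the
    reference is penalized.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return neg_log_sigmoid(beta * (chosen_logratio - rejected_logratio))
```

When the policy equals the reference, both log‑ratios are zero and the loss is log 2; it drops below that as the policy raises the chosen response's probability relative to the rejected one.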

Proximal Policy Optimization (PPO)

PPO follows the classic actor‑critic loop: sample actions, compute immediate reward with the RM, compute advantage with the value model, clip policy updates, and repeat for several epochs.

Figure: PPO training flow
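The "clip policy updates" step refers to PPO's clipped surrogate objective. A minimal per‑token sketch follows; the function name is hypothetical and `clip_eps=0.2` is an illustrative value rather than one given in the article.

```python
import math


def ppo_clipped_objective(new_logp: float, old_logp: float,
                          advantage: float, clip_eps: float = 0.2) -> float:
    """Clipped surrogate objective for one action (to be maximized).

    ratio compares the updated policy with the policy that sampled the action.
    Taking the minimum of the clipped and unclipped terms removes any incentive
    to move the ratio outside [1 - clip_eps, 1 + clip_eps].
    """
    ratio = math.exp(new_logp - old_logp)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, doubling an action's probability earns only the clipped gain of 1.2 times the advantage, which is what keeps each update close to the sampling policy.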

Group Relative Policy Optimization (GRPO)

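GRPO, introduced in the DeepSeekMath paper and later used to train DeepSeek‑R1, removes the value model to reduce memory usage. The workflow:

Generate a group of G samples per prompt with the policy.

Score each sample with a reward model or a simple rule‑based function (e.g., a correctness check).

Normalize each reward against the mean and standard deviation of its group to obtain the advantage that the value model would otherwise supply.

Compute KL‑divergence against a reference model.

Combine the advantage‑weighted policy objective and the KL penalty into the loss that updates the policy.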

Figure: GRPO vs PPO
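One detail worth making concrete is how GRPO gets by without a value model: each sample's reward is normalized against the other samples drawn for the same prompt, and that normalized score serves as the advantage. The sketch below assumes population standard deviation and a small `eps` for stability; both are illustrative choices, and the function names are hypothetical. The KL term uses the non‑negative per‑token estimator described in the DeepSeekMath paper.

```python
import math
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize the rewards of one prompt's G samples to zero mean, unit scale."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


def kl_estimate(policy_logp: float, ref_logp: float) -> float:
    """Per-token KL estimate: exp(d) - d - 1 with d = ref_logp - policy_logp.

    It is zero when the two models agree on the token and positive otherwise.
    """
    d = ref_logp - policy_logp
    return math.exp(d) - d - 1.0


# Four samples for one prompt: two correct (reward 2.0), two wrong (reward 0.0).
advantages = group_relative_advantages([2.0, 0.0, 0.0, 2.0])
```

Correct samples receive positive advantages and wrong ones negative, so the policy is pushed toward the better half of each group. If every sample in a group gets the same reward, all advantages are zero and that prompt contributes no gradient signal.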

R1 Fine‑tuning Practice

Goal

Fine‑tune the open‑source Qwen2.5‑0.5B‑Instruct model so that it emits a chain‑of‑thought block (<think>…</think>) followed by a concise answer block (<answer>…</answer>).

<think>
Reasoning process...
</think>
<answer>
Answer
</answer>

Dataset

The dataset is a small elementary‑math collection (similar to GSM8K) in which each example already follows the desired XML‑like format, providing both the reasoning and the final numeric answer.

Figure: Math dataset example

Training Procedure

Run SFT on the first half of the dataset to teach the model the XML format.

Apply GRPO on the second half, using reward functions that (a) verify correct XML structure, (b) check numeric correctness, and (c) optionally reward well‑formedness.

Reward functions (Python) extract the <answer> field and compare it with the ground‑truth:

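def extract_xml_answer(text: str) -> str:
    """Return the text between the last <answer> tag and its closing tag."""
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Score 2.0 when the extracted answer equals the ground truth, else 0.0."""
    # Each completion is a list of chat messages; read the first message's content.
    responses = [c[0]['content'] for c in completions]
    extracted = [extract_xml_answer(r) for r in responses]
    return [2.0 if r == a else 0.0 for r, a in zip(extracted, answer)]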
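The correctness reward above covers the numeric check. A structure check could look like the following sketch, which mirrors the same `completions` layout. The regular expression, the function name `format_reward_func`, and the 0.5 score are assumptions made for illustration; the original reward functions for format are not shown in the article.

```python
import re

# Assumed target layout: a <think> block followed by an <answer> block,
# with nothing but whitespace around and between them.
FORMAT_PATTERN = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>.*?</answer>\s*$",
    re.DOTALL,  # let "." span the newlines inside each block
)


def format_reward_func(completions, **kwargs) -> list[float]:
    """Score 0.5 for completions that follow the think/answer layout, else 0.0."""
    # Each completion is a list of chat messages; read the first message's content.
    responses = [c[0]["content"] for c in completions]
    return [0.5 if FORMAT_PATTERN.match(r) else 0.0 for r in responses]
```

Keeping the format reward smaller than the correctness reward is a design choice: it nudges the model toward the layout without letting well‑formatted wrong answers outscore correct ones.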

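Trainer examples (using Hugging Face TRL):

from trl import GRPOTrainer, SFTTrainer

# Stage 1: SFT on the first half of the data to teach the output format.
# training_args is expected to be an SFT configuration here (e.g., trl.SFTConfig).
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    args=training_args,
    train_dataset=get_gsm8k_dataset(sft=True, first_half=True),
)
trainer.train()

# Stage 2: GRPO on the second half, scored by the reward functions.
# training_args is expected to be a GRPO configuration here (e.g., trl.GRPOConfig).
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=reward_funcs,
    args=training_args,
    train_dataset=get_gsm8k_dataset(first_half=False),
)
trainer.train()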

Inference Example

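messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": prompt}
]
# Render the chat messages into the model's prompt format.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=256)
# Drop the prompt tokens so that only the newly generated completion is decoded.
generated_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Assistant:\n{response}")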

Running the model on the question “Natalia sold clips to 22 of her friends …” yields:

<think>
In April, Natalia sold clips to 22 friends.
In May, she sold half as many clips as in April, which is 22/2 = 11 clips.
Altogether, Natalia sold 22+11 = 33 clips in April and May.
</think>
<answer>
33
</answer>

Findings

SFT alone can produce the CoT format, but answer accuracy may be limited because the token‑level loss weights every token equally and rewards overall fluency rather than a correct final answer.

Skipping SFT and training directly with GRPO often fails to generate the required XML structure, so the answer cannot be extracted reliably and the rewards give the policy little usable signal.

For conversational data, SFT is usually sufficient; for tasks requiring exact answers (math, code), combining SFT with reinforcement learning yields the best results.

Resources

https://github.com/QunBB/DeepLearning/tree/main/llms/train/deepseek-train