How X‑R1 Triggers Aha Moments in Low‑Cost RL Training of 0.5B LLMs

The open‑source X‑R1 framework shows that a 0.5B language model can make rapid reasoning gains and exhibit observable "Aha Moments" when trained with reinforcement learning on a modest 4‑GPU setup. This article covers its design, performance metrics, installation steps, and future roadmap.

Baobao Algorithm Notes

Feb 12, 2025

Introduction

The X‑R1 repository (https://github.com/dhcode-cpp/X-R1) provides a beginner‑friendly, low‑cost reinforcement‑learning (RL) training framework for scaling post‑training of large language models (LLMs). Inspired by DeepSeek‑R1 and open‑r1, the project reproduces the R1‑Zero "Aha Moment" on a 0.5B pretrained model for under 50 CNY.

Key Features of X‑R1

Runs on 4 × 3090/4090 GPUs, completing training in under 2 hours.

Effective with models as small as 0.5B; supports scaling to 1.5B, 7B, and beyond.

Reduces data to 750 examples while still improving mathematical reasoning.

Focuses solely on end‑to‑end Reasoning‑RL training, with no further pre‑training, instruction fine‑tuning, or data distillation.

Adds checkpoint sampling for easier observation of RL behavior.

0.5B Training Results

Setup and Execution

Training ran on a 4 × 3090/4090 (24 GB) setup: three GPUs handled training with DeepSpeed ZeRO Stage 3, and the fourth served vLLM inference. Three epochs took approximately 1 hour 20 minutes in total.

ACCELERATE_LOG_LEVEL=info \
accelerate launch \
--config_file recipes/zero3.yaml \
--num_processes=3 \
src/x_r1/grpo.py \
--config recipes/X_R1_zero_0dot5B_config.yaml \
> ./output/x_r1_0dot5_sampling.log 2>&1
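
The script name points to GRPO (Group Relative Policy Optimization), the RL algorithm used in DeepSeek‑R1. As a rough illustration of its central idea, and not a copy of X‑R1's code, the sketch below scores each sampled completion relative to the other completions drawn for the same prompt. The function name and the eps constant are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    completion is judged against its siblings rather than a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; real implementations may differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt: two correct (1.0) and two wrong (0.0).
# Correct answers get an advantage near +1, wrong ones near -1.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason the reward design matters.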

Accuracy Reward

Experiments with 0.5B and 1.5B models produced the expected learning curves, reaching saturation in fewer than five optimization steps.
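
To make the idea of an accuracy reward concrete, here is a minimal sketch of a rule‑based check. It assumes R1‑style output where the final answer sits inside <answer> tags, and it compares strings exactly. Both choices are simplifications for illustration: the tag name, the function name, and the exact‑match rule are assumptions, and a real math reward would normally test mathematical equivalence instead.

```python
import re

# Assumed output convention: the final answer is wrapped in <answer>...</answer>.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the tagged answer equals the reference string, else 0.0."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0  # no parsable answer counts as wrong
    predicted = match.group(1).strip()
    return 1.0 if predicted == ground_truth.strip() else 0.0


print(accuracy_reward("<think>2 + 2 = 4</think><answer> 4 </answer>", "4"))
```

Because the reward is computed by a rule rather than a learned reward model, it is cheap to evaluate and hard for the policy to exploit.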

Aha Moment Observation

Approximately ten minutes into training, the model exhibited the "Aha Moment" phenomenon, where it recognized a mismatch in its reasoning and corrected its assumptions.
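
One simple way to look for such moments in sampled completions is to scan for self‑correction phrases. The sketch below is a hypothetical heuristic, not part of X‑R1: the marker list and function names are invented for this example, and a phrase match only flags a candidate that still needs to be read by a person.

```python
import re

# Hypothetical marker phrases; tune these against your own samples.
MARKERS = ("wait", "let me re-check", "i made a mistake", "on second thought")
PATTERN = re.compile(
    "|".join(r"\b" + re.escape(m) + r"\b" for m in MARKERS),
    re.IGNORECASE,
)


def find_reflection_markers(completion: str) -> list[str]:
    """Return every marker phrase found in the completion, lowercased."""
    return [m.group(0).lower() for m in PATTERN.finditer(completion)]


def count_aha_candidates(completions: list[str]) -> int:
    """Count completions that contain at least one marker phrase."""
    return sum(1 for c in completions if find_reflection_markers(c))


samples = [
    "The total is 10. Wait, the sum should be 12. Let me re-check.",
    "The waiter brought 3 plates, so the answer is 3.",
]
print(count_aha_candidates(samples))
```

The word boundaries keep "waiter" from matching "wait", but the heuristic will still produce false positives on ordinary uses of these phrases.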

Findings Across Model Scales

0.5B: occasional Aha Moments.

1.5B: frequent Aha Moments, with performance scores about 20% higher than 0.5B.

7B: Aha Moments appear naturally after training on 100 examples, and the model follows the prompt format.

70B: work in progress.

The team concludes that even very small models can trigger Aha Moments, and larger models benefit more from rule‑based reward signals, making RL‑driven reasoning easier to elicit. The focus remains on improving answer accuracy rather than investigating self‑reflection in the pretrained model.
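
A second common rule‑based signal in R1‑style training is a format reward, which checks whether a completion follows the prompted layout. The sketch below assumes a <think>...</think> block followed by an <answer>...</answer> block. The tag names and the function name are assumptions for illustration, and X‑R1's actual format check may differ.

```python
import re

# Assumed layout: reasoning in <think> tags, then the answer in <answer> tags.
FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)


def format_reward(completion: str) -> float:
    """Return 1.0 if the whole completion matches the assumed layout, else 0.0."""
    return 1.0 if FORMAT_RE.fullmatch(completion.strip()) else 0.0


print(format_reward("<think>3 * 4 = 12</think>\n<answer>12</answer>"))
print(format_reward("The answer is 12."))
```

A format reward of this kind says nothing about correctness, so it is usually combined with an accuracy signal rather than used alone.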

Installation Guide

Only a CUDA version greater than 12.4 is required. The original open‑r1 dependencies are simplified using the uv tool, eliminating the need for an 8 × A100 (80 GB) setup.

git clone git@github.com:dhcode-cpp/X-R1.git
cd X-R1
mkdir output
conda create -n xr1 python=3.11
conda activate xr1
pip install -e .
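
After installation, a quick check of the CUDA version can save a failed training launch. The sketch below follows the requirement as literally stated (strictly greater than 12.4); the helper names are assumptions for this example, and if 12.4 itself works in your environment, change the comparison to >=. Note that torch.version.cuda reports the CUDA version PyTorch was built against, which is not necessarily the driver's version.

```python
def parse_version(text: str) -> tuple[int, int]:
    """Turn '12.4' or '12.4.1' into (12, 4) so versions compare numerically."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def cuda_version_ok(found: str, minimum: str = "12.4") -> bool:
    """True if the found version is strictly greater than the minimum."""
    return parse_version(found) > parse_version(minimum)


if __name__ == "__main__":
    try:
        import torch  # only available once the environment is installed

        found = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        found = None
    if found is None:
        print("No CUDA-enabled PyTorch build detected")
    else:
        status = "ok" if cuda_version_ok(found) else "too old"
        print(f"CUDA {found}: {status}")
```

Comparing integer tuples rather than strings matters here: as strings, "12.10" would sort before "12.4".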

Future Plans

Support LoRA/QLoRA with train‑inference separation.

Release 7B training configuration and results.

Add more rule‑based rewards.

Support additional base models.

Publish comprehensive benchmark results.

Contact & Acknowledgments

Issues and suggestions can be submitted to the X‑R1 repository or emailed to [email protected]. Thanks to the HuggingFace team for the Open‑R1 and TRL frameworks.

References

DeepSeek‑R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Qwen2.5 Technical Report
