How X‑R1 Triggers Aha Moments in Low‑Cost RL Training of 0.5B LLMs
The X‑R1 open‑source framework shows that a 0.5B language model can improve its reasoning quickly and exhibit observable "Aha Moments" under reinforcement learning on a modest 4‑GPU setup. This article covers its design, performance metrics, installation steps, and future roadmap.
Introduction
The X‑R1 repository (https://github.com/dhcode-cpp/X-R1) aims to provide an easy‑to‑enter, low‑cost reinforcement‑learning (RL) training framework for scaling post‑training of large language models (LLMs). Inspired by DeepSeek‑R1 and open‑r1, the project reproduces the R1‑Zero "Aha Moment" on a 0.5B pretrained model for under 50 CNY.
Key Features of X‑R1
Runs on 4 × 3090/4090 GPUs, completing training in under 2 hours.
Effective with models as small as 0.5 B; supports scaling to 1.5 B, 7 B, and beyond.
Reduces data to 750 examples while still improving mathematical reasoning.
Focuses solely on pure Reasoning‑RL end‑to‑end training—no further pre‑training, instruction fine‑tuning, or data distillation.
Adds checkpoint sampling for easier observation of RL behavior.
0.5B Training Results
Setup and Execution
Training ran on 4 × 3090/4090 GPUs (24 GB each), with DeepSpeed ZeRO Stage 3 optimization on three GPUs and vLLM inference on the fourth. Three epochs took approximately 1 hour 20 minutes in total.
ACCELERATE_LOG_LEVEL=info \
accelerate launch \
--config_file recipes/zero3.yaml \
--num_processes=3 \
src/x_r1/grpo.py \
--config recipes/X_R1_zero_0dot5B_config.yaml \
> ./output/x_r1_0dot5_sampling.log 2>&1
Accuracy Reward
Experiments with 0.5B and 1.5B models produced the expected learning curves, reaching saturation in fewer than five optimization steps.
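To make the idea of a rule‑based accuracy reward concrete, here is a minimal sketch in Python. It is not taken from the X‑R1 code base: the `<answer>...</answer>` tag format, the `normalize` helper, and the exact‑match comparison are all assumptions chosen for illustration. A real math reward would likely need a more tolerant comparison (for example, parsing fractions or LaTeX).

```python
import re

# Assumption: the model is prompted to wrap its final answer in <answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalize(text: str) -> str:
    """Hypothetical helper: drop spaces and a trailing period before comparing."""
    return text.strip().rstrip(".").replace(" ", "")


def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the last tagged answer equals the reference, else 0.0."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return 0.0  # no parsable answer, so no reward
    predicted = normalize(matches[-1])
    return 1.0 if predicted == normalize(ground_truth) else 0.0


sample = "<think>6 * 7 = 42</think><answer> 42 </answer>"
print(accuracy_reward(sample, "42"))
```

Because the reward is a plain function of the generated text, it needs no learned reward model, which is part of what keeps this style of training cheap.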
Aha Moment Observation
Approximately ten minutes into training, the model exhibited the "Aha Moment" phenomenon, where it recognized a mismatch in its reasoning and corrected its assumptions.
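One rough way to look for such moments in sampled outputs is to scan the text for reflective phrases. The sketch below is only a heuristic under stated assumptions: the marker phrases and the `find_reflections` function are hypothetical and not part of X‑R1, and a keyword match does not prove that the model actually corrected itself.

```python
import re
from typing import Iterable

# Hypothetical marker phrases; adjust them to what your own samples contain.
REFLECTION_RE = re.compile(
    r"\b(?:wait|let me re-?check|i made a mistake|on second thought)\b",
    re.IGNORECASE,
)


def find_reflections(lines: Iterable[str]) -> list[tuple[int, str]]:
    """Return (line_number, text) pairs for lines containing a marker phrase."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if REFLECTION_RE.search(line):
            hits.append((number, line.strip()))
    return hits


demo = [
    "The total is 3 + 4 = 8.",
    "Wait, that does not match the total.",
    "awaiting input",
]
for number, text in find_reflections(demo):
    print(number, text)
```

The same function can be applied to an open file handle, for instance the sampling log written by the launch command, since a text file iterates line by line.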
Findings Across Model Scales
0.5B: occasional Aha Moments.
1.5B: frequent Aha Moments, ~20 % higher performance scores than 0.5B.
7.0B: Aha Moments appear naturally after training on 100 examples, and the model follows the prompt format.
70.0B: work in progress.
The team concludes that even very small models can trigger Aha Moments, and larger models benefit more from rule‑based reward signals, making RL‑driven reasoning easier to elicit. The focus remains on improving answer accuracy rather than investigating self‑reflection in the pretrained model.
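The training script is named `grpo.py`, which suggests Group Relative Policy Optimization (GRPO). The sketch below shows, under that assumption, how a binary rule‑based reward can become a learning signal: rewards for several completions of the same prompt are normalized within the group. The function name is hypothetical, and the use of the population standard deviation and the `eps` value are illustrative choices, not details confirmed by X‑R1.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of sampled completions.

    Completions that beat the group average get a positive advantage,
    the others a negative one. eps avoids division by zero when every
    completion in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, scored 1.0 (correct) or 0.0 (incorrect).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

If every completion in a group is wrong (or every one is right), all advantages are zero and the group contributes no gradient. This is one plausible reason larger models, which answer correctly more often from the start, get more out of the same reward signal.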
Installation Guide
Only a CUDA version greater than 12.4 is required. The original open‑r1 dependencies are simplified using the uv tool, eliminating the need for an 8 × A100 (80 GB) setup.
git clone git@github.com:dhcode-cpp/X-R1.git
cd X-R1
mkdir output
conda create -n xr1 python=3.11
conda activate xr1
pip install -e .
Future Plans
Support LoRA/QLoRA with train‑inference separation.
Release 7 B training configuration and results.
Add more rule‑based rewards.
Support additional base models.
Publish comprehensive benchmark results.
Contact & Acknowledgments
Issues and suggestions can be submitted to the X‑R1 repository or emailed to [email protected]. Thanks to the HuggingFace team for the Open‑R1 and TRL frameworks.
References
DeepSeek‑R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Qwen2.5 Technical Report
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
