Why My 0.5B LLM’s Reasoning Collapsed During RLHF on Logic Puzzles
The author experiments with reinforcement‑learning‑from‑human‑feedback on a 0.5B Qwen instruct model using Logic‑RL and Open‑R1, discovers that reward mis‑design and curriculum learning cause the model to produce overly short or incorrect reasoning chains on knight‑and‑knave puzzles, and analyses the underlying causes.
Background and Motivation
After attending ICML, the author rushed into reinforcement learning (RL) and RLHF, studying open‑source reproductions of the R1 model. Two projects were tried: HuggingFace’s Open‑R1 and Logic‑RL. Because only four power‑limited RTX 3090 GPUs were available, a 0.5B Qwen instruct model was used for the experiments.
Experiment Setup
The author trained the model on the KK (knights‑and‑knaves) dataset using Logic‑RL’s reward rules, initially rewarding correct format regardless of answer correctness. This caused the model to quickly shorten its outputs to a few dozen tokens.
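A minimal sketch of what such a format‑only rule reward might look like (the tag pattern and score values here are illustrative, not Logic‑RL’s exact implementation):

```python
import re

# Require a <think>...</think> block followed by an <answer>...</answer>
# block; nothing about the answer's content is checked.
FORMAT_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL
)

def format_only_reward(response: str) -> float:
    """Score well-formed output regardless of answer correctness."""
    return 1.0 if FORMAT_PATTERN.match(response.strip()) else -1.0
```

Because an empty <think></think> block already matches the pattern, the shortest well‑formed completion earns the full reward, which explains the rapid collapse to a few dozen tokens.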
To address this, the reward was changed to grant points only when both format and answer were correct; otherwise the model received the minimum score. However, the model still learned to produce a brief <think> block followed by an answer, effectively skipping genuine reasoning.
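A corresponding sketch of the stricter rule, reusing the imports and FORMAT_PATTERN from the sketch above; the scores and the answer parser are again simplified stand‑ins, not Logic‑RL’s actual implementation:

```python
def parse_answer(response: str) -> dict[str, str]:
    """Pull 'Name is a knight/knave' claims out of the <answer> block
    (a simplified stand-in for the real KK answer parser)."""
    block = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if block is None:
        return {}
    claims = re.findall(r"(\w+) is a (knight|knave)", block.group(1))
    return dict(claims)

def strict_reward(response: str, gold: dict[str, str]) -> float:
    """Top score only when format AND answer are both right; anything
    else falls to the minimum score, so there is no partial credit."""
    if not FORMAT_PATTERN.match(response.strip()):
        return -2.0  # malformed: minimum score
    return 2.0 if parse_answer(response) == gold else -2.0
```

Note that this closes the format loophole but still pays a lucky one‑line guess exactly as much as a worked‑out chain of thought, which is the failure mode observed next.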
Curriculum Learning Attempts
Because the 0.5B model struggled with puzzles involving three or more characters (3ppl and up), the author applied a curriculum: 10 steps on 2ppl data, then 20 steps on 3ppl, followed by 10 steps each on 4ppl and 5ppl, and finally a longer RL run on 6ppl, as sketched below. This staged approach aimed to increase difficulty gradually.
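A hypothetical driver for that schedule; the training hook is a placeholder rather than Logic‑RL’s API, and the final step count is illustrative since the post only says “longer”:

```python
from typing import Callable

# 10 steps on 2ppl, 20 on 3ppl, 10 each on 4ppl and 5ppl,
# then a longer final phase on 6ppl (step count illustrative).
CURRICULUM: list[tuple[str, int]] = [
    ("2ppl", 10),
    ("3ppl", 20),
    ("4ppl", 10),
    ("5ppl", 10),
    ("6ppl", 100),
]

def run_curriculum(train_steps: Callable[[str, int], None]) -> None:
    """Drive an RL trainer through the staged schedule; `train_steps`
    is a hypothetical hook running `n` steps on the named KK split."""
    for split, n in CURRICULUM:
        train_steps(split, n)
```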
Observations included:
- When mixing 3ppl–7ppl data directly, the reward hovered near the minimum and the model produced long, nonsensical outputs full of token‑prediction errors.
- Even after curriculum learning, the model’s reasoning chains became shorter as accuracy improved, converging to a fixed and often incorrect reasoning pattern (see the monitoring sketch after this list).
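A simple way to catch the shrinking‑chain behaviour is to log mean response length next to mean reward every few steps: a length curve that falls while reward rises is the signature of the collapse. A minimal sketch, assuming a HuggingFace‑style tokenizer with an `encode` method:

```python
import statistics

def length_reward_stats(responses: list[str], rewards: list[float],
                        tokenizer) -> dict[str, float]:
    """Per-step stats: falling mean_len with rising mean_reward flags
    the 'short guess' collapse described above."""
    lengths = [len(tokenizer.encode(r)) for r in responses]
    return {
        "mean_len": statistics.mean(lengths),
        "mean_reward": statistics.mean(rewards),
    }
```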
Key Findings
1. Reward design matters: Overly generous rewards for format alone make the model lazy, while strict rewards force it to retain the <think> process.
2. Model size limits: The 0.5B model cannot reliably learn long reasoning chains for harder problems; it prefers short answer‑only strategies that happen to be correct often enough to be reinforced (see the back‑of‑envelope estimate after this list).
3. Curriculum learning helps but does not solve the core issue: The model still converges to a short, sometimes wrong reasoning pattern, indicating that the RL signal does not sufficiently encourage genuine multi‑step reasoning.
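A back‑of‑envelope check makes finding 2 concrete: if each of the n characters is blindly labelled knight or knave and the puzzle has a unique solution (an assumption here), a guess is right with probability 2^(−n), which on easy splits is frequent enough for RL to find and amplify:

```python
# Blind-guess success rate per difficulty, assuming a unique solution.
for n in range(2, 8):
    print(f"{n}ppl: 2**-{n} = {2**-n:.3%}")
```

At 2ppl a blind guess succeeds 25% of the time; by 7ppl it is under 1%, which matches the model scoring near the minimum on the harder splits.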
Discussion of Underlying Causes
The author likens rule‑based reward + RL to a "draw‑and‑filter" process: correct answers are reinforced, incorrect or overly verbose attempts are penalized. When the model guesses correctly without reasoning, that behavior is reinforced, causing the loss of genuine chain‑of‑thought learning.
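One way to see the “draw‑and‑filter” mechanics is a toy REINFORCE‑style loss (a sketch, not Logic‑RL’s actual optimizer): the rule reward weights the log‑probability of every sampled response, so a reasoning‑free guess that happens to score well is pushed up exactly as hard as a genuine chain of thought.

```python
import torch

def toy_reinforce_loss(logprobs: torch.Tensor,
                       rewards: torch.Tensor) -> torch.Tensor:
    """Toy policy-gradient loss over a batch of sampled responses.

    `logprobs`: summed token log-probs per response; `rewards`: the
    rule-based scores. Above-baseline responses are reinforced no
    matter how they produced the answer.
    """
    baseline = rewards.mean()          # crude "filter" threshold
    advantages = rewards - baseline    # lucky guesses land above it
    return -(advantages * logprobs).mean()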
For simple problems, short reasoning or direct answer suffices, so the model keeps that behavior. For harder problems, the model initially attempts longer chains but fails, leading the RL process to discard those attempts in favor of short, higher‑reward guesses.
Consequently, the small model never learns to use long reasoning effectively, whereas larger models have the capacity to retain and reinforce such behavior later in training.
Conclusion
The experiments failed to achieve stable, long‑chain reasoning on the 0.5B model; the primary bottleneck appears to be model capacity. Future work should test larger models with similar curricula to verify whether they can retain and improve multi‑step reasoning under RLHF.
