Baobao Algorithm Notes
Mar 5, 2025 · Artificial Intelligence
Why My 0.5B LLM’s Reasoning Collapsed During RLHF on Logic Puzzles
The author experiments with reinforcement learning from human feedback on a 0.5B Qwen instruct model using Logic‑RL and Open‑R1, discovers that reward mis‑design and curriculum learning cause the model to produce overly short or incorrect reasoning chains on knights‑and‑knaves puzzles, and analyses the underlying causes.
Artificial Intelligence · Large Language Model · Logic Reasoning
