Machine Learning Algorithms & Natural Language Processing
Mar 21, 2026 · Artificial Intelligence
Unsupervised RL for Large Models: How Far Can It Scale? Tsinghua’s Systematic Study
The paper analyzes unsupervised reinforcement learning for large language models, finding that intrinsic-reward methods initially boost performance but inevitably collapse due to confidence–correctness misalignment. It proposes a model-collapse-step metric to predict RL suitability and argues that external, verification-based rewards are the scalable path forward.
Large language models · external verification reward · intrinsic reward
