Kuaishou Tech
Dec 19, 2025 · Artificial Intelligence
Why Sampling Noise, Not the Train‑Inference Gap, Drives RL Instability in MoE Models
The article argues that sampling noise, rather than train‑inference inconsistency, is the primary cause of reward collapse during RL training of MoE models, and shows that suppressing this noise stabilizes training and speeds up convergence.
AI coding · MoE models · RL training
