Why Sampling Noise, Not the Train‑Inference Gap, Drives RL Instability in MoE Models
The article shows that sampling noise, rather than train‑inference inconsistency, is the primary cause of reward collapse during RL training of MoE models, and demonstrates that suppressing this noise stabilizes training and accelerates convergence.
Technical Focus
The KAT‑Coder‑Pro V1 1210 update includes a study of reinforcement‑learning (RL) training stability for the mixture‑of‑experts (MoE) models used in agentic coding.
Key Finding
Experiments show that the dominant cause of reward collapse is sampling noise, not the commonly assumed train‑inference mismatch. Suppressing the intensity of this noise stabilizes RL training even while train‑inference differences remain, and it also accelerates convergence.
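To make the distinction concrete, here is a toy numerical sketch (not from the post; the offset and noise scale are invented) contrasting a constant train‑inference gap with zero‑mean sampling noise in the per‑token log‑probabilities that feed PPO‑style importance ratios. A systematic gap shifts every ratio by the same small factor, while noise fattens the tails of the ratio distribution, producing the kind of occasional extreme updates associated with reward collapse.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000  # per-token log-probs along one sampled trajectory

# Log-probs under the training policy.
logp_train = rng.normal(-2.0, 0.5, size=T)

# Case A: systematic train-inference gap (constant offset, no extra variance).
logp_infer_gap = logp_train - 0.05

# Case B: zero-mean sampling noise on the inference-side estimates.
logp_infer_noise = logp_train + rng.normal(0.0, 0.3, size=T)

# Per-token importance ratios, as used in PPO-style updates.
ratio_gap = np.exp(logp_train - logp_infer_gap)    # constant ~1.05 everywhere
ratio_noise = np.exp(logp_train - logp_infer_noise)

print(f"gap:   mean={ratio_gap.mean():.3f} std={ratio_gap.std():.3f} max={ratio_gap.max():.3f}")
print(f"noise: mean={ratio_noise.mean():.3f} std={ratio_noise.std():.3f} max={ratio_noise.max():.3f}")
```

On a typical run, both cases have roughly the same mean ratio, but the gap case has zero spread while the noisy case shows a wide spread and occasional ratios near 3, which is the instability the finding attributes to noise rather than mismatch.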
Method
A noise‑suppression technique (referred to as mean_8) reduces the variance introduced by sampling during policy updates. The method is compared against standard industry baselines such as TIS (truncated importance sampling).
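The post does not spell out how mean_8 works; one plausible reading of the name is that each inference‑side probability is estimated as the mean over 8 evaluations before the importance ratio is formed, suppressing noise at its source rather than clipping the already‑noisy ratio afterward as TIS does. The sketch below encodes that reading; `noisy_logp`, `SIGMA`, the cap value, and the averaging scheme are all illustrative assumptions, not the published implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
SIGMA = 0.3  # assumed noise scale per log-prob evaluation (made up)

def noisy_logp(true_logp, n=1):
    """Simulate n noisy evaluations of the inference-side log-prob."""
    return true_logp + rng.normal(0.0, SIGMA, size=n)

def ratio_tis(logp_train, true_logp_infer, cap=2.0):
    # TIS-style baseline: form the ratio from a single noisy evaluation,
    # then truncate it at a cap to bound the damage from outliers.
    noisy = noisy_logp(true_logp_infer, n=1)[0]
    return min(np.exp(logp_train - noisy), cap)

def ratio_mean_k(logp_train, true_logp_infer, k=8):
    # Hypothetical mean_8-style estimator: average k probability
    # estimates before taking the ratio, suppressing noise at the source.
    probs = np.exp(noisy_logp(true_logp_infer, n=k))
    return np.exp(logp_train) / probs.mean()

true_logp = -2.0
tis = np.array([ratio_tis(true_logp, true_logp) for _ in range(5000)])
m8 = np.array([ratio_mean_k(true_logp, true_logp) for _ in range(5000)])
print(f"TIS:    mean={tis.mean():.3f} std={tis.std():.3f}")
print(f"mean_8: mean={m8.mean():.3f} std={m8.std():.3f}")
```

Under this toy model the averaged estimator cuts the ratio's spread by roughly a factor of sqrt(8), while the truncated ratio remains noisy below the cap, consistent with the article's claim that suppressing noise intensity, rather than correcting the mismatch after the fact, is what stabilizes training.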
Results
Reward curves across multiple settings indicate that mean_8 yields higher and smoother rewards than TIS and other baselines.
Tool‑use error rates drop, leading to 89% success on the τ²‑Bench Telecom (Agentic Tool Use) benchmark.
Long‑context reasoning improves to 74% on the AA‑LCR benchmark.
Instruction‑following accuracy reaches 68% on IFBench, surpassing comparable models.
Implications
Reducing sampling noise provides more reliable RL training for MoE models, enhancing tool‑calling stability, code‑generation quality, and overall performance on diverse coding tasks.
Reference
Full technical details are available at:
https://kwaikat.github.io/kwaikat-blog/posts/katcoder_1201/