Why Sampling Noise, Not Train‑Inference Gap, Drives RL Instability in MoE Models

The article reveals that sampling noise, rather than train‑inference inconsistency, is the primary cause of reward collapse during RL training of MoE models, and demonstrates that suppressing this noise stabilizes training and speeds convergence.

Kuaishou Tech

Technical Focus

The KAT‑Coder‑Pro V1 1210 update includes a study on reinforcement‑learning (RL) training stability for mixture‑of‑experts (MoE) models used in Agentic Coding.

Key Finding

Experiments show that the dominant cause of reward collapse is sampling noise, not the commonly assumed train‑inference mismatch. Suppressing the noise intensity stabilizes RL training even when train‑inference differences remain, and accelerates convergence.

Method

A noise‑suppression technique (referred to as mean_8) reduces the variance of the sampled actions' weighting during policy updates. The method is compared against standard industry baselines such as TIS (truncated importance sampling).
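The update does not spell out how mean_8 works, so the sketch below is a hypothetical illustration of the contrast with TIS, assuming mean_8 averages the train‑inference importance ratio over groups of eight samples while TIS clips each ratio at a fixed cap. The noise model and the names `tis_weights` and `mean8_weights` are illustrative assumptions, not the authors' implementation.

```python
import math
import random
from statistics import pvariance

random.seed(0)

# Model per-token importance ratios pi_train / pi_infer: the log-gap is
# small Gaussian noise, standing in for sampling noise on top of the
# train-inference mismatch (illustrative setup, not the paper's).
ratios = [math.exp(random.gauss(0.0, 0.2)) for _ in range(8000)]

# TIS (truncated importance sampling): cap each ratio at c so that a
# rare, huge ratio cannot dominate a single policy-gradient step.
c = 2.0
tis_weights = [min(r, c) for r in ratios]

# Hypothetical mean-over-8 correction (a guess at what "mean_8" might
# denote): average the ratio over groups of k = 8 samples, cutting the
# weight's variance by roughly a factor of k before it scales the
# advantage in the policy update.
k = 8
mean8_weights = [sum(ratios[i:i + k]) / k for i in range(0, len(ratios), k)]

# Averaging suppresses the noise far more than clipping alone does.
print(round(pvariance(tis_weights), 4), round(pvariance(mean8_weights), 4))
```

Under these assumptions, clipping only bounds the worst case while leaving per‑sample variance largely intact, whereas averaging attacks the variance directly, which is consistent with the article's claim that lowering noise intensity stabilizes training even when the train‑inference mismatch itself remains.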

Results

Reward curves across multiple settings indicate that mean_8 yields higher and smoother rewards than TIS and other baselines.

Tool‑use error rates drop, leading to 89% success on the τ²‑Bench Telecom (Agentic Tool Use) benchmark.

Long‑context reasoning improves to 74% on the AA‑LCR benchmark.

Instruction‑following accuracy reaches 68% on IFBench, surpassing comparable models.

Implications

Reducing sampling noise provides more reliable RL training for MoE models, enhancing tool‑calling stability, code‑generation quality, and overall performance on diverse coding tasks.

Reference

Full technical details are available at:

https://kwaikat.github.io/kwaikat-blog/posts/katcoder_1201/
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

AI coding · model stability · RL training · agentic coding · MoE models · sampling noise
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
