How KAT-Dev-72B-Exp Sets a New Record in Large‑Scale RL for Code Generation
The KAT‑Dev‑72B‑Exp model, an experimental reinforcement‑learning version of KAT‑Coder, scores 74.6% on the SWE‑Bench Verified benchmark, introduces Trie Packing and entropy‑aware advantage scaling, and showcases a decoupled training architecture that dramatically speeds up large‑scale agentic RL training.
Large‑scale reinforcement learning (RL) is a key pathway to unlocking complex reasoning and improving task generalization in large language models (LLMs). The KwaiPilot team recently released KAT‑Dev‑72B‑Exp, an experimental RL‑enhanced version of the KAT‑Coder model, which achieved a record‑breaking 74.6% on the SWE‑Bench Verified software‑development benchmark.
1. Trie Packing
The model is built on the proprietary SeamlessFlow industrial‑grade RL framework, which fully decouples training logic from the agent, supporting multi‑agent and online RL scenarios. To handle complex agent trajectories that form tree‑shaped token sequences, the team introduced a Trie Packing mechanism and re‑engineered the training engine, enabling efficient training on shared prefix trajectories.
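The core idea behind packing shared-prefix trajectories can be illustrated with a small sketch. The class and function names below are hypothetical, not KwaiPilot's actual implementation; the sketch only shows how inserting rollouts into a trie deduplicates the tokens they share:

```python
# Minimal sketch: pack tree-shaped trajectories into a trie so that
# shared prefixes are stored (and, during training, computed) only once.
# All names here are illustrative assumptions, not the SeamlessFlow API.

class TrieNode:
    def __init__(self, token=None):
        self.token = token      # None only at the root
        self.children = {}      # token -> TrieNode
        self.is_leaf = False    # True where a rollout ends

def build_trie(trajectories):
    """Insert token sequences that share prefixes into one trie."""
    root = TrieNode()
    for seq in trajectories:
        node = root
        for tok in seq:
            node = node.children.setdefault(tok, TrieNode(tok))
        node.is_leaf = True
    return root

def count_nodes(node):
    """Count unique token positions stored in the trie (root excluded)."""
    return sum(count_nodes(c) for c in node.children.values()) + (node.token is not None)

# Three rollouts branching from the same shared 3-token prefix:
rollouts = [
    [1, 2, 3, 4, 5],
    [1, 2, 3, 4, 6],
    [1, 2, 3, 7],
]
flat_tokens = sum(len(s) for s in rollouts)       # 14 tokens if flattened
trie_tokens = count_nodes(build_trie(rollouts))   # 7 unique trie nodes
```

Flattening the three rollouts would process 14 tokens, while the trie holds only 7 unique positions; the reported throughput gains come from applying the analogous deduplication to forward and backward passes over shared prefixes.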
In large‑scale LLM agentic training, token trajectories often form a tree structure due to test‑time scaling (TTS) and memory mechanisms. Traditional approaches flatten these trees into independent linear sequences, recomputing shared prefixes. KwaiPilot instead rewrote the training engine and attention kernel to perform tree‑gradient weight correction, merging the repeated backward computations on shared prefixes, which increased training throughput by 2.5× on average.
2. Entropy‑Aware Advantage Scaling
The method also applies entropy‑aware advantage scaling: each rollout’s policy entropy is normalized and used as a scaling factor for its advantage, amplifying high‑uncertainty (exploratory) samples while suppressing low‑entropy ones, thereby improving the exploration‑exploitation balance.
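A toy sketch of this scaling follows. The exact normalization KwaiPilot uses is not specified, so the min–max scheme below (mapping normalized entropy into a [0.5, 1.5] multiplier) is an assumption chosen purely to illustrate the amplify/suppress behavior:

```python
def entropy_scaled_advantages(advantages, entropies, eps=1e-8):
    """Scale each rollout's advantage by its normalized policy entropy.

    High-entropy (exploratory) rollouts get amplified advantages;
    low-entropy (confident) rollouts get suppressed ones.
    NOTE: the min-max normalization into [0.5, 1.5] is an illustrative
    assumption, not KwaiPilot's published formula.
    """
    min_h, max_h = min(entropies), max(entropies)
    scales = [(h - min_h) / (max_h - min_h + eps) + 0.5 for h in entropies]
    return [a * s for a, s in zip(advantages, scales)]

# Equal raw advantages, but rollouts differ in policy entropy:
scaled = entropy_scaled_advantages([1.0, 1.0, 1.0], [0.2, 0.5, 0.8])
# The low-entropy rollout is suppressed (~0.5), the high-entropy one
# amplified (~1.5), shifting gradient weight toward exploration.
```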
3. Summary and Outlook
The team emphasizes the importance of a large‑scale, modular data environment that decouples training data, sandbox, and framework, allowing independent scaling of data sources and flexible framework switching. This architecture accelerates data expansion, supports diverse domains (code, mathematics, games, etc.), and enhances model robustness and generalization across unseen environments.
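The decoupling described above can be sketched as narrow interfaces between data sources, sandboxes, and the training loop. Every name below is a hypothetical stand-in (this is not the SeamlessFlow API); the point is only that components interacting through small protocols can be swapped or scaled independently:

```python
# Sketch of a modular data environment: data sources and sandboxes are
# interchangeable behind minimal protocols. All names are illustrative.
from typing import Protocol

class DataSource(Protocol):
    def sample_task(self) -> dict: ...

class Sandbox(Protocol):
    def run(self, task: dict, action: str) -> float: ...  # returns a reward

class CodeTasks:
    def sample_task(self) -> dict:
        return {"domain": "code", "prompt": "fix the failing test"}

class MathTasks:
    def sample_task(self) -> dict:
        return {"domain": "math", "prompt": "prove the identity"}

class EchoSandbox:
    def run(self, task: dict, action: str) -> float:
        # Stand-in reward: a real sandbox would execute code or check proofs.
        return 1.0 if action else 0.0

def collect_rollout(source: DataSource, sandbox: Sandbox) -> float:
    task = source.sample_task()
    action = f"attempt: {task['prompt']}"   # a real agent would generate this
    return sandbox.run(task, action)

# Swapping data sources requires no change to the sandbox or training code:
rewards = [collect_rollout(src, EchoSandbox()) for src in (CodeTasks(), MathTasks())]
```

Because the trainer only sees rewards through the `Sandbox` protocol, new domains (games, additional math corpora) plug in as new `DataSource`/`Sandbox` pairs without touching the framework, which is the independent-scaling property the team highlights.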
Open‑source resources:
KAT‑Dev‑72B‑Exp repository: https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
