How KAT-Dev-72B-Exp Sets a New Record in Large‑Scale RL for Code Generation
The KAT‑Dev‑72B‑Exp model, an experimental reinforcement‑learning version of KAT‑Coder, scores 74.6% on the SWE‑Bench Verified benchmark, introduces Trie Packing and entropy‑aware advantage scaling, and showcases a decoupled training architecture that dramatically speeds up large‑scale agentic RL training.
Large‑scale reinforcement learning (RL) is a key pathway to unlocking complex reasoning and improving task generalization in large language models (LLMs). The KwaiPilot team recently released KAT‑Dev‑72B‑Exp, an experimental RL‑enhanced version of the KAT‑Coder model, which achieved a record‑breaking 74.6% on the SWE‑Bench Verified software‑development benchmark.
1. Trie Packing
The model is built on the proprietary SeamlessFlow industrial‑grade RL framework, which fully decouples training logic from the agent, supporting multi‑agent and online RL scenarios. To handle complex agent trajectories that form tree‑shaped token sequences, the team introduced a Trie Packing mechanism and re‑engineered the training engine, enabling efficient training on shared prefix trajectories.
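The core idea behind packing shared-prefix trajectories can be illustrated with a small sketch. The class and function names below are hypothetical, not KwaiPilot's actual implementation; the sketch only shows how inserting rollouts into a trie deduplicates the tokens they share:

```python
# Minimal sketch: pack tree-shaped trajectories into a trie so that
# shared prefixes are stored (and, during training, computed) only once.
# All names here are illustrative assumptions, not the SeamlessFlow API.

class TrieNode:
    def __init__(self, token=None):
        self.token = token      # None only at the root
        self.children = {}      # token -> TrieNode
        self.is_leaf = False    # True where a rollout ends

def build_trie(trajectories):
    """Insert token sequences that share prefixes into one trie."""
    root = TrieNode()
    for seq in trajectories:
        node = root
        for tok in seq:
            node = node.children.setdefault(tok, TrieNode(tok))
        node.is_leaf = True
    return root

def count_nodes(node):
    """Count unique token positions stored in the trie (root excluded)."""
    return sum(count_nodes(c) for c in node.children.values()) + (node.token is not None)

# Three rollouts branching from the same shared 3-token prefix:
rollouts = [
    [1, 2, 3, 4, 5],
    [1, 2, 3, 4, 6],
    [1, 2, 3, 7],
]
flat_tokens = sum(len(s) for s in rollouts)       # 14 tokens if flattened
trie_tokens = count_nodes(build_trie(rollouts))   # 7 unique trie nodes
```

Flattening the three rollouts would process 14 tokens, while the trie holds only 7 unique positions; the reported throughput gains come from applying the analogous deduplication to forward and backward passes over shared prefixes.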
In large‑scale LLM agentic training, token trajectories often form a tree structure due to test‑time scaling (TTS) and memory mechanisms. Traditional approaches flatten these trees into independent linear sequences, recomputing shared prefixes. KwaiPilot instead rewrote the training engine and attention kernel to perform tree‑gradient weight correction, merging the repeated backward computations on shared prefixes, which increased training throughput by 2.5× on average.
2. Entropy‑Aware Advantage Scaling
The method also applies entropy‑aware advantage scaling: each rollout’s policy entropy is normalized and used as a scaling factor for its advantage, amplifying high‑uncertainty (exploratory) samples while suppressing low‑entropy ones, thereby improving the exploration‑exploitation balance.
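A toy sketch of this scaling follows. The exact normalization KwaiPilot uses is not specified, so the min–max scheme below (mapping normalized entropy into a [0.5, 1.5] multiplier) is an assumption chosen purely to illustrate the amplify/suppress behavior:

```python
def entropy_scaled_advantages(advantages, entropies, eps=1e-8):
    """Scale each rollout's advantage by its normalized policy entropy.

    High-entropy (exploratory) rollouts get amplified advantages;
    low-entropy (confident) rollouts get suppressed ones.
    NOTE: the min-max normalization into [0.5, 1.5] is an illustrative
    assumption, not KwaiPilot's published formula.
    """
    min_h, max_h = min(entropies), max(entropies)
    scales = [(h - min_h) / (max_h - min_h + eps) + 0.5 for h in entropies]
    return [a * s for a, s in zip(advantages, scales)]

# Equal raw advantages, but rollouts differ in policy entropy:
scaled = entropy_scaled_advantages([1.0, 1.0, 1.0], [0.2, 0.5, 0.8])
# The low-entropy rollout is suppressed (~0.5), the high-entropy one
# amplified (~1.5), shifting gradient weight toward exploration.
```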
3. Summary and Outlook
The team emphasizes the importance of a large‑scale, modular data environment that decouples training data, sandbox, and framework, allowing independent scaling of data sources and flexible framework switching. This architecture accelerates data expansion, supports diverse domains (code, mathematics, games, etc.), and enhances model robustness and generalization across unseen environments.
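The decoupling described above can be sketched as narrow interfaces between data sources, sandboxes, and the training loop. Every name below is a hypothetical stand-in (this is not the SeamlessFlow API); the point is only that components interacting through small protocols can be swapped or scaled independently:

```python
# Sketch of a modular data environment: data sources and sandboxes are
# interchangeable behind minimal protocols. All names are illustrative.
from typing import Protocol

class DataSource(Protocol):
    def sample_task(self) -> dict: ...

class Sandbox(Protocol):
    def run(self, task: dict, action: str) -> float: ...  # returns a reward

class CodeTasks:
    def sample_task(self) -> dict:
        return {"domain": "code", "prompt": "fix the failing test"}

class MathTasks:
    def sample_task(self) -> dict:
        return {"domain": "math", "prompt": "prove the identity"}

class EchoSandbox:
    def run(self, task: dict, action: str) -> float:
        # Stand-in reward: a real sandbox would execute code or check proofs.
        return 1.0 if action else 0.0

def collect_rollout(source: DataSource, sandbox: Sandbox) -> float:
    task = source.sample_task()
    action = f"attempt: {task['prompt']}"   # a real agent would generate this
    return sandbox.run(task, action)

# Swapping data sources requires no change to the sandbox or training code:
rewards = [collect_rollout(src, EchoSandbox()) for src in (CodeTasks(), MathTasks())]
```

Because the trainer only sees rewards through the `Sandbox` protocol, new domains (games, additional math corpora) plug in as new `DataSource`/`Sandbox` pairs without touching the framework, which is the independent-scaling property the team highlights.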
Open‑source resources:
KAT‑Dev‑72B‑Exp repository: https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
