Kimi-Dev-72B Sets New Open‑Source SOTA on SWE‑bench Verified (60.4% Score)
Kimi-Dev-72B, an open-source 72-billion-parameter code model from Moonshot AI, scored a record 60.4% on the SWE-bench Verified benchmark, surpassing much larger models. It combines dual BugFixer/TestWriter roles, extensive mid-stage training on curated GitHub data, and reinforcement-learning-driven test-time self-play; the weights and code are available on Hugging Face and GitHub.
Kimi-Dev-72B Overview
Kimi-Dev-72B is a new open-source code model released by Moonshot AI (月之暗面) targeting software engineering tasks. The model, based on the Qwen 2.5-72B foundation, is available for download and deployment on Hugging Face and GitHub, providing model weights, source code, and an upcoming technical report.
Benchmark Performance
On the SWE-bench Verified programming benchmark, Kimi-Dev-72B scored 60.4%, establishing a new state of the art for open-source models and surpassing the 671-billion-parameter DeepSeek-R1 released earlier in May.
Model Design
The model combines two complementary roles: BugFixer and TestWriter. Both follow the same minimal two-stage framework (file localization, then code editing), allowing the model to first locate the relevant file and then apply the correct changes, whether it is fixing a bug or adding a unit test.
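The shared two-stage framework can be sketched as follows. This is a toy illustration under stated assumptions: the function names, the keyword-overlap localization heuristic, and the edit format are all hypothetical stand-ins, not Moonshot AI's actual implementation.

```python
def localize_file(issue: str, repo_files: list[str]) -> str:
    """Stage 1: pick the file most relevant to the issue.

    A toy keyword-overlap heuristic stands in for the model's
    learned file-localization step.
    """
    def score(path: str) -> int:
        tokens = issue.lower().split()
        return sum(1 for t in tokens if t.strip(".,") in path.lower())
    return max(repo_files, key=score)

def edit_code(role: str, issue: str, file_path: str) -> dict:
    """Stage 2: produce an edit for the localized file.

    The same framework serves both roles: BugFixer emits a patch,
    TestWriter emits a unit test targeting the bug.
    """
    kind = "patch" if role == "BugFixer" else "unit_test"
    return {"file": file_path, "kind": kind, "issue": issue}

issue = "Fix crash in parser when input is empty"
files = ["pkg/parser.py", "pkg/render.py", "README.md"]
target = localize_file(issue, files)       # -> "pkg/parser.py"
patch = edit_code("BugFixer", issue, target)
test = edit_code("TestWriter", issue, target)
```

The point of the shared framework is that localization and editing are the same skills for both roles; only the kind of edit emitted in stage 2 differs.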
BugFixer and TestWriter Combination
The two roles have complementary success criteria: a BugFixer patch succeeds only if it passes the exact unit tests that expose the bug, while a TestWriter test succeeds only if it triggers an assertion failure when run against the unpatched, buggy code. This pairing pushes the model to excel at both generating correct patches and writing effective tests.
Mid‑stage Training
Moonshot AI performed extensive mid-stage training on roughly 150 billion tokens of high-quality, real-world data. Starting from the Qwen 2.5-72B base, they collected millions of GitHub issues and pull-request submissions, carefully curated to teach the model how developers reason about issues, write fixes, and create tests. All repositories in the SWE-bench Verified set were removed to avoid data leakage.
Reinforcement Learning
The reinforcement‑learning phase focuses on improving code‑editing ability. Three key design choices were applied:
Result-only reward: only the final Docker execution outcome (0 or 1) is used as the reward, with no format- or process-based incentives.
Efficient prompt set: prompts that achieve a zero success rate under multi-sample evaluation are filtered out, and curriculum learning gradually increases task difficulty.
Positive-example reinforcement: in the final training stage, recently successful samples are re-introduced into the batch to strengthen successful patterns.
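The three design choices above can be sketched together. This is a minimal illustration under stated assumptions: rollout success is simulated by a caller-supplied function, and "Docker execution outcome" is reduced to a boolean; none of these helpers are from Moonshot AI's training code.

```python
def result_only_reward(docker_passed: bool) -> float:
    """Reward depends solely on the final execution outcome (0 or 1);
    no partial credit for format or intermediate steps."""
    return 1.0 if docker_passed else 0.0

def filter_prompts(prompts, rollout_fn, n_samples=8):
    """Efficient prompt set: drop prompts whose success rate is zero
    across n_samples rollouts."""
    kept = []
    for p in prompts:
        successes = sum(result_only_reward(rollout_fn(p))
                        for _ in range(n_samples))
        if successes > 0:
            kept.append(p)
    return kept

def build_batch(fresh_samples, recent_successes, replay_k=2):
    """Positive-example reinforcement: mix the most recent successful
    samples back into the batch during the final training stage."""
    return fresh_samples + recent_successes[-replay_k:]
```

For example, `filter_prompts(["easy", "impossible"], lambda p: p == "easy", n_samples=4)` keeps only `"easy"`, since the impossible prompt never yields a positive reward.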
Test‑time Self‑Play
After reinforcement learning, Kimi-Dev-72B can act as both BugFixer and TestWriter simultaneously. At test time, a self-play mechanism coordinates bug fixing and test writing, generating up to 40 patch candidates and 40 test candidates per problem and demonstrating the scaling effect of test-time self-play.
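One natural way to coordinate the two candidate pools is to let the model's own test candidates vote on its patch candidates and submit the patch that passes the most tests. The selection rule below is a toy sketch of that idea, with integers standing in for real patches and tests and a trivial pass predicate standing in for test execution in Docker; the actual coordination mechanism may differ.

```python
def select_patch(patches, tests, run_test):
    """Pick the patch candidate that passes the most test candidates."""
    def votes(patch):
        return sum(1 for t in tests if run_test(patch, t))
    return max(patches, key=votes)

# Toy example: a "test" t passes against a "patch" p when p >= t.
patches = [1, 3, 2]   # stand-ins for up to 40 patch candidates
tests = [1, 2, 3]     # stand-ins for up to 40 test candidates
best = select_patch(patches, tests, run_test=lambda p, t: p >= t)
# best == 3: it is the only patch passing all three candidate tests
```

Generating more candidates on both sides gives the voting step more signal, which is the expansion effect the self-play mechanism exploits.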
Future Directions
Future iterations will focus on deeper integration with popular IDEs, version‑control systems, and CI/CD pipelines, making Kimi-Dev-72B seamlessly fit into developers' workflows. Moonshot AI commits to continuous improvement, rigorous red‑team testing, and releasing increasingly powerful models to the community.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.