Kimi-Dev-72B Sets New Open‑Source SOTA on SWE‑bench Verified (60.4% Score)
Kimi-Dev-72B, an open-source 72-billion-parameter code model from Moonshot AI, scored a record 60.4% on the SWE-bench Verified benchmark, surpassing much larger models. It combines dual BugFixer/TestWriter roles, extensive mid-stage training on curated GitHub data, and reinforcement-learning-driven test-time self-play; the weights and code are available on Hugging Face and GitHub.
Kimi-Dev-72B Overview
Kimi-Dev-72B is a new open-source code model released by Moonshot AI (月之暗面) targeting software engineering tasks. The model, based on the Qwen 2.5-72B foundation, is available for download and deployment on Hugging Face and GitHub, providing model weights, source code, and an upcoming technical report.
Benchmark Performance
On the SWE-bench Verified programming benchmark, Kimi-Dev-72B scored 60.4%, establishing a new state of the art for open-source models and surpassing the 671-billion-parameter DeepSeek-R1 released earlier in May.
Model Design
The model combines two complementary roles: BugFixer and TestWriter. Both follow the same minimal two-stage framework (file localization, then code editing), allowing the model to first locate the relevant file and then apply the correct changes, whether it is fixing a bug or adding a unit test.
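The shared two-stage framework can be sketched as follows. This is a toy illustration under stated assumptions: the function names, the keyword-overlap localization heuristic, and the edit format are all hypothetical stand-ins, not Moonshot AI's actual implementation.

```python
def localize_file(issue: str, repo_files: list[str]) -> str:
    """Stage 1: pick the file most relevant to the issue.

    A toy keyword-overlap heuristic stands in for the model's
    learned file-localization step.
    """
    def score(path: str) -> int:
        tokens = issue.lower().split()
        return sum(1 for t in tokens if t.strip(".,") in path.lower())
    return max(repo_files, key=score)

def edit_code(role: str, issue: str, file_path: str) -> dict:
    """Stage 2: produce an edit for the localized file.

    The same framework serves both roles: BugFixer emits a patch,
    TestWriter emits a unit test targeting the bug.
    """
    kind = "patch" if role == "BugFixer" else "unit_test"
    return {"file": file_path, "kind": kind, "issue": issue}

issue = "Fix crash in parser when input is empty"
files = ["pkg/parser.py", "pkg/render.py", "README.md"]
target = localize_file(issue, files)       # -> "pkg/parser.py"
patch = edit_code("BugFixer", issue, target)
test = edit_code("TestWriter", issue, target)
```

The point of the shared framework is that localization and editing are the same skills for both roles; only the kind of edit emitted in stage 2 differs.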
BugFixer and TestWriter Combination
The two roles have complementary success criteria: a BugFixer patch succeeds only if it passes the exact unit tests that expose the bug, while a TestWriter test succeeds only if it triggers an assertion failure when run against the unpatched, buggy code. This pairing pushes the model to excel at both generating correct patches and writing effective tests.
Mid‑stage Training
Moonshot AI performed extensive mid-stage training on roughly 150 billion tokens of high-quality, real-world data. Starting from the Qwen 2.5-72B base, they collected millions of GitHub issues and pull-request submissions, carefully curated to teach the model how developers reason about issues, write fixes, and create tests. All repositories in the SWE-bench Verified set were removed to avoid data leakage.
Reinforcement Learning
The reinforcement‑learning phase focuses on improving code‑editing ability. Three key design choices were applied:
Result-only reward: only the final Docker execution outcome (0 or 1) is used as the reward, with no format- or process-based incentives.
Efficient prompt set: prompts that achieve a zero success rate under multi-sample evaluation are filtered out, and curriculum learning gradually increases task difficulty.
Positive-example reinforcement: in the final training stage, recently successful samples are re-introduced into the batch to strengthen successful patterns.
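The three design choices above can be sketched together. This is a minimal illustration under stated assumptions: rollout success is simulated by a caller-supplied function, and "Docker execution outcome" is reduced to a boolean; none of these helpers are from Moonshot AI's training code.

```python
def result_only_reward(docker_passed: bool) -> float:
    """Reward depends solely on the final execution outcome (0 or 1);
    no partial credit for format or intermediate steps."""
    return 1.0 if docker_passed else 0.0

def filter_prompts(prompts, rollout_fn, n_samples=8):
    """Efficient prompt set: drop prompts whose success rate is zero
    across n_samples rollouts."""
    kept = []
    for p in prompts:
        successes = sum(result_only_reward(rollout_fn(p))
                        for _ in range(n_samples))
        if successes > 0:
            kept.append(p)
    return kept

def build_batch(fresh_samples, recent_successes, replay_k=2):
    """Positive-example reinforcement: mix the most recent successful
    samples back into the batch during the final training stage."""
    return fresh_samples + recent_successes[-replay_k:]
```

For example, `filter_prompts(["easy", "impossible"], lambda p: p == "easy", n_samples=4)` keeps only `"easy"`, since the impossible prompt never yields a positive reward.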
Test‑time Self‑Play
After reinforcement learning, Kimi-Dev-72B can act as both BugFixer and TestWriter simultaneously. At test time, a self-play mechanism coordinates bug fixing and test writing, generating up to 40 patch candidates and 40 test candidates per problem and demonstrating the scaling effect of test-time self-play.
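One natural way to coordinate the two candidate pools is to let the model's own test candidates vote on its patch candidates and submit the patch that passes the most tests. The selection rule below is a toy sketch of that idea, with integers standing in for real patches and tests and a trivial pass predicate standing in for test execution in Docker; the actual coordination mechanism may differ.

```python
def select_patch(patches, tests, run_test):
    """Pick the patch candidate that passes the most test candidates."""
    def votes(patch):
        return sum(1 for t in tests if run_test(patch, t))
    return max(patches, key=votes)

# Toy example: a "test" t passes against a "patch" p when p >= t.
patches = [1, 3, 2]   # stand-ins for up to 40 patch candidates
tests = [1, 2, 3]     # stand-ins for up to 40 test candidates
best = select_patch(patches, tests, run_test=lambda p, t: p >= t)
# best == 3: it is the only patch passing all three candidate tests
```

Generating more candidates on both sides gives the voting step more signal, which is the expansion effect the self-play mechanism exploits.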
Future Directions
Future iterations will focus on deeper integration with popular IDEs, version‑control systems, and CI/CD pipelines, making Kimi-Dev-72B seamlessly fit into developers' workflows. Moonshot AI commits to continuous improvement, rigorous red‑team testing, and releasing increasingly powerful models to the community.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.