How SWE‑Swiss Enables a 32B Model to Match Larger LLMs on Software Engineering Tasks
Researchers from Peking University, ByteDance Seed, and Hong Kong University present SWE‑Swiss, a 32‑billion‑parameter model that, through a two‑stage training recipe and enhanced self‑consistency, achieves 60.2% accuracy on SWE‑bench Verified, matching larger models while remaining fully open‑source.
Introduction
Automating real‑world software‑engineering problems with large language models (LLMs) requires more than code generation; models must understand context, locate relevant files, produce correct patches, and verify them. Existing frameworks such as Agentless decompose the task into a structured workflow, but training an efficient model that masters all steps remains challenging.
Method Overview – The SWE‑Swiss Recipe
The SWE‑Swiss approach explicitly models three core capabilities:
Localization: accurately identify the files that need modification.
Repair: generate a correct code patch that resolves the issue.
Unit Test Generation: create unit tests to validate the patch.
High‑quality training data are built via a verification‑based rejection sampling pipeline: many candidate samples are generated, then a strict test‑based automatic verifier keeps only those that pass and discards the rest, ensuring reliable supervision for fine‑tuning.
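The filtering step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `rejection_sample`, `toy_generate`, and `toy_verify` are hypothetical names, and the toy verifier stands in for actually running a repository's tests against a candidate.

```python
from typing import Callable, Iterable, List


def rejection_sample(
    generate: Callable[[str], Iterable[str]],
    verify: Callable[[str, str], bool],
    issues: Iterable[str],
) -> List[dict]:
    """Keep only the (issue, candidate) pairs that the verifier accepts."""
    kept = []
    for issue in issues:
        for candidate in generate(issue):
            if verify(issue, candidate):
                kept.append({"issue": issue, "response": candidate})
    return kept


# Toy stand-ins (assumptions): a real pipeline would sample candidates from
# an LLM and decide pass/fail by executing tests.
def toy_generate(issue: str) -> List[str]:
    return [f"{issue}-patch-{i}" for i in range(4)]


def toy_verify(issue: str, candidate: str) -> bool:
    # Pretend that only patches 0 and 2 pass the tests.
    return candidate.endswith(("0", "2"))


dataset = rejection_sample(toy_generate, toy_verify, ["bug-a", "bug-b"])
print(len(dataset), "verified samples kept")
```

Only verified samples reach the fine‑tuning set, so the cost of generating many rejected candidates is traded for cleaner supervision.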
Two‑Stage Training Procedure
Stage 1 – Multi‑task Supervised Fine‑Tuning (SFT): 10,254 high‑quality examples covering the three skills are mixed and used to fine‑tune a Qwen2.5‑32B model. After this stage the model reaches 36.0% accuracy on SWE‑bench Verified without test‑time scaling.
Stage 2 – Reinforcement Learning (RL) for Skill Mastery: Starting from the SFT checkpoint, RL focuses on improving the Repair skill. Inspired by POLARIS, the curriculum first trains for 200 steps on the full dataset, then prunes easy samples (those the model already solves with accuracy > 90%) and continues for 90 steps on the harder subset, encouraging the model to solve more challenging bugs.
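The two‑phase schedule can be illustrated with a short sketch. It assumes that "accuracy" is a per‑sample pass rate measured from the model's own rollouts, which the article does not spell out; `prune_easy`, `two_phase_curriculum`, and `train_step` are hypothetical names, and the RL update itself is left as a callback.

```python
from typing import Callable, Dict, List


def prune_easy(samples: List[str], accuracy: Dict[str, float],
               threshold: float = 0.9) -> List[str]:
    """Drop samples whose measured accuracy exceeds the threshold."""
    return [s for s in samples if accuracy[s] <= threshold]


def two_phase_curriculum(
    samples: List[str],
    accuracy: Dict[str, float],
    train_step: Callable[[List[str], int], None],
    full_steps: int = 200,
    hard_steps: int = 90,
) -> List[str]:
    """Train on all samples first, then only on the harder subset."""
    for step in range(full_steps):
        train_step(samples, step)
    hard = prune_easy(samples, accuracy)
    for step in range(full_steps, full_steps + hard_steps):
        train_step(hard, step)
    return hard


# Toy usage: the callback only records which pool each step trained on.
log = []
acc = {"a": 0.95, "b": 0.5, "c": 0.9}
hard = two_phase_curriculum(
    list(acc), acc,
    lambda pool, step: log.append((step, tuple(pool))),
    full_steps=2, hard_steps=1,
)
print(hard)
```

In this sketch the pass rates are supplied up front; in practice they would presumably be estimated from rollouts collected during the first phase.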
Enhanced Self‑Consistency at Test Time
During inference SWE‑Swiss generates multiple patches and filters them using an enhanced self‑consistency mechanism. In addition to exact‑match frequency, the method rewards candidates that lie in a dense region of similar solutions, using a combined score of exact‑match count and average similarity to the top‑k nearest neighbors.
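A minimal sketch of such a scoring rule follows. The article does not say how the two terms are combined or which similarity measure is used, so the weighted sum, the `weight` parameter, and the use of `difflib.SequenceMatcher` are assumptions made for illustration; `score_patches` and `select_patch` are hypothetical names.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(a: str, b: str) -> float:
    """Text similarity in [0, 1]; a stand-in for any patch similarity."""
    return SequenceMatcher(None, a, b).ratio()


def score_patches(patches: List[str], k: int = 3,
                  weight: float = 1.0) -> Dict[str, float]:
    """Score = exact-match count + weight * mean similarity to top-k neighbors."""
    counts = Counter(patches)
    unique = list(counts)
    scores = {}
    for patch in unique:
        # Similarities to the other distinct patches, most similar first.
        sims = sorted(
            (similarity(patch, other) for other in unique if other != patch),
            reverse=True,
        )[:k]
        density = sum(sims) / len(sims) if sims else 0.0
        scores[patch] = counts[patch] + weight * density
    return scores


def select_patch(patches: List[str], k: int = 3, weight: float = 1.0) -> str:
    """Return the highest-scoring candidate patch."""
    scores = score_patches(patches, k, weight)
    return max(scores, key=scores.get)


candidates = ["x = 1", "x = 1", "y = 2"]
print(select_patch(candidates))
```

With plain exact‑match voting, near‑identical patches that differ only in whitespace or naming split their votes; the density term lets them reinforce each other.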
Results
The 32B SWE‑Swiss model achieves a top score of 60.2% on the SWE‑bench Verified benchmark, comparable to much larger models such as Kimi‑Dev and DeepSeek‑R1‑0528. Accuracy progresses from 36.0% after SFT to 45.0% after RL, and finally to 60.2% with test‑time scaling and enhanced self‑consistency.
Conclusion and Open‑Source Release
The SWE‑Swiss recipe shows that a carefully designed training pipeline can enable a mid‑scale 32B model to rival larger LLMs on software‑engineering tasks. The authors open‑source the SWE‑Swiss‑32B model, the full training dataset, and the code repository.
GitHub: https://github.com/zhenyuhe00/SWE-Swiss
Hugging Face model and data: https://huggingface.co/SWE-Swiss
Source: 机器之心
The core contribution of this work is a complete and efficient SWE‑Swiss "recipe" that the authors propose and validate.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
