How SWE‑Swiss Enables a 32B Model to Match Larger LLMs on Software Engineering Tasks

Researchers from Peking University, ByteDance Seed, and the University of Hong Kong present SWE‑Swiss, a recipe for training a 32‑billion‑parameter model. Through two‑stage training and enhanced self‑consistency at test time, the resulting model reaches 60.2 % accuracy on SWE‑bench Verified, matching larger models while remaining fully open‑source.

Data Party THU

Introduction

Automating real‑world software‑engineering problems with large language models (LLMs) requires more than code generation; models must understand context, locate relevant files, produce correct patches, and verify them. Existing frameworks such as Agentless decompose the task into a structured workflow, but training an efficient model that masters all steps remains challenging.

Method Overview – The SWE‑Swiss Recipe

The SWE‑Swiss approach explicitly models three core capabilities:

Localization: accurately identify the files that need modification.

Repair: generate a correct code patch that resolves the issue.

Unit Test Generation: create unit tests to validate the patch.

High‑quality training data is built with a verification‑based rejection sampling pipeline: many candidate samples are generated, then a strict test‑based automatic verifier keeps only those that pass and discards the rest, ensuring reliable supervision for fine‑tuning.
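The filtering step can be illustrated with a minimal sketch. Everything here is a toy assumption rather than the authors' pipeline: the candidates are small Python functions instead of repository patches, and `verify` is a hypothetical stand‑in for running a real test suite.

```python
from typing import Callable, Dict, List

Candidate = Callable[[int, int], int]


def verify(candidate: Candidate) -> bool:
    """Hypothetical test-based verifier: accept only if every check passes."""
    try:
        return candidate(2, 3) == 5 and candidate(-1, 1) == 0
    except Exception:
        # A candidate that crashes fails verification.
        return False


def rejection_sample(candidates: Dict[str, Candidate]) -> List[str]:
    """Keep the names of candidates that pass the verifier; reject the rest."""
    return [name for name, fn in candidates.items() if verify(fn)]


# Toy candidate pool standing in for sampled model outputs.
candidates: Dict[str, Candidate] = {
    "correct": lambda a, b: a + b,
    "wrong_op": lambda a, b: a - b,
    "crashes": lambda a, b: a // 0,
    "commuted": lambda a, b: b + a,
}

kept = rejection_sample(candidates)
print(kept)
```

Only the candidates that pass every check survive, so the resulting training set contains no unverified samples, at the cost of discarding most of what was generated.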

Two‑Stage Training Procedure

Stage 1 – Multi‑task Supervised Fine‑Tuning (SFT): A mix of 10,254 high‑quality examples covering the three skills is used to fine‑tune a Qwen2.5‑32B model. After this stage the model reaches 36.0 % accuracy on SWE‑bench Verified without test‑time expansion.
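As a rough illustration of what mixing the three skills into one SFT set could look like, the sketch below tags each example with its task and shuffles the pool. The task names, placeholder examples, and the `mix_tasks` helper are assumptions for illustration; the article does not describe the actual mixing ratios or data format.

```python
import random
from typing import Dict, List, Tuple


def mix_tasks(datasets: Dict[str, List[str]], seed: int = 0) -> List[Tuple[str, str]]:
    """Tag every example with its task name, then shuffle all tasks together."""
    mixed = [(task, example) for task, examples in datasets.items() for example in examples]
    random.Random(seed).shuffle(mixed)  # seeded so the order is reproducible
    return mixed


# Placeholder examples; real entries would be prompt/response pairs.
datasets = {
    "localization": ["loc-1", "loc-2"],
    "repair": ["rep-1", "rep-2", "rep-3"],
    "unit_test_generation": ["test-1"],
}

mixed = mix_tasks(datasets)
for task, example in mixed:
    print(task, example)
```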

Stage 2 – Reinforcement Learning (RL) for Skill Mastery: Starting from the SFT checkpoint, RL targets the Repair skill. Following a curriculum inspired by POLARIS, the model first trains for 200 steps on the full dataset; easy samples (accuracy > 90 %) are then pruned and training continues for 90 steps on the harder subset, pushing the model toward more challenging bugs.
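The pruning step of this curriculum can be sketched as follows. The `Sample` type, the `rollout_accuracy` field, and the toy values are assumptions for illustration; only the 90 % threshold and the 200 and 90 step counts come from the text above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    issue_id: str
    rollout_accuracy: float  # fraction of sampled patches that passed, in [0, 1]


def prune_easy(samples: List[Sample], threshold: float = 0.9) -> List[Sample]:
    """Drop samples the model already solves more than `threshold` of the time."""
    return [s for s in samples if s.rollout_accuracy <= threshold]


samples = [
    Sample("a", 1.0),   # pruned: accuracy above 90 %
    Sample("b", 0.9),   # kept: not strictly above the threshold
    Sample("c", 0.25),  # kept: hard sample
    Sample("d", 0.95),  # pruned
]

hard_subset = prune_easy(samples)

# Two-phase schedule: (phase name, training steps, data used).
schedule = [("full", 200, samples), ("hard", 90, hard_subset)]
for name, steps, data in schedule:
    print(name, steps, [s.issue_id for s in data])
```

Removing samples the policy almost always solves concentrates the remaining updates on problems where the reward signal still varies between rollouts.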

Enhanced Self‑Consistency at Test Time

During inference, SWE‑Swiss generates multiple candidate patches and selects among them using an enhanced self‑consistency mechanism. Plain self‑consistency picks the most frequent exact match; the enhanced version also rewards candidates that lie in a dense region of similar solutions, scoring each candidate by combining its exact‑match count with its average similarity to its top‑k nearest neighbors.
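A minimal sketch of such a scoring rule is shown below. Several details are assumptions, since the article does not specify them: the two terms are combined as a weighted sum, similarity is measured with `difflib.SequenceMatcher` over the raw patch text, and neighbors are drawn from the other distinct candidates. The `weight` and `k` parameters are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Text similarity in [0, 1]; a stand-in for whatever metric the authors use."""
    return SequenceMatcher(None, a, b).ratio()


def select_patch(patches: List[str], k: int = 2, weight: float = 1.0) -> str:
    """Pick the patch with the best exact-match count plus neighborhood density."""
    if not patches:
        raise ValueError("need at least one candidate patch")
    counts = Counter(patches)
    unique = list(counts)

    def score(patch: str) -> float:
        # Similarity to every other distinct candidate, highest first.
        sims = sorted((similarity(patch, other) for other in unique if other != patch),
                      reverse=True)
        top = sims[:k]
        density = sum(top) / len(top) if top else 0.0
        return counts[patch] + weight * density

    return max(unique, key=score)


# Exact-match voting alone would tie here: every candidate appears once.
candidates = [
    "return a + b",
    "return a+b",
    "return (a + b)",
    "raise NotImplementedError()",
]
best = select_patch(candidates)
print(best)
```

In this toy pool the three near‑identical patches reinforce each other through the density term, so the outlier cannot win even though every candidate has the same exact‑match count.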

Results

The 32B SWE‑Swiss model achieves a top score of 60.2 % on the SWE‑bench Verified benchmark, comparable to much larger models such as Kimi‑Dev and DeepSeek‑R1‑0528. Accuracy rises from 36.0 % after SFT to 45.0 % after RL, and reaches 60.2 % with test‑time expansion and enhanced self‑consistency.

Conclusion and Open‑Source Release

The SWE‑Swiss recipe shows that a carefully designed training pipeline can enable a mid‑scale 32B model to rival larger LLMs on software‑engineering tasks. The authors open‑source the SWE‑Swiss‑32B model, the full training dataset, and the code repository.

GitHub: https://github.com/zhenyuhe00/SWE-Swiss

Hugging Face model and data: https://huggingface.co/SWE-Swiss

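Source: 机器之心 (Jiqizhixin)
This article is about 2,000 words; suggested reading time is 5 minutes.
The core contribution of this work is proposing and validating a complete and efficient SWE‑Swiss "recipe".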
Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
