How Scale‑SWE’s Real‑World Software Engineering Dataset Supercharges AI Models

The Scale‑SWE project releases a dataset of 100,000 real software‑engineering tasks built with a sandboxed multi‑agent workflow. Models fine‑tuned on this data reach 64% on SWE‑bench‑Verified and surpass leading industrial baselines, underscoring the value of authentic SWE data.

Scale‑SWE is an open‑source dataset containing 100,000 real software‑engineering (SWE) tasks, created using a novel sandboxed multi‑agent workflow that extracts high‑quality data from thousands of GitHub repositories.

Why Real SWE Data Matters

Synthetic datasets such as SWE‑smith often suffer from severe type‑distribution imbalance, focusing mainly on simple logic errors, whereas Scale‑SWE provides a balanced distribution of task types that more accurately reflects real engineering challenges.

Technical Breakthroughs: Overcoming Three Scaling Barriers

Previous attempts to build authentic SWE datasets faced three major obstacles: (1) extremely complex environment configuration, (2) lack of unit tests, and (3) problem‑statement leakage. Scale‑SWE introduces three specialized agents to address these issues.

1. Environment Builder Agent (EBA)

EBA operates in an isolated sandbox, automatically explores a repository’s structure, reads configuration files such as README.md or pyproject.toml, runs test scripts, and iteratively fixes failures, fully automating even complex environment setup.
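
To make the loop concrete, here is a minimal sketch of an EBA‑style setup loop, assuming a Python project and approximating the sandbox with a plain subprocess; `propose_fix` is a hypothetical stand‑in for the LLM call that turns an error log into a remedial shell command, not the paper’s actual interface.

```python
import subprocess

def run(cmd: str, repo_dir: str):
    """Run a shell command inside the repo; return (exit_code, combined_output)."""
    proc = subprocess.run(cmd, shell=True, cwd=repo_dir,
                          capture_output=True, text=True, timeout=600)
    return proc.returncode, proc.stdout + proc.stderr

def propose_fix(error_log: str) -> str:
    """Hypothetical LLM hook: map an error log to a remedial shell command."""
    if "ModuleNotFoundError" in error_log:
        missing = error_log.rsplit("'", 2)[-2]   # crude extraction of the module name
        return f"pip install {missing}"
    return "pip install -e ."                    # fall back to an editable install

def build_environment(repo_dir: str, max_rounds: int = 5) -> bool:
    """Iteratively install, test, and repair until the suite runs, or give up."""
    run("pip install -e .", repo_dir)            # first attempt from pyproject.toml
    for _ in range(max_rounds):
        code, log = run("python -m pytest --collect-only -q", repo_dir)
        if code == 0:
            return True                          # environment imports cleanly
        run(propose_fix(log), repo_dir)          # feed the error log back for a fix
    return False
```

The key design point is the feedback loop: each failed run feeds its log back to the agent, which proposes the next repair step.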

2. Unit‑test Creator Agent (UCA)

UCA generates unit tests directly from pull‑request diffs, producing both Fail‑to‑Pass (F2P) and Pass‑to‑Pass (P2P) cases. By checking out the commits before and after each pull request and running the generated tests at both points, UCA verifies that the tests are effective, turning otherwise discarded code into valuable test data.
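
The commit‑switching check can be sketched as follows, assuming the repository is a git checkout, pytest is the test runner, and the generated test lives as an untracked file so checkouts preserve it; `base_sha` and `merge_sha` (the commits before and after the pull request) and the function names are illustrative, not the paper’s API.

```python
import subprocess

def passes(test_file: str, repo_dir: str) -> bool:
    """True if pytest exits 0 on the given test file."""
    result = subprocess.run(["python", "-m", "pytest", test_file, "-q"], cwd=repo_dir)
    return result.returncode == 0

def checkout(sha: str, repo_dir: str) -> None:
    subprocess.run(["git", "checkout", "-q", sha], cwd=repo_dir, check=True)

def classify_test(test_file: str, repo_dir: str, base_sha: str, merge_sha: str) -> str:
    """Label a generated test by running it before and after the PR's changes."""
    checkout(base_sha, repo_dir)
    before = passes(test_file, repo_dir)
    checkout(merge_sha, repo_dir)
    after = passes(test_file, repo_dir)
    if not before and after:
        return "F2P"      # fails pre-patch, passes post-patch: captures the fix
    if before and after:
        return "P2P"      # passes at both commits: guards against regressions
    return "invalid"      # flaky or wrong; discard
```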

3. Problem Statement Writer Agent (PSWA)

To avoid leaking bug locations or solutions, PSWA leverages the Gemini 3 Pro model with carefully crafted prompts, ensuring that generated problem statements remain semantically aligned with the F2P tests while revealing no clues that give away the fix. Ablation studies show that high‑quality problem statements improve downstream supervised‑fine‑tuning performance by nearly 10%.
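
As one illustration of how such a no‑leakage constraint might be enforced mechanically, the sketch below rejects a statement that names any file path or function touched by the patch; the regexes and criteria are assumptions, not the paper’s actual filter.

```python
import re

def leaked_identifiers(diff: str) -> set:
    """Collect file paths and added function names touched by the PR diff."""
    files = set(re.findall(r"^\+\+\+ b/(\S+)", diff, flags=re.M))
    funcs = set(re.findall(r"^\+\s*def\s+(\w+)", diff, flags=re.M))
    return files | funcs

def is_leak_free(statement: str, diff: str) -> bool:
    """Reject a problem statement that mentions any identifier from the patch."""
    return not any(ident in statement for ident in leaked_identifiers(diff))

# Usage: a statement that names the patched file or function is rejected.
diff = """--- a/src/parser.py
+++ b/src/parser.py
+def handle_empty_input(text):
"""
assert not is_leak_free("Fix handle_empty_input in src/parser.py", diff)
assert is_leak_free("The parser crashes when given an empty string.", diff)
```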

Evaluation: Scale and Quality Verified

Using DeepSeek v3.2, the team distilled 71,000 effective trajectories from Scale‑SWE and fine‑tuned the Qwen3‑30A3B‑Instruct model, which achieved a 64% score on SWE‑bench‑Verified. This outperforms the baseline Qwen3‑Coder‑30A3B, the industrial‑grade GLM‑4.7‑Flash‑30A3B, and even models trained on other large datasets such as KAT‑Dev‑32B and SWE‑Lego‑32B.
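
The term “effective trajectories” suggests an outcome‑based filter before fine‑tuning; below is a hedged sketch of what that distillation step might look like, with `Trajectory` and its fields as hypothetical stand‑ins for the paper’s pipeline.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    task_id: str
    messages: list[dict]   # the agent's tool-use transcript, used as SFT targets
    resolved: bool         # recorded at rollout time: did the F2P tests pass?

def distill(trajectories: list[Trajectory]) -> list[Trajectory]:
    """Keep only verified-successful, non-trivial rollouts for fine-tuning.
    Roughly the kind of filter that could yield the paper's 71,000
    effective trajectories from the full set of rollouts."""
    return [t for t in trajectories if t.resolved and len(t.messages) >= 2]
```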

A comparison with the synthetic SWE‑smith dataset shows that, despite SWE‑smith’s larger raw quantity, scaling it up yields only minimal gains, whereas scaling up Scale‑SWE’s authentic data delivers a clear, step‑change advantage.

The release aims to provide a solid data infrastructure for AI research in software engineering, offering ready‑to‑use real data and distilled trajectories to significantly lower the entry barrier for the community.

Paper title: "Immersion in the GitHub Universe: Scaling Coding Agents to Mastery"
Paper link: https://arxiv.org/abs/2602.09892
Code repository: https://github.com/AweAI-Team/ScaleSWE
Open dataset: https://huggingface.co/collections/AweAI-Team/scale-swe
Scaffold address: https://github.com/AweAI-Team/AweAgent/tree/main/recipes/scale_swe
Tags: AI agents, model evaluation, multi-agent workflow, Qwen3-30A3B-Instruct, Scale-SWE, software engineering dataset
Written by PaperAgent: daily updates analyzing cutting-edge AI research papers.