How Scale‑SWE Enables 100k Real‑World Coding Tasks for AI Agents
The Scale‑SWE project combines a 100k‑sample software‑engineering dataset with high‑concurrency sandbox infrastructure and a multi‑agent construction workflow, substantially improving code‑agent training and evaluation and surpassing comparable existing models on the SWE‑bench benchmark.
Background
The Scale‑SWE dataset, released by researchers from Renmin University of China and ByteDance, contains 100,000 real pull‑request samples from GitHub, making it the largest open‑source, high‑quality software‑engineering (SWE) dataset to date. The dataset was constructed using Volcano Engine’s high‑concurrency sandbox infrastructure, enabling large‑scale automated environment setup, test execution, and data collection.
Why Real SWE Data Matters
Earlier synthetic datasets such as SWE‑smith and SWE‑Mirror generated tens of thousands of examples from a limited set of repositories, resulting in a distribution heavily biased toward logic error types. In contrast, real‑world datasets like Scale‑SWE and SWE‑Gym exhibit a balanced category distribution that more accurately reflects practical development scenarios.
Challenges in Scaling Real SWE Data
High‑concurrency sandbox scheduling: Building SWE data requires repeated container creation, image pulling, and test execution, which cannot be performed efficiently on a single physical machine.
Complex environment configuration: Many pull requests involve non‑trivial build steps beyond a simple pip install -e . or a static requirements.txt; dynamic verification with pytest and on‑the‑fly adjustments are necessary.
Scarcity of unit tests: A large fraction of high‑quality pull requests lack accompanying unit tests, making it difficult to generate reliable fail‑to‑pass (F2P) or pass‑to‑pass (P2P) test cases; the F2P/P2P semantics are sketched in code after this list.
Problem statement leakage: Directly prompting large language models with raw diffs can unintentionally expose bug locations or solutions, so problem statements must be generated with care.
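To make the F2P/P2P terminology concrete, here is a minimal sketch of the validation rule: a fail‑to‑pass (F2P) test must fail on the pre‑patch code and pass once the patch is applied, while a pass‑to‑pass (P2P) test must pass in both states. The run_pytest helper and the local subprocess execution are hypothetical stand‑ins; the real pipeline runs these checks inside sandbox containers.

```python
import subprocess

def run_pytest(repo_dir: str, test_id: str) -> bool:
    """Run one test in the checked-out repo; True iff it passes.

    Hypothetical helper: the actual pipeline executes this inside a
    sandbox container rather than a local subprocess.
    """
    result = subprocess.run(
        ["python", "-m", "pytest", test_id, "-q"],
        cwd=repo_dir,
        capture_output=True,
        timeout=600,
    )
    return result.returncode == 0

def classify_test(pre_patch_dir: str, post_patch_dir: str, test_id: str) -> str:
    """Classify a candidate test as F2P, P2P, or unusable."""
    passes_before = run_pytest(pre_patch_dir, test_id)
    passes_after = run_pytest(post_patch_dir, test_id)
    if not passes_before and passes_after:
        return "F2P"   # exercises exactly the bug the PR fixes
    if passes_before and passes_after:
        return "P2P"   # guards against regressions elsewhere
    return "unusable"  # flaky, or still failing after the patch
```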
Scale‑SWE Multi‑Agent Workflow
The dataset construction pipeline uses a sandbox‑based multi‑agent system composed of three core agents:
Environment Builder Agent (EBA): Automatically discovers the repository structure, locates configuration files such as README.md, setup.py, and pyproject.toml, and iteratively runs pytest inside the sandbox. Errors from each test run are fed back to adjust the environment until the tests pass; a sketch of this loop follows the agent list.
Unit‑test Creator Agent (UCA): For pull requests without existing unit tests, UCA analyses the code diff together with the full repository, generates the missing F2P/P2P test cases, commits them, and validates them by executing the tests in the sandbox. Massive parallelism is required to handle the 100k‑sample scale.
Problem Statement Writer Agent (PSWA): Generates concise problem descriptions that avoid leaking the bug location or solution (see the prompt sketch below). The team employed Gemini 3‑Pro for its strong instruction‑following ability; ablation studies showed that higher‑quality statements improve supervised fine‑tuning (SFT) performance on SWE‑bench by up to 10%.
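The EBA's build‑repair loop can be pictured as follows. This is a minimal sketch under stated assumptions: propose_fix, which in the real agent is an LLM call that reads the pytest output, is replaced here by a toy heuristic for missing modules, and the sandbox is approximated by local subprocess calls.

```python
import re
import subprocess

MAX_ATTEMPTS = 5

def run(cmd: list[str], cwd: str) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)

def propose_fix(error_log: str) -> list[list[str]]:
    """Toy stand-in for the LLM step: handle only missing modules here;
    the real agent reasons over arbitrary build and import errors."""
    missing = re.findall(r"No module named '([\w.]+)'", error_log)
    return [["pip", "install", name.split(".")[0]] for name in missing]

def build_environment(repo_dir: str) -> bool:
    """Iteratively configure the repo until pytest can at least collect tests."""
    run(["pip", "install", "-e", "."], cwd=repo_dir)  # baseline attempt
    for _ in range(MAX_ATTEMPTS):
        probe = run(["python", "-m", "pytest", "--collect-only", "-q"], cwd=repo_dir)
        if probe.returncode == 0:
            return True  # environment usable; full test runs can proceed
        fixes = propose_fix(probe.stdout + probe.stderr)
        if not fixes:
            return False  # nothing left to try automatically
        for cmd in fixes:
            run(cmd, cwd=repo_dir)
    return False
```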
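The leakage constraint PSWA must satisfy can be expressed as prompt rules: describe the symptom a user would observe while withholding the diff's file paths and the fix itself. The wording below is an assumption for illustration; the paper does not publish the actual prompt.

```python
# Hypothetical prompt template; the actual PSWA prompt is not published.
PSWA_PROMPT = """You are writing a GitHub issue for the bug fixed by the
patch below. Describe only what a user would observe: the symptom, how
to trigger it, and the expected behaviour.

Hard constraints (to avoid leaking the solution):
- Do NOT name the files, functions, or lines the patch touches.
- Do NOT describe the fix or hint at the root cause.
- Write a bug report, not a change summary.

Patch (for your reference only, never to be quoted):
{diff}
"""

def build_pswa_prompt(diff: str) -> str:
    return PSWA_PROMPT.format(diff=diff)
```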
Sandbox Infrastructure and Performance
Volcano Engine’s sandbox platform can schedule thousands of containers concurrently. By allocating roughly 5,000 sandbox instances, the team reduced the end‑to‑end dataset construction time from about one month on a single machine to approximately one hour. The system caches container images to minimise pull latency and provides stable resource isolation, preventing CPU contention during intensive pytest runs.
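The throughput claim is easiest to appreciate as a bounded‑concurrency scheduling problem. Below is a minimal asyncio sketch, assuming a hypothetical sandbox_run coroutine that creates a container, executes one task, and tears it down; the real platform layers image caching and resource isolation on top of this pattern.

```python
import asyncio

MAX_SANDBOXES = 5_000  # roughly the instance count the team reports

async def sandbox_run(task_id: int) -> str:
    """Hypothetical unit of work: create a container, set up the repo,
    run the tests, tear everything down. Simulated here with a sleep."""
    await asyncio.sleep(0.01)
    return f"task-{task_id}: done"

async def schedule(num_tasks: int) -> list[str]:
    """Dispatch every task while keeping at most MAX_SANDBOXES in flight."""
    gate = asyncio.Semaphore(MAX_SANDBOXES)

    async def bounded(task_id: int) -> str:
        async with gate:
            return await sandbox_run(task_id)

    return await asyncio.gather(*(bounded(i) for i in range(num_tasks)))

if __name__ == "__main__":
    results = asyncio.run(schedule(100_000))
    print(len(results), "tasks completed")
```

For rough intuition: one month is about 720 hours, so the reported month‑to‑hour reduction is a ~720× speedup, comfortably within reach of ~5,000 concurrent instances even with scheduling and image‑pull overhead.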
Experimental Evaluation
Data distillation was performed with DeepSeek v3.2, yielding 71,000 successful execution trajectories. These trajectories were used to fine‑tune Qwen3‑30B‑A3B‑Instruct, producing the Scale‑SWE‑Agent (a sketch of the trajectory‑filtering step follows the model list). On the SWE‑bench‑Verified benchmark, the Scale‑SWE‑Agent outperformed:
Qwen3‑Coder‑30A3B
GLM‑4.7‑Flash‑30A3B
KAT‑Dev‑32B
SWE‑Lego‑32B (trained on other datasets)
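The distillation step amounts to rejection filtering: keep only trajectories whose final patch passes verification, then flatten them into SFT examples. A minimal sketch follows; the field names and record layout are assumptions, not the released schema.

```python
import json

def is_successful(trajectory: dict) -> bool:
    """Keep a trajectory only if its final patch passed verification,
    i.e. all F2P tests now pass and no P2P test regressed.
    Field names are assumed for illustration."""
    return trajectory["f2p_passed"] and not trajectory["p2p_regressions"]

def to_sft_example(trajectory: dict) -> dict:
    """Flatten one agent rollout into a prompt/response pair for SFT."""
    return {
        "prompt": trajectory["problem_statement"],
        "response": "\n".join(step["action"] for step in trajectory["steps"]),
    }

def build_sft_dataset(path_in: str, path_out: str) -> int:
    """Filter a JSONL file of rollouts down to successful SFT examples."""
    kept = 0
    with open(path_in) as fin, open(path_out, "w") as fout:
        for line in fin:
            traj = json.loads(line)
            if is_successful(traj):
                fout.write(json.dumps(to_sft_example(traj)) + "\n")
                kept += 1
    return kept  # the paper reports 71,000 surviving trajectories
```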
Comparative experiments across training datasets showed that, despite SWE‑smith's larger volume, models trained on it performed no better than those trained on SWE‑Gym, whereas Scale‑SWE consistently achieved superior scores, confirming the value of real data at scale.
Future Outlook
The authors intend to make Scale‑SWE openly available to lower the entry barrier for SWE research. The repository and dataset can be accessed at:
https://github.com/AweAI-Team/ScaleSWE
Dataset files are hosted on Hugging Face:
https://huggingface.co/collections/AweAI-Team/scale-swe
For full methodological details and additional results, see the paper:
https://arxiv.org/abs/2602.09892
ByteDance SE Lab
Official account of ByteDance SE Lab, sharing research and practical experience in software engineering. Our lab unites researchers and engineers from various domains to accelerate the fusion of software engineering and AI, driving technological progress in every phase of software development.