Introducing DeNovoSWE: The First Long‑Horizon Doc2Repo Training Set for Code Agents
DeNovoSWE is a newly released large-scale dataset of 4,818 high-quality document-to-repository tasks. It uses a Divide-and-Conquer workflow with a Draft-Critic-Repair loop to generate well-organized, evaluation-aligned specifications. In experiments, training on it lifts an LLM code agent's repository-level success rate from single digits to 47.2% on BeyondSWE-Doc2Repo and 23.0% on NL2RepoBench.
Overview
DeNovoSWE is a dataset of 4,818 real‑world document‑to‑repository (Doc2Repo) instances. Each instance provides a single task document that is the only entry point for an agent to reconstruct a complete, testable code repository.
Dataset Construction
The dataset is generated automatically by a sandboxed multi‑agent workflow that follows a two‑stage Divide & Conquer process.
Divide stage: The target repository is analyzed and split into distinct capabilities (e.g., authentication, data I/O, batch processing). Execution traces from the original unit tests are collected to classify components into three categories:
direct components – APIs called directly by the tests;
core indirect components – APIs that affect observable behavior without being called directly;
non‑core indirect components – internal implementations that can be left to the agent.
This classification ensures that only essential APIs and behaviors are documented.
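The three-way classification above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the `Category` enum, and the two input sets (`called_by_tests`, `affects_behavior`) are assumptions introduced here, and in practice both sets would have to be derived from the collected execution traces.

```python
from enum import Enum


class Category(Enum):
    DIRECT = "direct"
    CORE_INDIRECT = "core indirect"
    NON_CORE_INDIRECT = "non-core indirect"


def classify_components(all_apis, called_by_tests, affects_behavior):
    """Assign each API name to one of the three categories.

    called_by_tests: APIs that the tests call directly (assumed input).
    affects_behavior: APIs that change what the tests can observe (assumed input).
    """
    result = {}
    for api in all_apis:
        if api in called_by_tests:
            result[api] = Category.DIRECT
        elif api in affects_behavior:
            result[api] = Category.CORE_INDIRECT
        else:
            result[api] = Category.NON_CORE_INDIRECT
    return result


def documented_apis(classification):
    """Return the APIs that belong in the task document (everything but non-core)."""
    return sorted(
        api
        for api, category in classification.items()
        if category is not Category.NON_CORE_INDIRECT
    )


# Hypothetical example inputs, made up for illustration.
example = classify_components(
    all_apis=["login", "hash_password", "_retry_backoff"],
    called_by_tests={"login"},
    affects_behavior={"hash_password"},
)
print(documented_apis(example))  # ['hash_password', 'login']
```

Here `_retry_backoff` is left out of the document, which mirrors the idea that non-core internals are left to the agent.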
Conquer stage: For each capability, a Draft-Critic-Repair loop generates the specification.
The Draft agent writes an initial description of the capability.
The Critic agent checks the draft for missing APIs, contracts, or structural details required for evaluation.
The Repair agent iterates on the draft based on the Critic’s feedback until the description satisfies the evaluation criteria.
All capability sections are then merged into a single, well‑organized task document.
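The control flow of the Conquer stage can be sketched as follows. This is only an outline under stated assumptions: the three agents are passed in as plain callables, the `max_rounds` cap is a safeguard added here (the article only says the loop runs until the evaluation criteria are met), and the toy agents stand in for LLM calls.

```python
def draft_critic_repair(capability, draft, critic, repair, max_rounds=3):
    """Run one Draft-Critic-Repair loop for a single capability."""
    spec = draft(capability)
    for _ in range(max_rounds):
        issues = critic(capability, spec)
        if not issues:  # the Critic found nothing missing
            break
        spec = repair(spec, issues)
    return spec


def build_task_document(capabilities, draft, critic, repair):
    """Merge the per-capability sections into one task document."""
    sections = [
        draft_critic_repair(capability, draft, critic, repair)
        for capability in capabilities
    ]
    return "\n\n".join(sections)


# Toy stand-ins for the three agents (hypothetical, for illustration only).
def toy_draft(capability):
    return f"{capability}: overview of the capability."


def toy_critic(capability, spec):
    # Flags a draft that does not list its public APIs.
    return [] if "APIs:" in spec else ["list the public APIs"]


def toy_repair(spec, issues):
    return spec + "\nAPIs: <to be listed>"


document = build_task_document(
    ["authentication", "data I/O"], toy_draft, toy_critic, toy_repair
)
print(document)
```

Passing the agents as callables keeps the loop itself independent of any particular model or prompting setup.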
Long‑Horizon Challenge
To force agents to rely solely on the document, each task environment is stripped of source code and test artifacts, its Git history is reset, and caches are cleared (e.g., site-packages, pip wheels, temporary compilation products). Consequently, agents must plan the repository structure, create modules, design APIs, handle dependencies, and iteratively debug using only the provided specification.
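The idea of leaving the agent with nothing but the task document can be illustrated with a small workspace-stripping helper. This is a simplified sketch, not the dataset's actual sandbox tooling: the `strip_workspace` function and the `TASK.md` file name are assumptions, and it deletes the `.git` directory outright rather than re-initializing history or touching system-level caches such as site-packages.

```python
import shutil
import tempfile
from pathlib import Path


def strip_workspace(root, keep):
    """Delete every top-level entry under root whose name is not in keep.

    Returns the names that were removed.
    """
    removed = []
    for entry in sorted(root.iterdir()):
        if entry.name in keep:
            continue
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed.append(entry.name)
    return removed


def demo():
    """Build a throwaway workspace, strip it, and report what is left."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "TASK.md").write_text("specification goes here")
        (root / "src").mkdir()
        (root / "src" / "app.py").write_text("print('hi')")
        (root / "tests").mkdir()
        (root / ".git").mkdir()
        removed = strip_workspace(root, keep={"TASK.md"})
        remaining = sorted(p.name for p in root.iterdir())
        return removed, remaining


removed, remaining = demo()
print(removed, remaining)
```

After the call, only the task document remains, so anything the agent needs has to be rebuilt from the specification.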
Difficulty‑Aware Trajectory Filtering
Each task is scored by structural complexity and LLM-predicted difficulty. The filtering policy applies higher pass-rate thresholds to easier tasks while retaining harder tasks even when their pass rates are lower. This balances quality (high pass rates) against diversity (including challenging long-horizon examples).
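One way to picture such a policy is a pass-rate threshold that falls as difficulty rises. The sketch below is an assumption-laden illustration: the article does not give the scoring formula or the thresholds, so the single `difficulty` score in [0, 1], the linear interpolation, and the values 1.0 and 0.6 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    task_id: str
    difficulty: float  # assumed combined score: 0.0 (easy) to 1.0 (hard)
    pass_rate: float   # fraction of evaluation tests passed


def required_pass_rate(difficulty, easy_threshold=1.0, hard_threshold=0.6):
    """Interpolate linearly between the easy and hard thresholds (hypothetical values)."""
    return easy_threshold + (hard_threshold - easy_threshold) * difficulty


def filter_trajectories(trajectories):
    """Keep a trajectory only if it meets the threshold for its difficulty."""
    return [
        t for t in trajectories
        if t.pass_rate >= required_pass_rate(t.difficulty)
    ]


# Made-up trajectories for illustration.
candidates = [
    Trajectory("easy-partial", difficulty=0.1, pass_rate=0.8),   # needs 0.96, dropped
    Trajectory("easy-perfect", difficulty=0.1, pass_rate=1.0),   # kept
    Trajectory("hard-partial", difficulty=0.9, pass_rate=0.8),   # needs 0.64, kept
]
print([t.task_id for t in filter_trajectories(candidates)])
```

The same 0.8 pass rate is rejected for the easy task and accepted for the hard one, which is the trade-off between quality and diversity described above.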
Experimental Results
Training on DeNovoSWE yields substantial improvements for code agents:
Qwen3-30B-A3B-Instruct on BeyondSWE-Doc2Repo: success rate rises from 5.8% (baseline) to 47.2%.
Same model on NL2RepoBench: success rate rises from 4.3% to 23.0%.
Scale-SWE-Agent, trained on issue-level SWE data, improves over the baseline to 29.2% (BeyondSWE-Doc2Repo) and 18.3% (NL2RepoBench), showing that conventional SWE data provides some transfer.
With the stronger backbone Qwen3.5-35B-A3B, DeNovoSWE raises BeyondSWE-Doc2Repo from 43.8% to 50.0% and NL2RepoBench from 23.5% to 27.1%.
These results indicate that high-quality, long-horizon data is considerably more effective than bug-fix-oriented datasets for training agents capable of full repository generation.
Resources
Paper: https://arxiv.org/pdf/2606.10728
GitHub repository: https://github.com/AweAI-Team/DeNovoSWE
Dataset on Hugging Face: https://huggingface.co/collections/AweAI-Team/denovoswe
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
