Introducing DeNovoSWE: The First Long‑Horizon Doc2Repo Training Set for Code Agents

DeNovoSWE is a newly released large‑scale dataset of 4,818 high‑quality document‑to‑repository tasks. It uses a Divide‑and‑Conquer pipeline with a Draft‑Critic‑Repair loop to generate well‑organized, evaluation‑aligned specifications. In experiments, training on it raises an LLM code agent's repository‑level success rate from 5.8% to 47.2% on BeyondSWE‑Doc2Repo and from 4.3% to 23.0% on NL2RepoBench.

Machine Learning Algorithms & Natural Language Processing

Jun 25, 2026

Overview

DeNovoSWE is a dataset of 4,818 real‑world document‑to‑repository (Doc2Repo) instances. Each instance provides a single task document that is the only entry point for an agent to reconstruct a complete, testable code repository.

Dataset Construction

The dataset is generated automatically by a sandboxed multi‑agent workflow that follows a two‑stage Divide & Conquer process.

Divide stage: The target repository is analyzed and split into distinct capabilities (e.g., authentication, data I/O, batch processing). Execution traces from the original unit tests are collected to classify components into three categories:

direct components – APIs called directly by the tests;

core indirect components – APIs that affect observable behavior without being called directly;

non‑core indirect components – internal implementations that can be left to the agent.

This classification ensures that only essential APIs and behaviors are documented.
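
The three-way split above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `Role` enum, and the two input sets (which in the real pipeline would be derived from execution traces) are all assumptions made for the example.

```python
from enum import Enum


class Role(Enum):
    DIRECT = "direct"
    CORE_INDIRECT = "core indirect"
    NON_CORE_INDIRECT = "non-core indirect"


def classify_components(components, called_by_tests, affects_behavior):
    """Assign each component one of the three roles.

    components: iterable of component names found in the repository.
    called_by_tests: set of names the unit tests invoke directly
        (hypothetical input, assumed to come from execution traces).
    affects_behavior: set of names that influence test-observable
        behavior without being called directly (also hypothetical).
    """
    roles = {}
    for name in components:
        if name in called_by_tests:
            roles[name] = Role.DIRECT
        elif name in affects_behavior:
            roles[name] = Role.CORE_INDIRECT
        else:
            roles[name] = Role.NON_CORE_INDIRECT
    return roles


def documented(roles):
    """Return the names that belong in the specification.

    Non-core indirect components are left out so the agent can
    implement them however it likes.
    """
    return sorted(
        name for name, role in roles.items()
        if role is not Role.NON_CORE_INDIRECT
    )
```

For example, with components `login`, `hash_password`, and `_pad`, where the tests call only `login` and `hash_password` shapes the observable output, only `login` and `hash_password` would be documented.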

Conquer stage: For each capability, a Draft‑Critic‑Repair loop generates the specification.

The Draft agent writes an initial description of the capability.

The Critic agent checks the draft for missing APIs, contracts, or structural details required for evaluation.

The Repair agent iterates on the draft based on the Critic’s feedback until the description satisfies the evaluation criteria.

All capability sections are then merged into a single, well‑organized task document.
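
The control flow of the Conquer stage can be sketched as follows. The three agents are passed in as plain callables, and the `max_rounds` cap is an assumption added so the loop always terminates; the article only says the Repair agent iterates until the evaluation criteria are met. All names here are hypothetical.

```python
from typing import Callable, List


def draft_critic_repair(
    capability: str,
    draft: Callable[[str], str],
    critic: Callable[[str], List[str]],
    repair: Callable[[str, List[str]], str],
    max_rounds: int = 5,  # assumed safety cap, not stated in the article
) -> str:
    """Produce a specification section for one capability."""
    spec = draft(capability)
    for _ in range(max_rounds):
        issues = critic(spec)  # e.g. missing APIs or contracts
        if not issues:
            break  # the critic is satisfied
        spec = repair(spec, issues)
    return spec


def merge_sections(sections: List[str]) -> str:
    """Join per-capability sections into one task document."""
    return "\n\n".join(sections)
```

In practice each callable would wrap an LLM call. Keeping them as parameters makes the loop easy to exercise with simple stand-in functions.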

Long‑Horizon Challenge

To force agents to rely solely on the document, the dataset removes source code and test artifacts, resets the Git history, and clears caches (e.g., site‑packages, pip wheels, temporary compilation products). Consequently, agents must plan repository structure, create modules, design APIs, handle dependencies, and iteratively debug using only the provided specification.
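
A workspace cleanup of this kind might look like the sketch below. The directory names, file suffixes, and the `TASK.md` file name are illustrative assumptions, since the article does not list the actual cleanup rules. Git history is dropped here simply by deleting the `.git` directory, which is one possible way to reset it.

```python
import shutil
from pathlib import Path
from typing import List

# Hypothetical patterns; the real cleanup rules are not given in the article.
REMOVE_DIRS = {".git", "__pycache__", ".pytest_cache", "build", "tests"}
REMOVE_SUFFIXES = {".py", ".pyc", ".whl"}


def sanitize_workspace(root: Path, task_doc: str = "TASK.md") -> List[str]:
    """Strip code, tests, history, and caches; keep the task document.

    Returns the removed paths relative to root, sorted.
    """
    removed = []
    # Visit shallow paths first so a removed directory takes its
    # children with it in a single step.
    for path in sorted(root.rglob("*"), key=lambda p: len(p.parts)):
        if not path.exists():
            continue  # already deleted along with a parent directory
        if path.name == task_doc:
            continue  # the specification is the agent's only input
        if path.is_dir() and path.name in REMOVE_DIRS:
            shutil.rmtree(path)
            removed.append(path.relative_to(root).as_posix())
        elif path.is_file() and path.suffix in REMOVE_SUFFIXES:
            path.unlink()
            removed.append(path.relative_to(root).as_posix())
    return sorted(removed)
```

Because the function deletes files, it should only ever be pointed at a disposable copy of the repository, such as a sandbox directory.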

Difficulty‑Aware Trajectory Filtering

Each task is scored by structural complexity and LLM‑predicted difficulty. The filtering policy applies higher pass‑rate thresholds to easier tasks while retaining trajectories for harder tasks even when their pass rates are lower. This balances quality (high pass rates) against diversity (including challenging long‑horizon examples).
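
One simple way to express such a policy is a threshold that relaxes as difficulty grows. The sketch below assumes a single difficulty score in [0, 1] and a linear interpolation between two thresholds. The scale, the 0.9 and 0.5 values, and all names are assumptions for illustration, not figures from the article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    task_id: str
    difficulty: float  # 0.0 (easy) to 1.0 (hard); hypothetical combined score
    pass_rate: float   # fraction of evaluation tests the result passes


def required_pass_rate(
    difficulty: float,
    easy_threshold: float = 0.9,  # assumed value
    hard_threshold: float = 0.5,  # assumed value
) -> float:
    """Interpolate linearly from the easy to the hard threshold."""
    d = min(max(difficulty, 0.0), 1.0)  # clamp to [0, 1]
    return easy_threshold - d * (easy_threshold - hard_threshold)


def filter_trajectories(trajectories: List[Trajectory]) -> List[Trajectory]:
    """Keep trajectories that meet the threshold for their difficulty."""
    return [
        t for t in trajectories
        if t.pass_rate >= required_pass_rate(t.difficulty)
    ]
```

Under these assumed values, an easy task with a 0.8 pass rate is dropped, while a maximally hard task with a 0.6 pass rate is kept.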

Experimental Results

Training on DeNovoSWE yields substantial improvements for code agents:

Qwen3‑30B‑A3B‑Instruct on BeyondSWE‑Doc2Repo: success rate rises from 5.8% (baseline) to 47.2%.

Same model on NL2RepoBench: success rate rises from 4.3% to 23.0%.

Scale‑SWE‑Agent, trained on issue‑level SWE data, lifts the baseline to 29.2% (BeyondSWE‑Doc2Repo) and 18.3% (NL2RepoBench), showing that conventional SWE data provides some transfer.

Using the stronger backbone Qwen3.5‑35B‑A3B, DeNovoSWE raises BeyondSWE‑Doc2Repo from 43.8% to 50.0% and NL2RepoBench from 23.5% to 27.1%.
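
To compare the before-and-after figures at a glance, the absolute gains in percentage points can be computed directly from the numbers reported above. The snippet is only a convenience calculation. The success rates come from the article, while the variable and function names are made up for the example.

```python
# (model, benchmark, success rate before, success rate after), in percent.
RESULTS = [
    ("Qwen3-30B-A3B-Instruct", "BeyondSWE-Doc2Repo", 5.8, 47.2),
    ("Qwen3-30B-A3B-Instruct", "NL2RepoBench", 4.3, 23.0),
    ("Qwen3.5-35B-A3B", "BeyondSWE-Doc2Repo", 43.8, 50.0),
    ("Qwen3.5-35B-A3B", "NL2RepoBench", 23.5, 27.1),
]


def gains(rows):
    """Return (model, benchmark, gain in percentage points) per row."""
    return [
        (model, bench, round(after - before, 1))
        for model, bench, before, after in rows
    ]


if __name__ == "__main__":
    for model, bench, gain in gains(RESULTS):
        print(f"{model} on {bench}: +{gain} points")
```

The gains work out to 41.4 and 18.7 points for the 30B model and 6.2 and 3.6 points for the stronger backbone, so the weaker starting model benefits far more in absolute terms.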

These results indicate that high‑quality, long‑horizon data contributes more than bug‑fix‑oriented datasets to training agents capable of full repository generation: issue‑level data transfers only partially, while Doc2Repo training gives the largest gains.

Resources

Paper: https://arxiv.org/pdf/2606.10728

GitHub repository: https://github.com/AweAI-Team/DeNovoSWE

Dataset on Hugging Face: https://huggingface.co/collections/AweAI-Team/denovoswe
