BeyondSWE: Rethinking Code Agent Benchmarks with Real‑World Multi‑Repo Challenges

BeyondSWE expands code‑agent evaluation beyond single‑repo bug fixing by introducing four realistic scenarios, scaling to 246 repositories and 500 samples, revealing a sharp performance drop for top models and highlighting the nuanced impact of search‑augmented agents like SearchSWE.


BeyondSWE Redefines Code‑Agent Evaluation

While many models surpass 80% on SWE‑bench Verified, the benchmark only touches a tiny slice of real software engineering, where bugs may span multiple projects, require domain knowledge, or involve large‑scale migrations.

To address this gap, the RUC Gaoling AI Institute launched BeyondSWE, redesigning the evaluation along two axes: code change scope (how many files/lines are modified) and knowledge source (where the solution information resides). The benchmark covers four authentic scenarios:

CrossRepo (200 items): fix a bug in one repository while the answer lives in external resources such as upstream issues, Stack Overflow posts, or other projects.

DomainFix (72 items): tasks crossing disciplines like quantum physics or bioinformatics, vetted by domain experts; syntactically correct but physically wrong solutions receive zero.

DepMigrate (178 items): systematic migrations (e.g., Pydantic v1→v2, NumPy 1.x→2.0) affecting dozens of files; a minimal migration sketch follows this list.

Doc2Repo (50 items): given only a design document, agents must build a complete repository from scratch, with many examples exceeding 4,000 lines of code.
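
To make DepMigrate concrete, here is a minimal sketch of the kind of mechanical-but-pervasive change a Pydantic v1→v2 migration involves. The example is illustrative only, not drawn from the benchmark:

```python
# Before: Pydantic v1 validator and serialization APIs.
from pydantic import BaseModel, validator

class User(BaseModel):
    name: str

    @validator("name")
    def name_not_empty(cls, v):
        assert v.strip(), "name must be non-empty"
        return v

print(User(name="Ada").dict())
```

```python
# After: the same model migrated to Pydantic v2.
from pydantic import BaseModel, field_validator

class User(BaseModel):
    name: str

    @field_validator("name")
    @classmethod
    def name_not_empty(cls, v: str) -> str:
        assert v.strip(), "name must be non-empty"
        return v

print(User(name="Ada").model_dump())  # .dict() became .model_dump() in v2
```

Repeated across dozens of files and interdependent call sites, even these mechanical renames add up to a systematic, repo-wide change.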

The dataset comprises 246 real repositories and 500 examples, averaging 5.6 modified files and 209.9 changed lines per task, roughly 18 times the change scale of SWE‑bench Verified.
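
For intuition, one can imagine each sample carrying metadata along both evaluation axes. The sketch below is purely illustrative; the field names are assumptions, not the benchmark's published schema:

```python
# Hypothetical shape of a single BeyondSWE-style sample (all field names assumed).
sample = {
    "scenario": "CrossRepo",                # one of the four task categories
    "repo": "github.com/acme/widgets",      # hypothetical repository under test
    "knowledge_source": "upstream_issue",   # axis 2: where the solution info lives
    "change_scope": {"files": 5, "lines": 210},  # axis 1: size of the gold patch
}
```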

Performance Collapse on the New Benchmark

Models that achieved over 80% on SWE‑bench Verified (Gemini 3 Pro, GPT‑5.2, DeepSeek‑V3.2, Kimi‑K2, Seed‑Coder, etc.) all fell below 45% on BeyondSWE, and no single model dominated across all four task categories. The DomainFix subset proved especially harsh, with no model surpassing 36%.

SearchSWE: Adding Internet Search to Agents

The SearchSWE framework equips agents with a search engine and web‑browsing tools while enforcing strict anti‑cheating measures. Experiments with nine models showed mixed results: six improved modestly, three regressed. Notably, Gemini 3 Pro improved the most (+2.0% overall) despite searching only once per task, whereas DeepSeek‑V3.2 searched 4–5 times per task and saw a slight decline.
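
The summary does not detail SearchSWE's anti-cheating mechanism, but one plausible reading is that it filters out search hits that could leak the gold patch, such as the target repository's own pull-request or commit pages. A minimal sketch under that assumption (the function and blocklist heuristic are hypothetical):

```python
from urllib.parse import urlparse

# Paths on the target repo that could expose the gold patch (assumed heuristic).
LEAK_PATH_HINTS = ("/pull/", "/commit/", "/compare/")

def filter_results(results: list[dict], target_repo: str) -> list[dict]:
    """Drop search hits that point back at the solution.

    results: assumed shape [{"url": ..., "snippet": ...}, ...]
    target_repo: e.g. "github.com/acme/widgets" (hypothetical)
    """
    safe = []
    for hit in results:
        parsed = urlparse(hit["url"])
        location = parsed.netloc + parsed.path
        if target_repo in location and any(h in location for h in LEAK_PATH_HINTS):
            continue  # potential gold-patch leak: discard this hit
        safe.append(hit)
    return safe
```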

Failure analysis identified three core issues:

Search engines excel at high‑level documentation, but code tasks often need information buried in source files or commit diffs (information‑landscape mismatch).

Search results typically reflect the latest version, while the local environment may be locked to an older release (temporal mismatch; see the sketch after this list).

Queries about rare libraries suffer from semantic drift, with results matching unrelated domains that happen to share terminology.
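
One mitigation the temporal-mismatch finding suggests (not a fix described in the paper) is to pin search queries to the locally installed release rather than letting the engine surface the latest docs. A minimal sketch; the query format is an assumption:

```python
from importlib.metadata import version

def pinned_query(package: str, question: str) -> str:
    """Prefix a search query with the locally installed version so results
    match the environment's release instead of the newest documentation."""
    installed = version(package)  # e.g. "1.26.4" for a pinned NumPy
    return f"{package} {installed} {question}"

# pinned_query("numpy", "np.float removed replacement")
# -> "numpy 1.26.4 np.float removed replacement"
```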

The key insight is that search and coding capabilities are each mature, yet their effective integration does not emerge automatically; models must learn when to search and how to use retrieved information. Advancing toward “Deep Research for Coding”, which seamlessly blends search with reasoning, is identified as the next critical direction, with BeyondSWE and SearchSWE providing a systematic evaluation foundation.

Paper title: BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?
Project page: https://aweai-team.github.io/BeyondSWE/
Paper link: http://arxiv.org/abs/2603.03194
Code link: https://github.com/AweAI-Team/BeyondSWE
Scaffold link: https://github.com/AweAI-Team/AweAgent
Leaderboard: https://aweai-team.github.io/BeyondSWE_leaderboard/
Tags: Software Engineering, AI Evaluation, Code Agents, BeyondSWE, SearchSWE