PaperAgent
Mar 6, 2026 · Artificial Intelligence

BeyondSWE: Rethinking Code Agent Benchmarks with Real‑World Multi‑Repo Challenges

BeyondSWE extends code‑agent evaluation beyond single‑repo bug fixing with four realistic scenarios spanning 246 repositories and 500 samples. Top models show a sharp performance drop on the benchmark, and search‑augmented agents such as SearchSWE have a nuanced effect on results.

AI evaluation · BeyondSWE · SearchSWE