BeyondSWE: Rethinking Code Agent Benchmarks with Real‑World Multi‑Repo Challenges

BeyondSWE expands code‑agent evaluation beyond single‑repo bug fixing by introducing four realistic scenarios, scaling to 246 repositories and 500 samples, revealing a sharp performance drop for top models and highlighting the nuanced impact of search‑augmented agents like SearchSWE.

AI evaluationBeyondSWESearchSWE

0 likes · 6 min read

BeyondSWE: Rethinking Code Agent Benchmarks with Real‑World Multi‑Repo Challenges