Machine Heart
Jun 15, 2026 · Artificial Intelligence
Breaking the SWE‑bench Score‑Only Myth: Open‑Source Benchmark that Independently Measures Harnesses
The article critiques the reliance on raw SWE‑bench scores for programming agents, introduces the Claw‑SWE‑Bench benchmark and a dedicated adapter that isolates harness effects, and presents extensive experiments showing how model choice, harness design, and cost impact real-world coding performance across multiple languages.
Harness · LLM Agents · Pass@1
14 min read
