Jun 15, 2026 · Artificial Intelligence

Breaking the SWE‑bench Score‑Only Myth: Open‑Source Benchmark that Independently Measures Harnesses

The article critiques the reliance on raw SWE‑bench scores for programming agents, introduces the Claw‑SWE‑Bench benchmark and a dedicated adapter that isolates harness effects, and presents extensive experiments showing how model choice, harness design, and cost impact real-world coding performance across multiple languages.

Harness · LLM Agents · Pass@1

14 min read

Machine Learning Algorithms & Natural Language Processing

May 1, 2026 · Artificial Intelligence

Agentic Harness Engineering Enables Agents to Self‑Evolve and Outperform Codex in 10 Rounds

The Agentic Harness Engineering (AHE) framework lets coding agents automatically read massive execution traces, identify failure patterns, and iteratively modify harness components—prompt, tools, middleware, and memory—achieving a pass@1 increase from 69.7% to 77.0% and surpassing human‑tuned Codex‑CLI after ten automated evolution rounds.

Agentic Harness Engineering · Benchmarking · Observability

9 min read
