Tagged articles

Pass@1

2 articles · Page 1 of 1
Machine Heart
Machine Heart
Jun 15, 2026 · Artificial Intelligence

Breaking the SWE‑bench Score‑Only Myth: Open‑Source Benchmark that Independently Measures Harnesses

The article critiques the reliance on raw SWE‑bench scores for programming agents, introduces the Claw‑SWE‑Bench benchmark and a dedicated adapter that isolates harness effects, and presents extensive experiments showing how model choice, harness design, and cost impact real-world coding performance across multiple languages.

HarnessLLM AgentsPass@1
0 likes · 14 min read
Breaking the SWE‑bench Score‑Only Myth: Open‑Source Benchmark that Independently Measures Harnesses
Machine Learning Algorithms & Natural Language Processing
Machine Learning Algorithms & Natural Language Processing
May 1, 2026 · Artificial Intelligence

Agentic Harness Engineering Enables Agents to Self‑Evolve and Outperform Codex in 10 Rounds

The Agentic Harness Engineering (AHE) framework lets coding agents automatically read massive execution traces, identify failure patterns, and iteratively modify harness components—prompt, tools, middleware, and memory—achieving a pass@1 increase from 69.7% to 77.0% and surpassing human‑tuned Codex‑CLI after ten automated evolution rounds.

Agentic Harness EngineeringBenchmarkingObservability
0 likes · 9 min read
Agentic Harness Engineering Enables Agents to Self‑Evolve and Outperform Codex in 10 Rounds