Breaking the SWE‑bench Score‑Only Myth: Open‑Source Benchmark that Independently Measures Harnesses

The article critiques the reliance on raw SWE‑bench scores for programming agents, introduces the Claw‑SWE‑Bench benchmark and a dedicated adapter that isolates harness effects, and presents extensive experiments showing how model choice, harness design, and cost impact real-world coding performance across multiple languages.

Machine Heart

Problem

SWE‑bench scores conflate three variables: the underlying LLM, the harness that wraps the model into an agent, and the task set. Changing any of these can shift scores by tens of percentage points, making direct horizontal comparison unreliable.

Adapter for OpenClaw

OpenClaw, a general‑purpose agent, cannot natively produce the diff patches required by SWE‑bench. The authors introduce a claw‑for‑coding adapter that translates OpenClaw’s tool‑use and file edits into standardized diff patches that the SWE‑bench evaluator can apply.

Benchmark Construction

Using the adapter, they build Claw‑SWE‑Bench, a multilingual benchmark covering eight programming languages, 43 real repositories, and 350 GitHub issue‑fix tasks. A lightweight subset, Claw‑SWE‑Bench Lite‑80, selects 80 representative instances (10 per language) for frequent iteration.

All harnesses run with the same prompt template, the same 3600 s timeout per instance, and the same Docker environment, and every run reports its total API cost, so accuracy and expense can be compared on equal terms.
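As a rough illustration of how the two reported dimensions could be derived from per‑instance outcomes, the sketch below aggregates Pass@1, the patch‑application failure rate, and total API cost. The `InstanceResult` record and its field names are hypothetical and are not the benchmark's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceResult:
    """Outcome of one benchmark instance (hypothetical schema)."""
    instance_id: str
    patch_applied: bool  # could the evaluator apply the submitted patch?
    resolved: bool       # did the instance's tests pass after applying it?
    cost_usd: float      # API spend attributed to this instance


def summarize(results: list[InstanceResult]) -> dict[str, float]:
    """Aggregate single-attempt accuracy and cost over one run."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    return {
        "pass_at_1": sum(r.resolved for r in results) / n,
        "patch_failure_rate": sum(not r.patch_applied for r in results) / n,
        "total_cost_usd": sum(r.cost_usd for r in results),
    }


# Made-up outcomes, only to exercise the function.
demo = [
    InstanceResult("a", True, True, 0.50),
    InstanceResult("b", True, False, 0.25),
    InstanceResult("c", False, False, 0.25),
    InstanceResult("d", True, True, 1.00),
]
summary = summarize(demo)
```

Reporting all three numbers from the same records keeps accuracy and cost tied to an identical set of instances.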

Adapter Ablation

Two adapter variants are evaluated on 350 instances with GLM 5.1:

Bare adapter: forwards the task description and expects the model to output a unified diff directly. Result: Pass@1 = 19.1 % with a 69.1 % patch‑application failure rate.

Full adapter: lets the agent edit files in the workspace; a runner extracts the diff from Git state. Result: Pass@1 = 73.4 % and failure rate < 1.5 %.
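The idea behind the full adapter, deriving the patch from the workspace rather than asking the model to type one, can be sketched as follows. This is not the adapter's code: to stay self‑contained it swaps Git for Python's `difflib` and compares two in‑memory snapshots, and the function name `workspace_diff` is made up for the example.

```python
import difflib


def workspace_diff(before: dict[str, str], after: dict[str, str]) -> str:
    """Build a unified diff between two {path: text} workspace snapshots.

    Simplification: added or deleted files are diffed against empty text
    instead of /dev/null, which a real Git-based runner would handle.
    """
    chunks: list[str] = []
    for path in sorted(set(before) | set(after)):
        old = before.get(path, "")
        new = after.get(path, "")
        if old == new:
            continue  # untouched files contribute nothing to the patch
        chunks.extend(
            difflib.unified_diff(
                old.splitlines(keepends=True),
                new.splitlines(keepends=True),
                fromfile=f"a/{path}",
                tofile=f"b/{path}",
            )
        )
    return "".join(chunks)


before = {"calc.py": "def add(a, b):\n    return a - b\n"}
after = {"calc.py": "def add(a, b):\n    return a + b\n"}
patch = workspace_diff(before, after)
```

Because the diff is computed mechanically from file contents, hunk headers and context lines are generated rather than written by the model, which is one plausible reason the failure rate drops so sharply in the ablation.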

Dataset Clean‑up

During construction the team discovered answer leakage in the original SWE‑bench‑Multilingual dataset: commits made after base_commit, which can include the fix itself, remained visible in the repository history available to the agent. They removed the leaked history for non‑Python tasks, incorporated the fix into the evaluation pipeline, and submitted a PR upstream.
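To make the leakage condition concrete, here is a toy model of a commit graph. It flags any commit that is reachable from a branch or tag but is not `base_commit` or one of its ancestors. The data structures and function names are assumptions for illustration; the article does not describe how the actual pipeline fix is implemented.

```python
def ancestors(parents: dict[str, list[str]], start: str) -> set[str]:
    """Return `start` plus every commit reachable through parent links."""
    seen: set[str] = set()
    stack = [start]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen


def leaked_commits(
    parents: dict[str, list[str]],
    refs: dict[str, str],
    base_commit: str,
) -> set[str]:
    """Commits visible through some ref that postdate or sidestep base_commit."""
    allowed = ancestors(parents, base_commit)
    visible: set[str] = set()
    for tip in refs.values():
        visible |= ancestors(parents, tip)
    return visible - allowed


# Toy history: c1 <- c2 (base_commit) <- c3 (the later fix).
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"]}
leaky = leaked_commits(parents, {"main": "c3"}, "c2")  # branch still points at the fix
clean = leaked_commits(parents, {"main": "c2"}, "c2")  # history trimmed to the base
```

An empty result means the agent can only see history up to and including the base commit.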

Model Sweep (Fixed Harness)

With OpenClaw fixed as the harness, nine LLMs were evaluated. Pass@1 ranged from 48.6 % (Seed 2.0‑mini) to 78.0 % (GPT 5.5), a span of 29.4 percentage points. API cost varied by up to two orders of magnitude. Notable cost‑accuracy points:

GPT 5.5: $1,399 total cost, Pass@1 = 78.0 %.

Claude Opus 4.7: $1,082, Pass@1 = 77.1 %.

DeepSeek‑V4 Flash: $8.2, Pass@1 = 70.3 %.

DeepSeek‑V4 Pro: $81, Pass@1 = 71.7 %.

Qwen 3.6‑flash: $71, Pass@1 = 66.0 %.
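The trade‑off in these figures is easier to see as ratios. The short calculation below uses only the numbers quoted above; the helper `compare` and the "dollars per extra percentage point" framing are additions for illustration, not metrics from the paper.

```python
# (total API cost in USD, Pass@1 in %) as quoted in the list above.
reported = {
    "GPT 5.5": (1399.0, 78.0),
    "Claude Opus 4.7": (1082.0, 77.1),
    "DeepSeek-V4 Flash": (8.2, 70.3),
    "DeepSeek-V4 Pro": (81.0, 71.7),
    "Qwen 3.6-flash": (71.0, 66.0),
}


def compare(cheap: str, costly: str) -> tuple[float, float, float]:
    """Return (cost ratio, Pass@1 gain in pp, extra USD per extra pp)."""
    cheap_cost, cheap_acc = reported[cheap]
    costly_cost, costly_acc = reported[costly]
    cost_ratio = costly_cost / cheap_cost
    gain_pp = costly_acc - cheap_acc
    usd_per_pp = (costly_cost - cheap_cost) / gain_pp
    return round(cost_ratio, 1), round(gain_pp, 1), round(usd_per_pp, 2)


ratio, gain, usd_per_pp = compare("DeepSeek-V4 Flash", "GPT 5.5")
```

On these figures, GPT 5.5 costs roughly 170 times as much as DeepSeek‑V4 Flash for a gain of 7.7 percentage points.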

Harness Sweep (Fixed Model)

With GLM 5.1 and Qwen 3.6‑flash fixed as models, five harnesses (OpenClaw, Hermes‑agent, ZeroClaw, GenericAgent, Nanobot) were compared.

GLM 5.1 Pass@1 ranged from 60.9 % to 73.4 % (12.5 pp gap).

Qwen 3.6‑flash Pass@1 ranged from 38.6 % (GenericAgent) to 66.0 % (OpenClaw), a 27.4 pp gap.

These results demonstrate that harness design can affect performance as much as changing the model.
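The comparison behind this claim is simple arithmetic on the extremes reported in the two sweeps. The sketch below recomputes the gaps; the labels "lowest harness" and "highest harness" are placeholders, since the text does not name the harnesses at the ends of the GLM 5.1 range.

```python
def span_pp(scores: dict[str, float]) -> float:
    """Gap between best and worst Pass@1, in percentage points."""
    return round(max(scores.values()) - min(scores.values()), 1)


# Extremes only, taken from the figures quoted in the text.
model_sweep = {"Seed 2.0-mini": 48.6, "GPT 5.5": 78.0}          # harness fixed
harness_glm = {"lowest harness": 60.9, "highest harness": 73.4}  # GLM 5.1 fixed
harness_qwen = {"GenericAgent": 38.6, "OpenClaw": 66.0}          # Qwen 3.6-flash fixed

spans = {
    "models (OpenClaw fixed)": span_pp(model_sweep),
    "harnesses (GLM 5.1 fixed)": span_pp(harness_glm),
    "harnesses (Qwen 3.6-flash fixed)": span_pp(harness_qwen),
}
```

With Qwen 3.6‑flash, swapping the harness moves Pass@1 by 27.4 pp, close to the 29.4 pp spread across all nine models.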

Lite‑80 Subset

Lite‑80 selects 10 instances per language, controlling language distribution, difficulty quartiles, and repository coverage. It uses 17 calibration columns to fit the behavior of the full 350‑instance benchmark.

Running Lite‑80 costs about 22.9 % of the full benchmark. On the 17 calibration columns, the average Pass@1 is 0.639 on the full 350 instances and 0.643 on Lite‑80 (a difference of about 0.4 pp). The subset covers 34 of the 43 repositories (79 %).
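One way such a subset could be drawn is sketched below: stratify by language, walk the difficulty quartiles in turn, and prefer repositories not yet represented. This is an assumed procedure, not the authors' selection algorithm, and the `Instance` fields, the difficulty score, and the placeholder language names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    instance_id: str
    language: str
    repo: str
    difficulty: float  # hypothetical score; higher means harder


def select_lite(instances: list[Instance], per_language: int = 10) -> list[Instance]:
    """Pick `per_language` instances per language, spread over difficulty quartiles."""
    by_language: dict[str, list[Instance]] = defaultdict(list)
    for inst in instances:
        by_language[inst.language].append(inst)

    selected: list[Instance] = []
    for language in sorted(by_language):
        pool = sorted(by_language[language], key=lambda i: (i.difficulty, i.instance_id))
        n = len(pool)
        # Four contiguous slices of the sorted pool act as difficulty quartiles.
        buckets = [pool[n * q // 4 : n * (q + 1) // 4] for q in range(4)]
        seen_repos: set[str] = set()
        picked = 0
        q = 0
        while picked < per_language and any(buckets):
            bucket = buckets[q % 4]
            q += 1
            if not bucket:
                continue
            # Prefer a repository this language has not contributed yet.
            choice = next((i for i in bucket if i.repo not in seen_repos), bucket[0])
            bucket.remove(choice)
            seen_repos.add(choice.repo)
            selected.append(choice)
            picked += 1
    return selected


# Synthetic pool: 8 placeholder languages, 20 instances and 5 repos each.
pool = [
    Instance(f"lang{l}-{k}", f"lang{l}", f"repo{l}-{k % 5}", difficulty=k / 20)
    for l in range(8)
    for k in range(20)
]
lite = select_lite(pool)
```

A subset chosen this way would still need the calibration step described above, comparing its average Pass@1 against the full benchmark on existing result columns.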

Cost‑Accuracy Trade‑off

Pareto‑front analysis plots total API cost (x‑axis) against Pass@1 (y‑axis). Key points among the harness‑sweep combinations:

GenericAgent × Qwen 3.6‑flash: $14.5, Pass@1 = 38.6 % (lowest cost, lowest accuracy).

ZeroClaw × Qwen 3.6‑flash: $49, Pass@1 = 58.3 %.

OpenClaw × Qwen 3.6‑flash: $71, Pass@1 = 66.0 %.

OpenClaw × GLM 5.1: $277, Pass@1 = 73.4 % (highest accuracy, highest cost).

The analysis shows that combinations with similar Pass@1 can differ in cost by two orders of magnitude, underscoring the need to report both dimensions.
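A Pareto front over (cost, Pass@1) pairs can be computed in a few lines. The sketch below pools every combination for which this article quotes both numbers, treating the model‑sweep entries as OpenClaw runs since OpenClaw was the fixed harness there. The paper's own plot may contain more points, so the resulting front is illustrative only.

```python
def pareto_front(points: dict[str, tuple[float, float]]) -> list[str]:
    """Names of points not beaten on both cost (lower) and accuracy (higher)."""
    front: list[str] = []
    best_acc = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, higher accuracy first.
    ordered = sorted(points.items(), key=lambda kv: (kv[1][0], -kv[1][1]))
    for name, (_cost, acc) in ordered:
        if acc > best_acc:  # strictly better than everything cheaper
            front.append(name)
            best_acc = acc
    return front


# (total API cost in USD, Pass@1 in %) as quoted in this article.
points = {
    "OpenClaw × GPT 5.5": (1399.0, 78.0),
    "OpenClaw × Claude Opus 4.7": (1082.0, 77.1),
    "OpenClaw × DeepSeek-V4 Flash": (8.2, 70.3),
    "OpenClaw × DeepSeek-V4 Pro": (81.0, 71.7),
    "OpenClaw × Qwen 3.6-flash": (71.0, 66.0),
    "OpenClaw × GLM 5.1": (277.0, 73.4),
    "ZeroClaw × Qwen 3.6-flash": (49.0, 58.3),
    "GenericAgent × Qwen 3.6-flash": (14.5, 38.6),
}
front = pareto_front(points)
```

Among the pooled points, the three Qwen 3.6‑flash combinations fall off the front because the DeepSeek‑V4 Flash run is both cheaper and more accurate.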

Conclusions

Programming‑agent evaluation must consider model strength, harness architecture, tool interfaces, runtime budget, and cost accounting rather than a single Pass@1 figure. The Claw‑SWE‑Bench suite provides a controlled platform to isolate these variables.

Paper: https://arxiv.org/pdf/2606.12344v1

GitHub repository: https://github.com/opensquilla/claw-swe-bench

Dataset on Hugging Face: https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench
