How Harness Design Alters Coding Agent Scores: Insights from the First Independent Claw‑SWE‑Bench

Claw‑SWE‑Bench isolates the model, harness, and task variables. Changing only the harness can shift Pass@1 by up to 27.4 percentage points and change cost about fivefold for the same model. The benchmark also ships a lightweight 80‑question Lite version for rapid, low‑cost evaluation.

SuanNi

Evaluating programming agents has long been ambiguous because scores on SWE‑bench and similar suites conflate three factors: the underlying model, the harness (framework) that drives the agent, and the specific task set.

Claw‑SWE‑Bench, a 350‑question multilingual benchmark covering eight languages and 43 repositories, decouples these variables by fixing the task set, prompts, runtime budget, and scoring pipeline, allowing only the model and harness to vary.

In the model‑axis experiment, nine models ranging from the flagship GPT‑5.5 to the lightweight Seed 2.0‑mini were evaluated with a single harness (OpenClaw). GPT‑5.5 achieved the highest Pass@1 at 78.0 %, Claude Opus 4.7 followed at 77.1 %, and Seed 2.0‑mini scored lowest at 48.6 %, a spread of 29.4 percentage points (pp).

In the harness‑axis experiment, two models (GLM 5.1 and Qwen 3.6‑flash) were each run with five different harnesses (OpenClaw, Hermes‑agent, ZeroClaw, NanoBot, Generic Agent). For GLM 5.1, Pass@1 ranged from 60.9 % to 73.4 %, a 12.5 pp difference. For the smaller Qwen 3.6‑flash, the gap widened to 27.4 pp (38.6 % to 66.0 %), nearly as large as the 29.4 pp spread across all nine models in the model‑axis experiment. In other words, harness design can move scores about as much as swapping to a higher‑tier model.
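The comparison between the two axes is plain arithmetic on the figures quoted above. A minimal Python sketch follows, using only the reported minimum and maximum Pass@1 per model (the scores of the intermediate harnesses are not given here). The variable and function names are illustrative, not part of the benchmark.

```python
# Pass@1 extremes (percent) reported for the harness-axis experiment.
# Only the worst and best harness per model are quoted in the text.
reported_range = {
    "GLM 5.1": (60.9, 73.4),
    "Qwen 3.6-flash": (38.6, 66.0),
}


def spread_pp(low: float, high: float) -> float:
    """Gap between the worst and best score, in percentage points."""
    return round(high - low, 1)


harness_spread = {
    model: spread_pp(low, high) for model, (low, high) in reported_range.items()
}

# Model-axis experiment: Seed 2.0-mini (48.6) versus GPT-5.5 (78.0).
model_spread = spread_pp(48.6, 78.0)

for model, gap in harness_spread.items():
    share = gap / model_spread
    print(f"{model}: harness spread {gap} pp ({share:.0%} of the model spread)")
```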

General‑purpose agents previously could not participate in SWE‑bench because the benchmark expects a diff patch that applies directly to a repository, while generic agents produce varied output formats (JSON, natural language, etc.) and may generate auxiliary files that pollute the diff. Claw‑SWE‑Bench introduces an Adapter layer that translates arbitrary agent interactions into a valid patch and confines code edits to the /testbed directory inside the Docker container.

A contrast experiment showed that a bare adapter achieved only 19.1 % Pass@1, whereas the full adapter raised Pass@1 to 73.4 % and brought the patch apply‑failure rate below 1.5 %.
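One part of what such an adapter has to do, keeping only edits inside /testbed and dropping auxiliary files before a patch is built, can be sketched in Python. This is an assumption-laden illustration, not the benchmark's actual Adapter: the function names and the auxiliary-file patterns are hypothetical, and the real filtering rules are not described here.

```python
import posixpath
from fnmatch import fnmatchcase

TESTBED = "/testbed"  # directory named in the text

# Hypothetical patterns for auxiliary files; the real rules are not given.
AUXILIARY_PATTERNS = ("*.log", "*.orig", "*.rej")


def is_inside_testbed(path: str) -> bool:
    """Resolve a path against /testbed and check that it stays inside it."""
    resolved = posixpath.normpath(posixpath.join(TESTBED, path))
    return resolved == TESTBED or resolved.startswith(TESTBED + "/")


def patchable_paths(changed_paths):
    """Keep only the changed files that should appear in the final patch."""
    kept = []
    for path in changed_paths:
        if not is_inside_testbed(path):
            continue  # edit escaped the sandbox directory
        name = posixpath.basename(path)
        if any(fnmatchcase(name, pattern) for pattern in AUXILIARY_PATTERNS):
            continue  # scratch output that would pollute the diff
        kept.append(path)
    return kept


changed = [
    "/testbed/src/app.py",     # legitimate edit
    "/testbed/debug.log",      # auxiliary file
    "/tmp/scratch.py",         # outside /testbed
    "/testbed/../etc/passwd",  # escapes /testbed after normalization
    "lib/util.py",             # relative path, resolved against /testbed
]
print(patchable_paths(changed))
```

Normalizing before the prefix check matters because a path such as `/testbed/../etc/passwd` starts with the right prefix but resolves outside the directory.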

Cost analysis reveals that a high Pass@1 does not come at a predictable price. GPT‑5.5 (78.0 % Pass@1) cost $1,399.1, while DeepSeek‑V4 Flash (70.3 % Pass@1) cost only $8.2: roughly a 170‑fold cost difference for a score gap of 7.7 pp. Harness choice also drives cost: OpenClaw × Qwen 3.6‑flash achieved 66.0 % Pass@1 for $71.5, whereas Generic Agent × Qwen 3.6‑flash delivered 38.6 % Pass@1 for just $14.5.
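To compare these runs on one scale, the figures above can be reduced to dollars per Pass@1 point. The sketch below is illustrative only: "USD per point" is an assumed metric chosen for this example, not one the benchmark reports, and the labels are shorthand for the runs quoted in the text.

```python
# (label, Pass@1 in percent, total cost in USD), as quoted in the text.
runs = [
    ("GPT-5.5", 78.0, 1399.1),
    ("DeepSeek-V4 Flash", 70.3, 8.2),
    ("OpenClaw x Qwen 3.6-flash", 66.0, 71.5),
    ("Generic Agent x Qwen 3.6-flash", 38.6, 14.5),
]


def usd_per_point(pass_at_1: float, cost_usd: float) -> float:
    """Dollars spent per Pass@1 percentage point (illustrative metric)."""
    return cost_usd / pass_at_1


efficiency = {label: usd_per_point(score, cost) for label, score, cost in runs}

cost_ratio = runs[0][2] / runs[1][2]  # GPT-5.5 versus DeepSeek-V4 Flash
score_gap = runs[0][1] - runs[1][1]

# Cheapest per point first.
for label, value in sorted(efficiency.items(), key=lambda item: item[1]):
    print(f"{label:32s} ${value:6.2f} per Pass@1 point")
print(f"cost ratio {cost_ratio:.1f}x for a {score_gap:.1f} pp gap")
```

Under this assumed metric the Generic Agent run is cheaper per point than the OpenClaw run despite its much lower score, which is one reason to report score and cost side by side rather than collapsing them into a single number.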

To reduce evaluation cost during development, Claw‑SWE‑Bench Lite selects 80 representative questions from the full set, 10 per language, while preserving the difficulty distribution: each language contributes 2, 3, 3, and 2 questions from its four difficulty quartiles. Lite’s average Pass@1 (64.3 %) differs by only 0.4 pp from the full benchmark (63.9 %). Across 5 harnesses and 2 models, the average absolute difference is 1.88 pp, with a maximum of 3.68 pp.
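A stratified selection of this shape can be sketched as follows. This is a minimal illustration, not the benchmark's actual procedure: how difficulty is measured and how questions are picked within a quartile are not stated here, so the sketch assumes a numeric difficulty score and seeded random sampling, and the demo data is synthetic.

```python
import random
from collections import defaultdict

QUOTA = (2, 3, 3, 2)  # questions taken from each difficulty quartile, per language


def select_lite(tasks, seed=0):
    """Pick a stratified subset from (task_id, language, difficulty) tuples.

    Raises ValueError if a quartile holds fewer tasks than its quota.
    """
    rng = random.Random(seed)
    by_language = defaultdict(list)
    for task in tasks:
        by_language[task[1]].append(task)

    selected = []
    for language in sorted(by_language):
        ranked = sorted(by_language[language], key=lambda task: task[2])
        n = len(ranked)
        for quartile, quota in enumerate(QUOTA):
            bucket = ranked[quartile * n // 4 : (quartile + 1) * n // 4]
            selected.extend(rng.sample(bucket, quota))
    return selected


# Synthetic demo: 8 placeholder languages with 40 tasks each and
# difficulty scores 0..39 (made-up values, not benchmark data).
demo_tasks = [
    (f"lang{lang}-task{i}", f"lang{lang}", (i * 7) % 40)
    for lang in range(8)
    for i in range(40)
]
lite = select_lite(demo_tasks)
print(len(lite), "tasks selected")
```

Fixing the seed keeps the subset reproducible, which matters if Lite scores are to be compared across runs.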

A cost breakdown shows that Lite reduces total expense to about 22.9 % of the full run. Input tokens, output tokens, cache reads, and wall‑clock time each fall to roughly a quarter of their full‑run values, indicating that Lite shrinks the scale of the evaluation rather than cherry‑picking cheap tasks.
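The "scale, not cherry-picking" reading can be checked against the question counts: if Lite tasks cost about as much as average tasks, its share of cost should match its share of questions. A small sketch, where the tolerance and the helper name are assumptions made for illustration:

```python
FULL_QUESTIONS = 350
LITE_QUESTIONS = 80
REPORTED_COST_SHARE = 0.229  # Lite cost as a fraction of the full run, from the text

question_share = LITE_QUESTIONS / FULL_QUESTIONS


def tracks_scale(resource_share: float, expected: float, tolerance: float = 0.02) -> bool:
    """True if a resource shrank roughly in proportion to the question count.

    The 2-point tolerance is an arbitrary choice for this example.
    """
    return abs(resource_share - expected) <= tolerance


print(f"question share: {question_share:.1%}")
print("cost tracks scale:", tracks_scale(REPORTED_COST_SHARE, question_share))
```

A subset built from unusually cheap tasks would show a cost share well below its question share; here the two are almost equal.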

Overall, Claw‑SWE‑Bench demonstrates that harness design is as critical as model capability for coding agents, and that both score and cost can vary dramatically across harnesses. When interpreting SWE‑bench leaderboards, one should always consider which harness was used.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
