Artificial Intelligence 7 min read

Why Claude 4.6 Scores Only 66%: Claw‑Eval‑Live Shows Terminal Skills Aren’t Enough

The article explains that modern AI agents must be judged on actual task execution and audit evidence, and Claw‑Eval‑Live reveals that while agents can use terminals, they still fail dramatically on cross‑system workflows such as HR, management, and operations, with no model surpassing a 70% pass rate.

PaperAgent

May 4, 2026

Why Claude 4.6 Scores Only 66%: Claw‑Eval‑Live Shows Terminal Skills Aren’t Enough

Today's AI Agents are no longer limited to answering questions; they can call APIs, query databases, modify workspaces, and trigger services. Consequently, evaluating them requires more than checking whether the final answer looks correct—it must verify that the agent truly performed the required actions safely and in the correct environment.

Claw‑Eval‑Live is the live extension of the Claw‑Eval series: the former first determines whether an Agent actually completed the task, and the latter further asks whether the benchmarked task still represents real‑world workflows.

The core of Claw‑Eval is to turn the execution process into auditable evidence. Each evaluation runs in an isolated environment, and scoring is based on execution traces, server audit logs, and post‑run environment snapshots rather than just the final output. The paper’s ablation study shows that when LLM judges only see dialogue and scoring scripts without audit logs and snapshots, they miss 44% of security violations and 13% of robustness issues, meaning result‑only evaluation systematically overestimates Agent performance.

However, accurate scoring is insufficient because agents operate on workflows that evolve over time. Today’s common tasks involve cross‑system reconciliation; tomorrow they may involve HR onboarding, ticket routing, calendar coordination, or supplier payment verification. Static benchmarks are reproducible but may no longer reflect actual demand.

Claw‑Eval‑Live addresses this drift. Instead of randomly changing questions each day, each release becomes a timestamped real‑world slice: the signal layer observes public workflow demand signals, while the release layer freezes task definitions, execution environments, data fixtures, and scoring scripts, ensuring results remain reproducible and comparable.

The ClawHub signals are not ground‑truth demand nor an automatic question generator; they are a public, inspectable demand prior. The system collects signals, clusters patterns, weights families, runs candidate tasks, and then uses MILP to select public tasks while constraining release scale, family coverage, and leaderboard discrimination.

The current public release contains 105 tasks, 17 task families, and 13 frontier models . Each task is a complete executable unit: a task.yaml, tool interface, data fixture, and grader.py are all required.

Scoring also avoids “looks‑reasonable‑gets‑points”. Claw‑Eval‑Live prioritizes deterministic evidence: correct tool invocation, matching entities and values to ground truth, and genuine state changes. Only semantic dimensions like report organization and summary quality involve a structured LLM judge.

Experimental results are stark: No model exceeds a 70% pass rate, and the gap between the top and bottom models is 22.9 percentage points. Some models have similar pass rates but differ in Overall Completion, indicating they miss a single tool call, a piece of evidence, or a state cleanup.

The most counter‑intuitive finding is that the hardest challenges are not terminal usage. Development/Terminal tasks are near the ceiling for strong models; the real bottlenecks are HR/People, Management/Ops, and cross‑system workflows. HR tasks have an average pass rate of only 6.8%, and WORKFLOW tasks only 12.8%, showing that current Agents struggle with gathering evidence across systems, correctly linking records, and performing required write operations.

Claw‑Eval demonstrates that Agent evaluation cannot rely solely on final results; Claw‑Eval‑Live further shows that benchmarks must move beyond static question banks. Together they split the problem: first confirm the Agent truly performed the work, then ensure the benchmark reflects the most relevant contemporary workflows.

Paper: https://arxiv.org/abs/2604.28139
Leaderboard: https://claw-eval-live.github.io
Code: https://github.com/Claw-Eval-Live/Claw-Eval-Live

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

AI Agents LLM workflow benchmark Evaluation Claw-Eval cross-system

Written by

PaperAgent

Daily updates, analyzing cutting-edge AI research papers

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.