
Why the Best AI Scores Only 45.9% on JobBench’s ‘Dirty Work’ Benchmark

Washington University’s JobBench benchmark, built on a Workbank survey of over 1,500 professionals and 130 real‑world tasks, measures how well AI agents handle the chores professionals most want to delegate. Even the strongest configuration, Claude Opus 4.7 + Claude Code, achieves just 45.9% overall, far below human‑level performance.

SuanNi

Jun 2, 2026


Washington University and a consortium of research groups introduced JobBench, a benchmark that shifts AI evaluation from pure economic value to the tasks workers actually want to hand off. The benchmark draws on a Workbank survey of over 1,500 professionals who rated each O*NET duty on a 1‑5 willingness‑to‑delegate scale.

Human Preference First

JobBench selects 35 occupations where the average willingness score exceeds 3 and economic exposure is high. For each occupation, the highest‑willingness duties are filtered for digitalizability, evaluability, and supporting evidence, forming a pool of task designs.
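The selection rule above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual pipeline: the occupation names, the ratings, the 0-to-1 exposure score, and the 0.5 exposure cutoff are all hypothetical, since the article only says that exposure must be "high". The one threshold taken from the text is the average willingness above 3.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Occupation:
    name: str
    willingness_ratings: list[int]  # 1-5 willingness-to-delegate ratings
    exposure: float                 # hypothetical 0-1 economic-exposure score


def select_occupations(occupations, min_willingness=3.0, min_exposure=0.5):
    """Keep occupations whose mean willingness exceeds the threshold
    and whose exposure meets the (assumed) cutoff."""
    return [
        o.name
        for o in occupations
        if mean(o.willingness_ratings) > min_willingness
        and o.exposure >= min_exposure
    ]


# Made-up sample data, for illustration only.
sample = [
    Occupation("occupation_a", [4, 5, 3, 4], 0.7),  # mean 4.0, high exposure
    Occupation("occupation_b", [2, 3, 3, 2], 0.9),  # mean 2.5, dropped
    Occupation("occupation_c", [5, 4, 4, 4], 0.2),  # low exposure, dropped
]

print(select_occupations(sample))  # ['occupation_a']
```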

Workplace Reasoning

Each of the 130 tasks (covering 35 occupations) includes a query, a set of heterogeneous reference files, binary criteria, and a rubric that encodes a step‑by‑step reasoning chain. A task’s rubric must be self‑contained, binary, objective, and unambiguous; all nodes must pass for the task to earn points, mirroring expert peer review.

Example: a journalist task requires cross‑checking water‑quality CSV data, EPA guidance, and monitoring reports to identify exceedances and draft a multi‑part editorial plan. Success depends on locating the correct source, not just producing a clean article.

Chain Scoring

JobBench contains 4,631 binary criteria (≈35.6 per task). Ambiguous criteria were found to make LLM judges diverge, so the rubrics are written to be precise. If any node fails, the entire chain scores zero.
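A short sketch shows how strict the all-or-nothing rule is compared with partial credit. The function names are hypothetical, and the article does not say how chain scores are aggregated into the overall percentage, so only the scoring of a single chain is shown.

```python
def chain_score(results: list[bool]) -> int:
    """All-or-nothing: 1 only if every binary criterion passed, else 0."""
    return int(bool(results) and all(results))


def fraction_passed(results: list[bool]) -> float:
    """Partial-credit view, shown only for contrast."""
    if not results:
        return 0.0
    return sum(results) / len(results)


# A chain of 36 criteria in which exactly one fails.
results = [True] * 35 + [False]

print(chain_score(results))                # 0
print(round(fraction_passed(results), 3))  # 0.972
```

Under partial credit this run would look nearly perfect; under chain scoring it earns nothing, which is one reason overall scores stay low.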

Task Curation

Tasks pass three quality gates: automated audit of instruction‑reference consistency, annotator refinement, and trial runs with multiple agents where only tasks with ≥90% joint rubric pass rate are kept. Ultimately 71% of candidate tasks survive, and the surviving rubrics achieve a 95.4% joint pass rate.

Frontier Gap

Evaluating 36 model‑framework configurations, the top performer Claude Opus 4.7 + Claude Code scores 45.9%, while GPT‑5.5 + Codex CLI reaches 42.7% and GPT‑5.4 + Codex CLI 38.9%. All non‑Claude/GPT models score below 19 points; the weakest, Grok 4.2 Fast, scores 4.38.

Inference cost rises with score. Claude Opus 4.7 costs ≈ $210 for the full suite, about five times GPT‑5.5’s $44, for a lead of only 3.2 points, so GPT‑5.5 offers the best cost‑performance at 42.7 points. The simplest viable setup, GPT‑5.3 + Codex CLI, costs $32.
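The cost comparison can be checked with simple arithmetic. The scores and costs below come from the figures quoted in this article. Points per dollar is an assumed way to express cost-performance, as the article does not define the metric. GPT‑5.3 is left out because its score is not given.

```python
# (score in points, full-suite cost in USD), as quoted in the article.
configs = {
    "Claude Opus 4.7 + Claude Code": (45.9, 210.0),
    "GPT-5.5 + Codex CLI": (42.7, 44.0),
}

points_per_dollar = {
    name: score / cost for name, (score, cost) in configs.items()
}

for name, value in points_per_dollar.items():
    print(f"{name}: {value:.2f} points per dollar")
# Claude Opus 4.7 + Claude Code: 0.22 points per dollar
# GPT-5.5 + Codex CLI: 0.97 points per dollar

opus_score, opus_cost = configs["Claude Opus 4.7 + Claude Code"]
gpt_score, gpt_cost = configs["GPT-5.5 + Codex CLI"]

cost_ratio = opus_cost / gpt_cost
# Extra dollars paid per extra point when moving from GPT-5.5 to Opus 4.7.
marginal_cost_per_point = (opus_cost - gpt_cost) / (opus_score - gpt_score)

print(f"cost ratio: {cost_ratio:.1f}x")                                 # 4.8x
print(f"marginal cost: ${marginal_cost_per_point:.0f} per extra point")  # $52
```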

Insights

Higher inference investment improves scores; GPT‑5.4’s score rises from ~31.9 to 38.9 as compute increases, but even maximal investment falls far short of full marks. Model‑framework choice matters: swapping Claude Sonnet 4.6 from Claude Code to OpenClaw drops its score from 36.9 to 30.6.

An analysis of 3,516 arXiv abstracts and 2,283 YC startup descriptions, mapped to JobBench occupations, shows a negative correlation between community attention and model capability (Pearson r ≈ ‑0.15 for papers, ‑0.34 for YC startups). The “R&D quadrant” (high willingness, low capability) draws 1.5 times more attention than the “Sweet Zone” (high willingness, high capability).
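For readers unfamiliar with the statistic, here is a minimal sketch of a Pearson correlation between per-occupation attention and capability. The input numbers are invented for illustration and will not reproduce the reported values; the article does not describe how abstracts and startups were mapped to occupations, so that step is omitted.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (spread_x * spread_y)


# Hypothetical per-occupation values, NOT data from the study.
attention = [120.0, 80.0, 60.0, 30.0]   # e.g. number of mapped abstracts
capability = [10.0, 20.0, 25.0, 40.0]   # e.g. benchmark score for the occupation

r = pearson(attention, capability)
print(f"r = {r:.2f}")  # negative: more attention where capability is lower
```

A value near zero, such as the reported ‑0.15, indicates a weak relationship, while ‑0.34 is a somewhat stronger but still moderate one.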

JobBench’s overarching goal is to reorient AI from pure economic substitution toward augmenting professionals by reliably handling the dirty, time‑consuming tasks they most want to offload.


