Jun 2, 2026 · Artificial Intelligence

Why the Best AI Scores Only 45.9% on JobBench’s ‘Dirty Work’ Benchmark

Washington University's JobBench, built on a 1,500-person Workbank survey and 130 real-world tasks, measures how well AI agents handle the chores professionals most want to delegate. Even the strongest configuration, Claude Opus 4.7 + Claude Code, scores just 45.9% overall, far below human-level performance.

AI Benchmark · JobBench · LLM evaluation

13 min read
