Why the Best AI Scores Only 45.9% on JobBench’s ‘Dirty Work’ Benchmark
Washington University’s JobBench benchmark, built on a 1,500‑person Workbank survey and 130 real‑world tasks, measures how well AI agents handle the chores professionals most want to delegate. Even the strongest setup, Claude Opus 4.7 + Claude Code, achieves just 45.9% overall, far below human‑level performance.
