ClawMark: A Living‑World Benchmark for Multi‑Turn, Multi‑Day, Multimodal Coworker Agents
ClawMark is a benchmark of 100 multi‑turn, multi‑day tasks spanning 13 professional scenarios and five stateful sandbox services. Across seven state‑of‑the‑art agent systems, the best weighted score is 75.8, yet the strict task‑success rate is only 20.0%, which underscores how hard end‑to‑end collaborative workflows remain.
Language‑model agents are increasingly used as coworkers that assist users across multi‑day workflows, where the environment changes independently of the agent (new emails, schedule updates, knowledge‑base edits, multimedia evidence). Existing benchmarks evaluate only single, static episodes with text‑only inputs, so they cannot measure how an agent copes with an environment that changes between turns or with evidence that arrives as multimedia.
To address this gap, the authors introduce ClawMark, a benchmark for collaborative agents. It features multi‑turn, multi‑day tasks executed in sandbox services whose state evolves between turns. The benchmark includes 100 tasks covering 13 professional scenarios, runs on five stateful services (file system, email, calendar, knowledge base, spreadsheet), and provides 1,537 deterministic Python checkers that score service states without relying on an LLM‑as‑judge.
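To make the idea of a deterministic checker concrete, here is a minimal sketch of what scoring a service state without an LLM judge could look like. It is not taken from the ClawMark code base: the SandboxState class, the field names, the checker functions, and the example task (moving a meeting and notifying a client) are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SandboxState:
    """Hypothetical snapshot of two sandbox services after a turn."""
    sent_emails: list[dict] = field(default_factory=list)
    calendar_events: list[dict] = field(default_factory=list)


def check_reply_sent(state: SandboxState) -> bool:
    """Pass if a sent email reaches the expected recipient and names the new date."""
    return any(
        mail["to"] == "client@example.com" and "2025-03-14" in mail["body"]
        for mail in state.sent_emails
    )


def check_meeting_moved(state: SandboxState) -> bool:
    """Pass if exactly one 'Design review' event exists and it is on the new date."""
    reviews = [e for e in state.calendar_events if e["title"] == "Design review"]
    return len(reviews) == 1 and reviews[0]["date"] == "2025-03-14"


# Illustrative state: the agent emailed the client but never moved the meeting.
state = SandboxState(
    sent_emails=[{"to": "client@example.com", "body": "Moved to 2025-03-14."}],
    calendar_events=[{"title": "Design review", "date": "2025-03-12"}],
)

# Each checker inspects only the final state, so repeated runs give the same verdict.
results = {chk.__name__: chk(state) for chk in (check_reply_sent, check_meeting_moved)}
```

Because each checker is a pure function of the recorded state, the same agent trajectory always receives the same verdict, which is the property that an LLM‑as‑judge setup cannot guarantee.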
The current release evaluates seven state‑of‑the‑art agent systems. The strongest model attains a weighted score of 75.8, yet its strict task‑success rate is only 20.0%, indicating that completing entire end‑to‑end workflows remains rare. Turn‑level analysis shows performance degradation after the first external environment update, highlighting adaptation to changing environments as a key open challenge.
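The gap between a high weighted score and a low strict success rate is easier to see with a small worked example. The sketch below assumes, since the summary does not give the formulas, that a task's weighted score is the share of checker weight that passed and that strict success requires every checker in the task to pass. The function names, weights, and task data are invented for illustration.

```python
def weighted_score(results: list[tuple[bool, float]]) -> float:
    """Percentage of total checker weight that passed (assumed formula)."""
    total = sum(weight for _, weight in results)
    passed = sum(weight for ok, weight in results if ok)
    return 100.0 * passed / total if total else 0.0


def strict_success(results: list[tuple[bool, float]]) -> bool:
    """A task counts as solved only if it has checkers and all of them pass."""
    return bool(results) and all(ok for ok, _ in results)


# Hypothetical per-task checker outcomes as (passed, weight) pairs.
tasks = {
    "task_a": [(True, 2.0), (True, 1.0), (True, 1.0)],   # everything passes
    "task_b": [(True, 2.0), (True, 1.0), (False, 1.0)],  # one minor miss
    "task_c": [(True, 3.0), (False, 1.0)],               # one minor miss
}

mean_weighted = sum(weighted_score(r) for r in tasks.values()) / len(tasks)
strict_rate = 100.0 * sum(strict_success(r) for r in tasks.values()) / len(tasks)

print(f"mean weighted score: {mean_weighted:.1f}")
print(f"strict success rate: {strict_rate:.1f}%")
```

Under these assumptions an agent that completes most of each workflow still earns a high weighted score, while a single failed checker per task is enough to pull the strict rate down sharply.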
The benchmark data, evaluation framework, and construction pipelines are all open‑sourced to enable reproducible research on collaborative agents.
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
