ClawMark: A Living‑World Benchmark for Multi‑Turn, Multi‑Day, Multimodal Coworker Agents

The ClawMark benchmark introduces 100 multi‑turn, multi‑day tasks across 13 professional scenarios and five stateful sandbox services. Of the seven cutting‑edge agent systems evaluated, the best reaches a weighted score of 75.8 but a strict success rate of only 20%, which shows how difficult end‑to‑end collaborative agent work remains.

Machine Learning Algorithms & Natural Language Processing

Language‑model agents are increasingly used as co‑workers that assist users across multi‑day workflows, where the environment changes independently of the agent (new emails, schedule updates, knowledge‑base edits, multimedia evidence). Existing benchmarks, however, evaluate only single static episodes with text‑only inputs, so they cannot measure how an agent handles such changes.

To address this gap, the authors introduce ClawMark, a benchmark for collaborative agents. It features multi‑turn, multi‑day tasks executed in sandbox services whose state evolves between turns. The benchmark includes 100 tasks covering 13 professional scenarios, runs on five stateful services (file system, email, calendar, knowledge base, spreadsheet), and provides 1,537 deterministic Python checkers that score service states without using an LLM‑as‑judge.
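To make the idea of a deterministic state checker concrete, here is a minimal sketch in Python. It is not ClawMark's actual checker API: the `ServiceState` class, its fields, the `apply_external_update` helper, the checker function, and all addresses and dates are hypothetical names and values chosen for illustration. The sketch shows a sandbox state that changes between turns and a checker that inspects the resulting state with ordinary code rather than an LLM judge.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceState:
    """Hypothetical snapshot of two sandbox services (not ClawMark's real schema)."""
    calendar: dict = field(default_factory=dict)      # event name -> date string
    sent_emails: list = field(default_factory=list)   # (recipient, subject) pairs


def apply_external_update(state: ServiceState) -> None:
    """Simulate the environment changing between turns, independently of the agent."""
    state.calendar["design review"] = "2025-03-12"    # meeting moved by someone else


def check_reschedule_notice(state: ServiceState) -> bool:
    """Deterministic check: was the team emailed the current meeting date?"""
    new_date = state.calendar.get("design review")
    if new_date is None:
        return False
    return any(
        recipient == "team@example.com" and new_date in subject
        for recipient, subject in state.sent_emails
    )


state = ServiceState(calendar={"design review": "2025-03-10"})
apply_external_update(state)                          # state evolves between turns
before = check_reschedule_notice(state)               # agent has not reacted yet
state.sent_emails.append(("team@example.com", "Design review moved to 2025-03-12"))
after = check_reschedule_notice(state)                # agent adapted to the update
print(before, after)
```

Because the checker reads only the service state, running it twice on the same state always gives the same verdict, which is the property that makes this style of scoring reproducible.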

The current release evaluates seven state‑of‑the‑art agent systems. The strongest model attains a weighted score of 75.8, yet its strict task‑success rate is only 20.0%, indicating that completing entire end‑to‑end workflows remains rare. Turn‑level analysis shows performance degradation after the first external environment update, highlighting adaptation to changing environments as a key open challenge.
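The gap between the two metrics is easier to see with a small worked example. The summary does not state how ClawMark computes its weighted score, so the sketch below assumes one plausible definition (share of checker weight passed, pooled over all tasks) and treats strict success as "every checker in the task passed". The function names and the per‑checker results are made up for illustration.

```python
def weighted_score(results):
    """Assumed definition: percentage of checker weight passed, pooled over tasks."""
    total = sum(weight for task in results for weight, _ in task)
    passed = sum(weight for task in results for weight, ok in task if ok)
    return 100.0 * passed / total


def strict_success_rate(results):
    """Percentage of tasks in which every checker passed."""
    solved = sum(all(ok for _, ok in task) for task in results)
    return 100.0 * solved / len(results)


# Each task is a list of (weight, passed) pairs; all values are invented.
results = [
    [(1.0, True), (1.0, True), (2.0, True)],    # fully solved
    [(1.0, True), (1.0, True), (2.0, False)],   # one miss makes it a strict failure
    [(1.0, True), (2.0, True), (1.0, False)],
    [(1.0, True), (1.0, False), (2.0, True)],
    [(2.0, True), (1.0, True), (1.0, False)],
]

print(weighted_score(results), strict_success_rate(results))  # 75.0 20.0
```

In this toy data the agent passes most checkers in every task, yet a single miss per task is enough to fail four of the five tasks under the strict criterion, which mirrors the pattern of a high weighted score alongside a low strict success rate.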

All benchmark data, evaluation framework, and construction pipelines are open‑sourced to enable reproducible research on collaborative agents.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

LLM, benchmark, multimodal agents, agent performance, co‑worker agents, sandbox evaluation
Written by

Machine Learning Algorithms & Natural Language Processing
