Claw-Anything: Cross‑Device, Cross‑Time, Cross‑Service Benchmark for Scaling AI Agents (GPT‑5.5 Pass@1 = 34.5%)

Claw-Anything is a large-scale, multi-service benchmark that evaluates AI agents across long-term histories, dozens of applications, and both GUI and CLI interfaces. Even a top-tier closed-source model like GPT-5.5 reaches only a 34.5% pass rate, while fine-tuning open-source models on automatically generated environments raises success rates by 23.7%.

Machine Heart

Background

Always‑on personal AI assistants are expected to move beyond single‑turn tasks toward agents that can continuously perceive, reason, and act across a user’s fragmented digital life, spanning months, multiple services, and devices.

Claw-Anything Benchmark

The benchmark frames its target problem as Scaling Agent Context and constructs a synthetic digital world that mimics real-world complexity. It provides:

200 realistic personal-assistant tasks, covering 10.1 services on average (up to 18) and 191.7k characters of context.

Cross‑device coverage of both GUI and CLI interactions.

Active‑service evaluation, requiring agents to act before explicit user commands.

Task Generation Pipeline

A lightweight LLM‑driven simulator takes a minimal persona description and iteratively samples events from a seed pool, building a coherent multi‑month timeline that includes noisy, contradictory, and irrelevant information. Each generated task includes a persona, a ground‑truth answer, and an executable validator, enabling fully automated evaluation without human involvement.
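The pipeline described above can be pictured with a rough sketch. It is illustrative only and not the authors' implementation: the `Event` and `Task` types, the `SEED_POOL` contents, the `noise_rate` parameter, and the start date are all hypothetical, and a seeded random sampler stands in for the LLM-driven simulator.

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, List


@dataclass
class Event:
    day: date
    service: str
    text: str
    is_noise: bool = False  # marks irrelevant or contradictory entries


@dataclass
class Task:
    persona: str
    timeline: List[Event]
    question: str
    ground_truth: str
    validator: Callable[[str], bool]  # executable check, no human in the loop


# Hypothetical seed pool; the real pool is not described in the article.
SEED_POOL = [
    ("calendar", "Client meeting scheduled"),
    ("email", "Supplier asks to reschedule"),
    ("finance", "Invoice received"),
    ("chat", "Friend shares a link"),
]


def build_timeline(start: date, days: int, rng: random.Random,
                   noise_rate: float = 0.3) -> List[Event]:
    """Sample one event per day from the seed pool, flagging some as noise."""
    timeline = []
    for offset in range(days):
        service, text = rng.choice(SEED_POOL)
        is_noise = rng.random() < noise_rate
        timeline.append(Event(start + timedelta(days=offset), service, text, is_noise))
    return timeline


def make_task(persona: str, seed: int = 0) -> Task:
    """Bundle a persona, a multi-month timeline, an answer, and a validator."""
    rng = random.Random(seed)  # seeded so the same task can be regenerated
    timeline = build_timeline(date(2025, 3, 1), days=90, rng=rng)  # arbitrary start date
    ground_truth = "delegate"
    return Task(
        persona=persona,
        timeline=timeline,
        question="Should I delegate the briefing or do it myself?",
        ground_truth=ground_truth,
        validator=lambda answer: ground_truth in answer.lower(),
    )
```

Because each task carries its own validator, scoring an agent's answer reduces to calling `task.validator(answer)`, which is what makes evaluation fully automatic in this sketch.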

Example Scenario

"I have a meeting with a corporate client on June 4. Should I pay a part‑time assistant Lena to prepare the briefing or do it myself?"

The agent must consult the calendar, email, and finance apps, assess the $180 cost, consider hidden penalties of delaying a supplier meeting, and finally recommend delegating to Lena while respecting the user’s permission boundary (no unauthorized email sending).
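An executable validator for a scenario like this one might look roughly as follows. This is a sketch under stated assumptions: the action-log format (a list of dictionaries with a `tool` key) and the `send_email` tool name are hypothetical, and the keyword check is a deliberately crude stand-in for whatever answer matching the benchmark actually uses.

```python
from typing import Dict, List


def validate_briefing_task(final_answer: str, actions: List[Dict[str, str]]) -> bool:
    """Pass only if the agent recommends delegating to Lena and sent no email.

    `actions` is a hypothetical log of tool calls made by the agent.
    """
    text = final_answer.lower()
    # Crude keyword match; a real validator would need to handle negation.
    recommends_delegation = "lena" in text and "delegate" in text
    # Permission boundary: sending email was never authorized by the user.
    sent_email = any(a.get("tool") == "send_email" for a in actions)
    return recommends_delegation and not sent_email


# The right recommendation still fails if the agent oversteps its authority.
ok = validate_briefing_task(
    "Delegate the briefing to Lena.",
    [{"tool": "read_calendar"}, {"tool": "read_email"}],
)
overstepped = validate_briefing_task(
    "Delegate the briefing to Lena.",
    [{"tool": "read_calendar"}, {"tool": "send_email"}],
)
```

Checking the action log separately from the final answer reflects the point of the example: a correct recommendation does not count if the agent crossed the permission boundary to get there.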

Results

When leading models were tested, even the strongest closed-source model, GPT-5.5, reached only 34.5% pass@1. Roughly two-thirds of tasks failed, typically because of missed deadlines, miscalculated hidden costs, or overstepped authority. Fine-tuning open-source models (e.g., Qwen3.5-27B) on 2,000 automatically generated training environments raised success rates by 23.7%.
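To make the headline number concrete, here is a small sketch of how pass@1 can be computed when each task is attempted once. It assumes the 34.5% figure is taken over the 200 benchmark tasks, which the article does not state explicitly; under that assumption it corresponds to 69 passing tasks. The `pass_at_1` helper is illustrative, not part of the benchmark.

```python
from typing import Sequence


def pass_at_1(results: Sequence[bool]) -> float:
    """Fraction of tasks whose single attempt passed its validator."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


# Assumed breakdown: 69 passes and 131 failures over 200 tasks.
results = [True] * 69 + [False] * 131
rate = pass_at_1(results)
failure_share = 1 - rate  # about 0.655, i.e. roughly two-thirds of tasks
```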

Ablation Studies

Increasing the amount of observable history, the number of apps, or the level of noise consistently lowered success rates, which contradicts the notion that a larger context always helps.

Removing any of the three dimensions (long-term event flow, cross-app collaboration, or mobile device access) caused catastrophic drops in performance, confirming that agents must see the entire digital world to succeed.

Active tasks (agents must proactively surface pending items) were markedly harder than reactive ones, highlighting the gap between "answer‑when‑asked" and "anticipate‑before‑asked" capabilities.

Conclusion

Claw-Anything serves both as a rigorous benchmark exposing the current limitations of AI assistants and as a data‑generation engine that can continuously supply training material. The findings suggest that future progress hinges on expanding agents’ contextual awareness and proactive reasoning while enforcing strict permission boundaries.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
