Why AI Agents Are Unstable: A Systematic Benchmark Dissects Their Weaknesses
LiveClawBench, a new benchmark for LLM agents, finds that task domain explains only a small fraction of performance variance, while a detailed complexity profile accounts for noticeably more among stronger models. It uses full‑stack mock workflows and trajectory analysis to diagnose why even top models remain unstable on personal‑assistant tasks.
LiveClawBench is a benchmark released by Samsung’s large‑model team together with researchers from Peking University, City University of Hong Kong and Hong Kong University of Science and Technology. It evaluates the stability of LLM agents on realistic personal‑assistant workflows rather than merely ranking overall capability.
Motivation
The core question is why the same AI agent can be near‑usable on some tasks yet become unstable on others. The authors argue that the answer lies beyond the traditional "task domain" perspective.
Empirical Findings
For high‑performance models (e.g., Kimi‑K2.7‑Code, GLM‑5.2, GPT‑5.5), task domain explains only about 9.6% of case‑level score variance, while a "complexity profile" explains roughly 18.6%. For medium‑performance models (e.g., DeepSeek‑V4 Flash), domain explains about 12.9% and complexity about 21.1%. For low‑performance models, domain explains about 17.7% and complexity about 16.1%, so only in this tier does domain carry slightly more weight than complexity. This indicates that once a model acquires basic cross‑domain ability, the remaining performance differences are driven by deeper structural complexity within tasks rather than by the domain label.
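To make "share of variance explained" concrete, here is a minimal sketch using eta‑squared (between‑group sum of squares divided by total sum of squares). This is an assumption for illustration: the article does not say which estimator the authors use, and the function name `variance_explained` is hypothetical.

```python
from collections import defaultdict


def variance_explained(scores, labels):
    """Eta-squared: share of score variance attributable to a grouping.

    scores: per-case scores; labels: the group of each case (for example
    its task domain, or a string encoding its complexity profile).
    """
    n = len(scores)
    grand_mean = sum(scores) / n
    ss_total = sum((s - grand_mean) ** 2 for s in scores)
    if ss_total == 0:
        return 0.0  # all scores identical, nothing to explain

    groups = defaultdict(list)
    for score, label in zip(scores, labels):
        groups[label].append(score)

    # Between-group sum of squares: how far each group mean sits from
    # the grand mean, weighted by group size.
    ss_between = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in groups.values()
    )
    return ss_between / ss_total
```

Running the same function once with domain labels and once with complexity‑profile labels gives two numbers that can be compared in the way the percentages above are compared.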
Complexity Profile
A complexity profile is a checklist of challenges a task poses, asking "what exactly makes this task hard" rather than "which domain does it belong to". Examples include cross‑service coordination, hidden‑goal inference, long‑term knowledge maintenance, and runtime state perturbations.
Benchmark Construction
The benchmark is built in three parts:
Build a personal‑assistant workflow benchmark that evaluates mainstream LLM agents on realistic multi‑service tasks.
Introduce a structured complexity‑factor system that annotates each real‑world task with measurable pressure points.
Deploy a full‑stack executable mock environment and perform trajectory analysis to link final scores, environment state changes, and behavior patterns.
LiveClawBench contains 134 executable cases covering 10 OpenClaw application domains, built on 22 reusable mock services. Tasks are full personal‑assistant workflows that require stateful operations across files, services, and contexts. Success is measured by the final environment state, not merely by API calls.
Stability Metric
The benchmark defines Pass^3 to capture high‑quality, repeatable execution: a case counts as passed only if its score exceeds 0.8 in each of three independent runs. Even strong models such as GPT‑5.5 achieve only 5.3% Pass^3 on 22 hard tasks, indicating a clear gap between software‑operation capability and reliable personal‑assistant performance.
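The metric can be sketched in a few lines. This is an illustrative reading of the definition above, not the benchmark's own code; the function name `pass_k` and the input layout (a dict from case ID to a list of run scores) are assumptions.

```python
def pass_k(run_scores, threshold=0.8, k=3):
    """Fraction of cases whose score exceeds `threshold` in all of k runs.

    run_scores: dict mapping a case ID to its scores from independent runs.
    """
    passed = 0
    for case_id, scores in run_scores.items():
        if len(scores) < k:
            raise ValueError(f"case {case_id!r} has fewer than {k} runs")
        # Strict inequality: a score of exactly 0.8 does not pass.
        if all(s > threshold for s in scores[:k]):
            passed += 1
    return passed / len(run_scores)
```

Because a single weak run fails the whole case, this is much stricter than averaging the three scores, which is why it separates occasional success from repeatable success.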
Three‑Axis Complexity Framework
A: Environment Complexity – cross‑service dependencies (A1) and polluted initial state (A2).
B: Cognitive Burden – implicit goal parsing (B1) and knowledge maintenance (B2).
C: Runtime Adaptability – environment perturbation (C1) and result verification after changes (C2).
Each case is annotated with these factors, allowing the benchmark to pinpoint which pressures cause instability.
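As a rough sketch of what such an annotation could look like in code, the six factors above can be modeled as an enum attached to each case. The class names, field names, and the aggregation helper are hypothetical and do not reflect the benchmark's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Factor(Enum):
    A1 = "cross-service dependencies"
    A2 = "polluted initial state"
    B1 = "implicit goal parsing"
    B2 = "knowledge maintenance"
    C1 = "environment perturbation"
    C2 = "result verification after changes"


@dataclass(frozen=True)
class Case:
    case_id: str
    domain: str
    factors: frozenset  # set of Factor members annotated on this case
    score: float


def mean_score_by_factor(cases):
    """Average score over the cases that carry each factor."""
    result = {}
    for factor in Factor:
        scores = [c.score for c in cases if factor in c.factors]
        if scores:  # skip factors that no case carries
            result[factor.name] = sum(scores) / len(scores)
    return result
```

Grouping scores by factor rather than by domain is what lets an analysis ask which pressure, not which topic, coincides with low scores.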
Impact of Complexity Factors
Cross‑service dependencies often lead to state inconsistency: the agent may complete a local step but fail to synchronize across services. Implicit‑goal parsing errors cause the agent to execute many steps that do not satisfy the user’s true intent. Runtime‑adaptability issues arise when the environment changes mid‑execution and the agent does not re‑validate the final state.
Why Full‑Stack Mock Over Simple API Mock?
Many existing benchmarks use isolated endpoint stubs, which only test tool‑selection ability. LiveClawBench instead provides a containerized environment where agents interact with browsers, service APIs, file systems, databases, and audit logs. This preserves state transitions, artifact updates, and side‑effects, enabling measurement of whether the user’s goal is truly achieved.
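The difference between checking calls and checking state can be illustrated with a toy stateful service. Everything here (`MockCalendar`, `called_tool`, `goal_reached`) is a hypothetical stand‑in, far simpler than the containerized environment the article describes.

```python
class MockCalendar:
    """Toy stateful service that keeps both its state and an audit log."""

    def __init__(self):
        self.events = {}      # event_id -> title
        self.audit_log = []   # (tool_name, event_id) in call order

    def create_event(self, event_id, title):
        self.events[event_id] = title
        self.audit_log.append(("create_event", event_id))

    def delete_event(self, event_id):
        self.events.pop(event_id, None)
        self.audit_log.append(("delete_event", event_id))


def called_tool(service, tool_name):
    """Call-based check: was the tool invoked at least once?"""
    return any(name == tool_name for name, _ in service.audit_log)


def goal_reached(service, expected_titles):
    """State-based check: does the final state hold exactly these events?"""
    return sorted(service.events.values()) == sorted(expected_titles)
```

If an agent creates an event and later deletes it by mistake, `called_tool(cal, "create_event")` is still true while `goal_reached(cal, ["Standup"])` is false, which is the gap a stateless endpoint stub cannot observe.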
Trajectory Analysis
From execution traces the authors extract metrics such as step count, tool‑call density, repeated calls, tool diversity, error‑recovery actions, verification steps, blind‑edit rate, and termination behavior. The results show distinct behavior patterns for different complexity factors. For example, tasks with high environment complexity lead agents to "think more and act more" without necessarily improving success, while on knowledge‑maintenance tasks agents often terminate prematurely once the literal request is satisfied.
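A few of these metrics are simple to compute from a trace. The sketch below assumes a trace is a list of dicts with `tool`, `args`, and an optional `error` flag; that layout and the function name are assumptions, not the format of the published trajectory dataset.

```python
import json
from collections import Counter


def trajectory_metrics(steps):
    """Compute simple behavior metrics from one execution trace."""
    # Canonicalize arguments so identical calls compare equal
    # regardless of key order.
    calls = [
        (s["tool"], json.dumps(s.get("args", {}), sort_keys=True))
        for s in steps
    ]
    counts = Counter(calls)
    return {
        "step_count": len(steps),
        # Every occurrence of an identical call beyond the first.
        "repeated_calls": sum(n - 1 for n in counts.values()),
        # Number of distinct tools used.
        "tool_diversity": len({tool for tool, _ in calls}),
        "error_steps": sum(1 for s in steps if s.get("error")),
    }
```

Metrics such as blind‑edit rate or verification steps need tool‑specific knowledge (which tools read, which write) and are left out of this sketch.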
Diagnostic Value
LiveClawBench functions as a diagnostic tool: the benchmark reports overall ability, the complexity‑factor system decomposes task difficulty, the full‑stack mock retains verifiable execution evidence, and trajectory analysis explains how agents behave under specific pressures.
Implications for Future Agents
The analysis suggests concrete improvement directions: more robust multi‑service orchestration, proactive implicit‑goal confirmation, reliable polluted‑state diagnosis, deeper long‑term knowledge maintenance, stricter self‑check of task completion, and tighter safety controls on side‑effects.
Paper title: LiveClawBench: Benchmarking LLM Agents on Complex, Real‑World Assistant Tasks
ArXiv link: https://arxiv.org/abs/2604.13072
GitHub repository: https://github.com/Mosi-AI/LiveClawBench
Trajectory dataset: https://huggingface.co/datasets/Mosi-AI/LiveClawbench-trajectories
