Why AI Agents Are Unstable: A Systematic Benchmark Dissects Their Weaknesses

LiveClawBench, a new benchmark for LLM agents, reveals that task domain explains only a small fraction of performance variance while a detailed complexity profile accounts for much more, and it uses full‑stack mock workflows and trajectory analysis to diagnose why even top models remain unstable in personal‑assistant tasks.

Machine Heart

LiveClawBench is a benchmark released by Samsung’s large‑model team together with researchers from Peking University, City University of Hong Kong and Hong Kong University of Science and Technology. It evaluates the stability of LLM agents on realistic personal‑assistant workflows rather than merely ranking overall capability.

Motivation

The core question is why the same AI agent can be near‑usable on some tasks yet become unstable on others. The authors argue that the answer lies beyond the traditional "task domain" perspective.

Empirical Findings

For high‑performance models (e.g., Kimi‑K2.7‑Code, GLM‑5.2, GPT‑5.5), task domain explains only about 9.6% of case‑level score variance, while a "complexity profile" explains roughly 18.6%. For medium‑performance models (e.g., DeepSeek‑V4 Flash), domain explains about 12.9% and complexity about 21.1%. For low‑performance models the ordering flips: domain explains about 17.7% and complexity about 16.1%. The pattern suggests that once a model acquires basic cross‑domain ability, the remaining performance differences are driven more by structural complexity within tasks than by domain.
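To make "share of variance explained" concrete, here is a minimal sketch that computes the between-group share of score variance (eta squared) for two different groupings of the same cases. This is an assumption about the kind of statistic involved, not the authors' confirmed method; the function name, the toy scores, and the domain and profile labels are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def variance_explained(scores, labels):
    """Between-group sum of squares divided by total sum of squares."""
    grand_mean = fmean(scores)
    total_ss = sum((s - grand_mean) ** 2 for s in scores)
    if total_ss == 0:
        return 0.0  # all scores identical, nothing to explain

    groups = defaultdict(list)
    for score, label in zip(scores, labels):
        groups[label].append(score)

    between_ss = sum(
        len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values()
    )
    return between_ss / total_ss


# Hypothetical case-level scores with two alternative groupings.
scores = [0.9, 0.8, 0.3, 0.2, 0.85, 0.25]
domains = ["email", "email", "calendar", "calendar", "files", "files"]
profiles = [("A1",), ("A1",), ("B1", "C1"), ("B1", "C1"), ("A1",), ("B1", "C1")]

domain_share = variance_explained(scores, domains)
profile_share = variance_explained(scores, profiles)
```

In this toy data the profile grouping captures more variance than the domain grouping. Note that the unadjusted statistic tends to grow with the number of groups, so a fair comparison between a coarse and a fine grouping would need some adjustment for group count.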

Complexity Profile

A complexity profile is a checklist of challenges a task poses, asking "what exactly makes this task hard" rather than "which domain does it belong to". Examples include cross‑service coordination, hidden‑goal inference, long‑term knowledge maintenance, and runtime state perturbations.

Benchmark Construction

The work has three parts. First, the authors build a personal‑assistant workflow benchmark that evaluates mainstream LLM agents on realistic multi‑service tasks.

Second, they introduce a structured complexity‑factor system that annotates each real‑world task with measurable pressure points.

Third, they deploy a full‑stack executable mock environment and perform trajectory analysis to link final scores, environment state changes, and behavior patterns.

LiveClawBench contains 134 executable cases covering 10 OpenClaw application domains, built on 22 reusable mock services. Tasks are full personal‑assistant workflows that require stateful operations across files, services, and contexts. Success is measured by the final environment state, not merely by API calls.

Stability Metric

The benchmark defines Pass^3, which a case satisfies only when its score exceeds 0.8 in each of three independent runs, to capture high‑quality, repeatable execution. Even a strong model such as GPT‑5.5 achieves only 5.3% Pass^3 on 22 hard tasks, indicating a clear gap between software‑operation capability and reliable personal‑assistant performance.
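A small sketch of how a Pass^3‑style rate could be computed, assuming the metric counts a case as passed only if every one of its three runs scores strictly above 0.8. The function name, the input layout, and the toy scores are hypothetical.

```python
def pass_k(run_scores, threshold=0.8, k=3):
    """Fraction of cases whose first k runs all score above the threshold.

    run_scores maps a case id to the list of scores from independent runs.
    """
    if not run_scores:
        raise ValueError("no cases to evaluate")

    passed = 0
    for case_id, scores in run_scores.items():
        if len(scores) < k:
            raise ValueError(f"{case_id} has fewer than {k} runs")
        if all(score > threshold for score in scores[:k]):
            passed += 1
    return passed / len(run_scores)


# Hypothetical scores for four cases, three runs each.
runs = {
    "case-a": [0.95, 0.90, 0.85],  # stable: passes
    "case-b": [0.95, 0.70, 0.90],  # high mean, but one weak run: fails
    "case-c": [0.81, 0.80, 0.90],  # 0.80 is not strictly above 0.8: fails
    "case-d": [0.20, 0.10, 0.30],  # consistently weak: fails
}

rate = pass_k(runs)
```

Case "case-b" shows why such a metric is stricter than an average: its mean score is above 0.8, yet a single weak run is enough to fail it.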

Three‑Axis Complexity Framework

A: Environment Complexity – cross‑service dependencies (A1) and polluted initial state (A2).

B: Cognitive Burden – implicit goal parsing (B1) and knowledge maintenance (B2).

C: Runtime Adaptability – environment perturbation (C1) and result verification after changes (C2).

Each case is annotated with these factors, allowing the benchmark to pinpoint which pressures cause instability.
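One way to picture such annotations is as a set of factor tags attached to each case, which can then be used to compare outcomes with and without a given factor. The sketch below is illustrative only: the `Case` record, the `pass_rate_by_factor` helper, and the toy cases are hypothetical, and only the factor codes A1 to C2 come from the framework above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Factor(Enum):
    A1 = "cross-service dependencies"
    A2 = "polluted initial state"
    B1 = "implicit goal parsing"
    B2 = "knowledge maintenance"
    C1 = "environment perturbation"
    C2 = "result verification after changes"


@dataclass(frozen=True)
class Case:
    case_id: str
    domain: str
    factors: FrozenSet[Factor]
    passed: bool


def _rate(cases):
    # None signals that no case falls into this group.
    return sum(c.passed for c in cases) / len(cases) if cases else None


def pass_rate_by_factor(cases):
    """Map each factor code to (pass rate with factor, pass rate without)."""
    rates = {}
    for factor in Factor:
        with_f = [c for c in cases if factor in c.factors]
        without_f = [c for c in cases if factor not in c.factors]
        rates[factor.name] = (_rate(with_f), _rate(without_f))
    return rates


# Hypothetical annotated cases.
cases = [
    Case("c1", "email", frozenset({Factor.A1}), True),
    Case("c2", "email", frozenset({Factor.A1, Factor.B1}), False),
    Case("c3", "calendar", frozenset({Factor.B1, Factor.C1}), False),
    Case("c4", "files", frozenset(), True),
]

rates = pass_rate_by_factor(cases)
```

In this toy set, cases tagged B1 never pass while untagged ones always do, which is the kind of contrast a factor‑level breakdown is meant to surface. A real analysis would also need to account for factors that co‑occur.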

Impact of Complexity Factors

Cross‑service dependencies often lead to state inconsistency: the agent may complete a local step but fail to synchronize across services. Implicit‑goal parsing errors cause the agent to execute many steps that do not satisfy the user’s true intent. Runtime‑adaptability issues arise when the environment changes mid‑execution and the agent does not re‑validate the final state.

Why Full‑Stack Mock Over Simple API Mock?

Many existing benchmarks use isolated endpoint stubs, which only test tool‑selection ability. LiveClawBench instead provides a containerized environment where agents interact with browsers, service APIs, file systems, databases, and audit logs. This preserves state transitions, artifact updates, and side‑effects, enabling measurement of whether the user’s goal is truly achieved.
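The difference between checking tool calls and checking final state can be shown with two tiny in‑memory services. This is a deliberately simplified sketch, not LiveClawBench's environment: `MockCalendar`, `MockMail`, both check functions, and the scenario are hypothetical.

```python
class MockCalendar:
    """Hypothetical stateful calendar: event_id -> time and attendees."""

    def __init__(self, events):
        self.events = events

    def move_event(self, event_id, new_time):
        self.events[event_id]["time"] = new_time


class MockMail:
    """Hypothetical stateful mail service that keeps an outbox."""

    def __init__(self):
        self.outbox = []

    def send(self, to, body):
        self.outbox.append({"to": to, "body": body})


def call_log_check(calls):
    """Endpoint-stub style: were the expected tools invoked at all?"""
    return {"calendar.move_event", "mail.send"} <= set(calls)


def final_state_check(calendar, mail, event_id, expected_time):
    """State style: is the event moved AND every attendee told the new time?"""
    event = calendar.events[event_id]
    if event["time"] != expected_time:
        return False
    notified = {m["to"] for m in mail.outbox if expected_time in m["body"]}
    return set(event["attendees"]) <= notified


# Simulated agent run: the event is moved, but one attendee gets a stale time.
calendar = MockCalendar({"standup": {"time": "14:00", "attendees": ["ana", "bo"]}})
mail = MockMail()
calls = []

calendar.move_event("standup", "15:00")
calls.append("calendar.move_event")
mail.send("ana", "Standup moved to 15:00")
calls.append("mail.send")
mail.send("bo", "Standup is at 14:00")
calls.append("mail.send")

stub_verdict = call_log_check(calls)
state_verdict = final_state_check(calendar, mail, "standup", "15:00")
```

Here the call log looks complete, so the stub‑style check accepts the run, while the state check rejects it because one attendee never received the new time.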

Trajectory Analysis

From execution traces the authors extract metrics such as step count, tool‑call density, repeated calls, tool diversity, error‑recovery actions, verification steps, blind‑edit rate, and termination behavior. The results show distinct behavior patterns for different complexity factors. For example, tasks with high environment complexity lead agents to "think more and act more" without necessarily improving success, while on knowledge‑maintenance tasks agents often terminate prematurely after satisfying only the literal request.
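As an illustration of how a few of these metrics might be derived from a trace, the sketch below works on a list of step records. The record layout and the metric definitions are assumptions made for the example (for instance, a "blind edit" is taken to mean a write to a target the agent has not read earlier in the run); the benchmark's own definitions may differ.

```python
def trajectory_metrics(steps):
    """Compute simple behavior metrics from a list of step records.

    Each step is a dict with keys "tool", "target", and "kind",
    where kind is "read" or "write".
    """
    seen_calls = set()
    read_targets = set()
    repeated = writes = blind_writes = 0

    for step in steps:
        call = (step["tool"], step["target"])
        if call in seen_calls:
            repeated += 1  # same tool on the same target as an earlier step
        seen_calls.add(call)

        if step["kind"] == "read":
            read_targets.add(step["target"])
        elif step["kind"] == "write":
            writes += 1
            if step["target"] not in read_targets:
                blind_writes += 1  # edited without having inspected it first

    last_write = max(
        (i for i, s in enumerate(steps) if s["kind"] == "write"), default=-1
    )
    verified = last_write >= 0 and any(
        s["kind"] == "read" for s in steps[last_write + 1:]
    )

    return {
        "step_count": len(steps),
        "tool_diversity": len({s["tool"] for s in steps}),
        "repeated_calls": repeated,
        "blind_edit_rate": blind_writes / writes if writes else 0.0,
        "verified_after_last_write": verified,
    }


# Hypothetical five-step trace.
trace = [
    {"tool": "fs.read", "target": "notes.md", "kind": "read"},
    {"tool": "fs.write", "target": "notes.md", "kind": "write"},
    {"tool": "fs.write", "target": "config.yaml", "kind": "write"},
    {"tool": "fs.write", "target": "config.yaml", "kind": "write"},
    {"tool": "fs.read", "target": "config.yaml", "kind": "read"},
]

metrics = trajectory_metrics(trace)
```

Aggregating such per‑run metrics by complexity factor is one plausible way to connect a score to the behavior that produced it.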

Diagnostic Value

LiveClawBench functions as a diagnostic tool: the benchmark reports overall ability, the complexity‑factor system decomposes task difficulty, the full‑stack mock retains verifiable execution evidence, and trajectory analysis explains how agents behave under specific pressures.

Implications for Future Agents

The analysis suggests concrete improvement directions: more robust multi‑service orchestration, proactive implicit‑goal confirmation, reliable polluted‑state diagnosis, deeper long‑term knowledge maintenance, stricter self‑check of task completion, and tighter safety controls on side‑effects.

Paper title: LiveClawBench: Benchmarking LLM Agents on Complex, Real‑World Assistant Tasks

ArXiv link: https://arxiv.org/abs/2604.13072

GitHub repository: https://github.com/Mosi-AI/LiveClawBench

Trajectory dataset: https://huggingface.co/datasets/Mosi-AI/LiveClawbench-trajectories

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Machine Heart

Professional AI media and industry service platform
