
Why AI Agents Are Unstable: A Systematic Benchmark Dissects Their Weaknesses

LiveClawBench, a new benchmark for LLM agents, finds that task domain explains only a small fraction of performance variance, while a detailed complexity profile generally accounts for more. The result helps explain why even state‑of‑the‑art agents remain unstable on personal‑assistant workflows, and the benchmark offers a diagnostic framework for pinpointing and addressing specific failure modes.

Machine Learning Algorithms & Natural Language Processing

Jul 3, 2026

LiveClawBench is a benchmark for evaluating large‑language‑model (LLM) agents on realistic personal‑assistant workflows. It contains 134 executable cases covering 10 application domains, built on 22 reusable full‑stack mock services.

Benchmark Design

Construction of a personal‑assistant workflow benchmark with executable cases.

A complexity‑factor system that decomposes real‑world task difficulty into measurable pressures.

Full‑stack mock environments and trajectory analysis that link model scores, environment state changes, and behavior patterns.

Key Findings

Task domain explains only a limited share of case‑level score variance: about 9.6 % for high‑performance models (Kimi‑K2.7‑Code, GLM‑5.2, GPT‑5.5) and up to 17.7 % for low‑performance models. The complexity profile explains 18.6 % for high‑performance models, 21.1 % for medium‑performance models, and 16.1 % for low‑performance models, so its advantage over task domain is clearest in the high‑performance group.

Even the strongest models achieve a Pass³ rate (score > 0.8 in three runs) of only 5.3 % on the 22 hard tasks, indicating persistent instability in stateful, cross‑service workflows.
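To make the two statistics concrete, here is a minimal sketch of how they could be computed. It rests on two assumptions that this article does not confirm: that Pass³ requires every one of the three runs to score above 0.8, and that "variance explained" is measured as eta squared (between‑group sum of squares divided by total sum of squares). The function names are hypothetical, and the paper may use a different estimator.

```python
from collections import defaultdict
from statistics import mean


def pass_k(run_scores, threshold=0.8):
    """Assumed reading of Pass^k: every run must score above the threshold."""
    return all(score > threshold for score in run_scores)


def pass_k_rate(scores_by_case, threshold=0.8):
    """Fraction of cases whose runs all clear the threshold."""
    passed = sum(pass_k(runs, threshold) for runs in scores_by_case.values())
    return passed / len(scores_by_case)


def variance_explained(scores, groups):
    """Eta squared (an assumed estimator): between-group SS over total SS."""
    grand_mean = mean(scores)
    total_ss = sum((s - grand_mean) ** 2 for s in scores)
    by_group = defaultdict(list)
    for score, group in zip(scores, groups):
        by_group[group].append(score)
    between_ss = sum(
        len(members) * (mean(members) - grand_mean) ** 2
        for members in by_group.values()
    )
    # With zero total variance there is nothing to explain.
    return between_ss / total_ss if total_ss else 0.0
```

Passing domain labels as `groups` gives the domain share, and passing complexity‑profile labels gives the complexity share. Under this reading, a case with one run at 0.7 fails Pass³ even if the other two runs score 0.9, which is why the metric is sensitive to run‑to‑run instability.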

Complexity Factors

The benchmark defines three axes of complexity:

A: Environment Complexity – cross‑service dependencies (A1) and polluted initial state (A2).

B: Cognitive Load – implicit goal parsing (B1) and knowledge maintenance (B2).

C: Runtime Adaptability – environment perturbations (C1) and result verification (C2).

Agents often succeed on simple API calls but fail when they must maintain consistent state across services, infer hidden user intents, or adapt to runtime changes.
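The six factors above can be treated as labels attached to each case, which is what makes a "complexity profile" usable as a grouping variable. The sketch below is illustrative only: the `Case` fields, the identifiers, and the grouping helper are hypothetical and are not taken from the LiveClawBench code.

```python
from dataclasses import dataclass
from enum import Enum


class Factor(Enum):
    A1 = "cross-service dependencies"
    A2 = "polluted initial state"
    B1 = "implicit goal parsing"
    B2 = "knowledge maintenance"
    C1 = "environment perturbations"
    C2 = "result verification"


@dataclass(frozen=True)
class Case:
    case_id: str        # hypothetical identifier
    domain: str         # application domain label
    factors: frozenset  # Factor members active for this case

    def axes(self):
        """Return the sorted axis letters (A, B, C) this case touches."""
        return sorted({factor.name[0] for factor in self.factors})


def group_by_profile(cases):
    """Bucket case ids by their exact combination of factors."""
    buckets = {}
    for case in cases:
        key = tuple(sorted(f.name for f in case.factors))
        buckets.setdefault(key, []).append(case.case_id)
    return buckets
```

Grouping by the full factor combination rather than by domain lets two cases from different domains, for example an email task and a calendar task that both combine A1 and C2, land in the same bucket.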

Full‑Stack Mock Fidelity

Unlike benchmarks that use isolated endpoint stubs, LiveClawBench provides a reproducible full‑stack mock environment where agents interact with browsers, service APIs, file systems, databases, and audit logs. Success is measured by the final software state (e.g., email actually sent, calendar conflict resolved), not merely by correct API selection.
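The difference between grading by API selection and grading by final state can be shown with a toy service. Everything here is a hypothetical stand‑in (the class, its methods, and the validation rule); it is not the benchmark's actual mock service or grader.

```python
from dataclasses import dataclass, field


@dataclass
class MockMailService:
    """Hypothetical mock service with inspectable state and an audit log."""
    sent: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def send(self, to, subject):
        # Every call is logged, whether or not it succeeds.
        self.audit_log.append(("send", to, subject))
        if "@" not in to:
            # The call was made, but the outbox state does not change.
            return {"ok": False, "error": "invalid recipient"}
        self.sent.append({"to": to, "subject": subject})
        return {"ok": True}


def graded_by_call(service):
    """Naive check: did the agent invoke the send API at all?"""
    return any(entry[0] == "send" for entry in service.audit_log)


def graded_by_state(service, to, subject):
    """State check: does the outbox contain the expected email?"""
    return {"to": to, "subject": subject} in service.sent
```

An agent that calls `send("alice", "Report")` passes the call‑based check but fails the state‑based one, because the email never reaches the outbox. That gap is the kind of failure a final‑state grader is meant to catch.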

Trajectory Analysis

LiveClawBench extracts behavior metrics from execution traces, including step count, tool‑call density, repeated calls, tool diversity, error recovery, verification actions, blind‑edit rate, and termination patterns. Results show distinct failure modes per complexity factor:

Cross‑service dependencies cause excessive tool calls without successful state integration.

Implicit‑goal tasks lead to premature termination after meeting surface requirements.

Runtime perturbations result in missing post‑action verification.
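A few of the listed behavior metrics can be derived from a trace with simple counting. The sketch below assumes a trace is a list of `(tool_name, args_dict)` pairs and that verification actions are recognizable by tool name; both are assumptions for illustration, and the tool names are hypothetical rather than taken from the benchmark.

```python
from collections import Counter


def trajectory_metrics(trace, verification_tools=frozenset({"read", "get", "list"})):
    """Summarize a trace given as a list of (tool_name, args_dict) pairs."""
    # Two calls count as identical when tool name and arguments both match.
    calls = [(tool, repr(sorted(args.items()))) for tool, args in trace]
    counts = Counter(calls)
    repeated = sum(n - 1 for n in counts.values() if n > 1)
    tools = [tool for tool, _ in trace]
    last_is_verification = bool(tools) and tools[-1] in verification_tools
    return {
        "step_count": len(trace),
        "repeated_calls": repeated,
        "tool_diversity": len(set(tools)),
        "verification_actions": sum(t in verification_tools for t in tools),
        "ends_with_verification": last_is_verification,
    }
```

In these terms, the failure modes above would show up as a high `repeated_calls` count on cross‑service cases, a low `step_count` on implicit‑goal cases, and `ends_with_verification` being false after a perturbation.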

Diagnostic Value

Each component plays a distinct diagnostic role: the benchmark quantifies overall capability, the complexity‑factor system isolates sources of difficulty, the full‑stack mock preserves verifiable execution evidence, and trajectory analysis explains how different pressures alter agent behavior. The observed patterns suggest concrete directions for improvement: robust multi‑service orchestration, proactive confirmation of implicit goals, reliable detection of polluted states, deeper knowledge maintenance, stricter self‑checking, and safer control of side effects.

References

Paper: LiveClawBench: Benchmarking LLM Agents on Complex, Real‑World Assistant Tasks

arXiv: https://arxiv.org/abs/2604.13072

GitHub: https://github.com/Mosi-AI/LiveClawBench

Trajectory dataset: https://huggingface.co/datasets/Mosi-AI/LiveClawbench-trajectories
