How SkillClaw Enables Collective Evolution of Agent Skills in Real-World Use
SkillClaw introduces a centralized evolution framework that transforms user interactions into structured evidence. An LLM-based Evolver refines, creates, or skips skills based on aggregated success and failure patterns, and nightly validation ensures that only proven improvements are deployed, yielding consistent performance gains across diverse tasks.
Motivation
Current LLM agents (e.g., OpenClaw) rely on reusable Skills to accomplish complex tasks. After a skill is installed from a Skill Hub, the agent can invoke the structured workflow to coordinate tools and perform multi‑step reasoning. However, once deployed, a skill remains essentially static. When an agent encounters failures—such as parameter‑format errors, incorrect tool‑call order, or missing environment configuration—it may eventually discover a fix through trial and error, but that improvement stays within the current session and is never persisted to the skill library or shared with other users.
In essence, each user independently "re‑discovers" the same solution, preventing system‑level knowledge accumulation.
SkillClaw addresses the problem of continuously evolving agent skills during real usage and turning a single user’s experience into a shared capability for the whole system.
Core Idea: Collective Evolution Loop
SkillClaw proposes a centralized evolution architecture that treats multi‑user interactions as the primary signal for skill improvement.
User Interaction → Session Collection → Skill Evolution → Verification → Sync Deployment → Next Interaction

2.1 From Isolated Sessions to Shared Evidence
Each interaction session is converted into a structured trajectory, preserving the full causal chain:
User Prompt → Agent Action → Environment Feedback → … → Final Response

When different users invoke the same skill in varied contexts, the resulting success/failure patterns constitute a natural "ablation experiment" for that skill’s behavior. Aggregating evidence across users reveals a stable direction for evolution.
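The trajectory structure above can be sketched as a simple record type. The field names here are illustrative assumptions for exposition, not the paper's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Step:
    """One causal link in a session trajectory."""
    action: str    # agent action, e.g. a tool call
    feedback: str  # environment feedback for that action

@dataclass
class Trajectory:
    """Structured record of one interaction session."""
    user_prompt: str
    steps: list[Step] = field(default_factory=list)
    final_response: str = ""
    skill_used: Optional[str] = None  # None → session called no skill
    success: bool = False
```

Keeping the full action/feedback chain, rather than only the final outcome, is what lets the Evolver later localize which step of a skill's workflow caused a failure.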
G(s): all sessions that called skill s
G(∅): sessions that called no skill (used to discover missing reusable processes)
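Forming G(s) and G(∅) is a plain bucketing operation. This sketch assumes each session record carries the name of the skill it invoked (or None when no skill was used):

```python
from collections import defaultdict

def group_sessions(sessions):
    """Bucket session records by the skill they invoked.

    Sessions with skill=None land in the G(∅) bucket, which is later
    mined for reusable sub-processes not covered by any existing skill.
    """
    groups = defaultdict(list)
    for session in sessions:
        groups[session.get("skill")].append(session)  # key None == G(∅)
    return groups

sessions = [
    {"skill": "web-search", "success": True},
    {"skill": "web-search", "success": False},
    {"skill": None, "success": True},
]
groups = group_sessions(sessions)
```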
Agentic Evolver: Open‑Reasoning‑Driven Skill Updates
The heart of SkillClaw is the Agentic Evolver, an LLM agent equipped with a structured harness that updates the shared skill library.
Given a skill s and its session group G(s), the Evolver performs one of three operations:
Refine : fix the skill based on failure patterns to improve robustness.
Create : when a reusable sub‑process is not covered by existing skills, generate a new skill.
Skip : keep the skill unchanged when evidence is insufficient.
The Evolver always analyzes both successful and failed sessions. Successful sessions define the skill’s "invariants" (what must be retained), while failed sessions define the "target" (what needs to be corrected). This design avoids the common pitfall of fixing one bug while introducing several new ones.
Algorithm 1 : Convert user sessions to structured evidence, group by skill, let the Evolver infer patterns and generate candidate updates, then apply conservative editing and verification before merging into the shared library.
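The loop in Algorithm 1 can be sketched as follows. Here `evolver_decide` stands in for the LLM Evolver call and `verify` for the night-time validation gate; both signatures are assumptions for illustration, not the paper's interfaces:

```python
def evolve_skill_library(library, groups, evolver_decide, verify):
    """One evolution round over the shared skill library (sketch of Algorithm 1).

    library:        dict mapping skill name -> skill definition
    groups:         dict mapping skill name (or None for G(∅)) -> sessions
    evolver_decide: stand-in for the LLM Evolver; returns
                    ("refine", new_def), ("create", name, new_def), or ("skip",)
    verify:         callable(old_def, new_def, sessions) -> bool
    """
    for skill, sessions in groups.items():
        # The Evolver always sees both sides: successes define invariants,
        # failures define the target of the fix.
        successes = [s for s in sessions if s["success"]]
        failures = [s for s in sessions if not s["success"]]
        decision = evolver_decide(skill, successes, failures)
        if decision[0] == "refine" and skill in library:
            _, candidate = decision
            # Conservative editing: merge only if the candidate verifies.
            if verify(library[skill], candidate, sessions):
                library[skill] = candidate
        elif decision[0] == "create":
            _, name, candidate = decision
            if name not in library and verify(None, candidate, sessions):
                library[name] = candidate
        # "skip": evidence insufficient, leave the skill unchanged
    return library
```

Note that the G(∅) group can only lead to "create" or "skip", since there is no existing skill to refine.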
Night‑time Validation: Deploy Only Proven Improvements
Candidate evolved skills are not deployed immediately. They first enter a night‑time validation stage:
Select relevant validation tasks from the day’s interaction data.
Execute the old skill s and the new candidate skill s' side‑by‑side in a real environment.
Compare overall task success rate and execution stability.
Accept only if s' demonstrably outperforms s; otherwise reject.
This guarantees monotonic deployment: the deployed skill pool never degrades, and users always interact with the best skill pool that passed the previous night’s validation.
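The acceptance rule above reduces to a simple gate over replayed validation tasks. `run_task` is a hypothetical stand-in for executing a skill against a task in the real environment:

```python
def accept_candidate(old_skill, new_skill, tasks, run_task):
    """Night-time validation gate: deploy s' only if it beats s.

    run_task(skill, task) -> bool is assumed to execute the task in a
    real environment and report success. The strict '>' is what keeps
    deployment monotonic: on a tie, the old skill stays deployed.
    """
    old_rate = sum(run_task(old_skill, t) for t in tasks) / len(tasks)
    new_rate = sum(run_task(new_skill, t) for t in tasks) / len(tasks)
    return new_rate > old_rate
```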
Experiment: Six‑Day Evolution on WildClawBench
5.1 Benchmark
WildClawBench contains 60 complex real‑world tasks across six domains. Each task runs in a full Linux container with a toolchain, accepts multimodal inputs (text, code, image, video), imposes strict error penalties (critical errors receive zero score), and requires 15‑50 interaction steps.
5.2 Experimental Setup
Model: Qwen3‑Max
Users: 8 concurrent users
Period: 6 days (6 day‑night cycles)
Mechanism: daytime user interaction → night‑time evolution + validation → next‑day deployment
5.3 Main Result: Steady Performance Gains
Day‑by‑day user‑side results show monotonic improvement from Day 1 (baseline) to Day 6 (best‑validated skill pool). Key observations per domain:
Social Interaction : improvement already on Day 2, indicating a high‑impact workflow bottleneck that, once fixed, benefits all users.
Search Retrieval : stepwise improvement—first fixing input validation, then adding higher‑level retrieval planning.
Creative Synthesis : largest early jump (+88%); bottleneck lies in environment configuration and file handling rather than content generation.
Safety Alignment : later improvement focusing on execution reliability (Git rollback, directory cloning protocols).
5.4 Night‑time Evolution Details
Evolution trajectories differ across domains:
Social Interaction : only task 03_task6 (cross‑department Slack summary) was accepted on Night 1. The update rewrote a descriptive command into a strict ordered workflow, causing a performance surge.
Search Retrieval : two‑stage evolution—Night 1 accepted validate-file-existence (file‑existence pre‑check); Night 3 accepted best-so-far confirmation (current‑best confirmation).
Creative Synthesis : only Night 1’s validate-tmp-workspace-inputs was accepted, which verifies temporary workspace inputs and environment settings.
5.5 Controlled Validation
On three custom queries, a single‑round evolution yielded an average gain of +42.1%:

| Query            | Baseline | After Evolution | Gain   |
|------------------|----------|-----------------|--------|
| Base Extraction  | 21.7%    | 69.6%           | +47.8% |
| Deadline Parsing | 41.1%    | 48.0%           | +6.9%  |
| Save Report      | 28.3%    | 100.0%          | +71.7% |

Insight: when failures stem from missing or incorrect procedural knowledge, skill evolution is especially effective; tasks that rely on subtle reasoning are less sensitive to procedural updates.
Case Studies: How Evolution Changes Agent Behavior
Case 2 – ICCV 2025 Paper Statistics (Precision Boost)
The original agent used heuristic matching of university names. After evolution, the skill adopted a strict "first‑unit" definition based on the official PDF header and performed directed re‑verification on ambiguous cases, dramatically improving extraction precision.
Precise task definition : replace fuzzy matching with a strict structural rule.
Verification reasoning : explicitly re‑check uncertain cases.
Robust extraction : combine automatic parsing with targeted verification.
Case 4 – Multi‑Condition Mobile Phone Selection (Constraint‑Aware Decision)
The original agent relied on loose search and heuristic matching. The evolved skill introduced a structured, constraint‑aware workflow: each condition is explicitly verified, candidates are jointly evaluated, and when no full match exists the system reports partial matches and decomposes them.
Constraint‑aware reasoning : explicit multi‑condition verification before decision.
Grounded retrieval : prioritize authoritative sources over generic results.
Calibrated decision : acknowledge uncertainty and avoid over‑interpreting partial matches.
https://arxiv.org/pdf/2604.08377
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
Github: https://github.com/AMAP-ML/SkillClaw
