How SkillClaw Enables Collective Evolution of Agent Skills in Real-World Use
SkillClaw introduces a centralized evolution framework that transforms user interactions into structured evidence. An LLM-based Evolver refines, creates, or skips skills based on aggregated success and failure patterns, and nightly validation ensures that only proven improvements are deployed, yielding consistent performance gains across diverse tasks.
Motivation
Current LLM agents (e.g., OpenClaw) rely on reusable Skills to accomplish complex tasks. After a skill is installed from a Skill Hub, the agent can invoke the structured workflow to coordinate tools and perform multi‑step reasoning. However, once deployed, a skill remains essentially static. When an agent encounters failures—such as parameter‑format errors, incorrect tool‑call order, or missing environment configuration—it may eventually discover a fix through trial and error, but that improvement stays within the current session and is never persisted to the skill library or shared with other users.
In essence, each user independently "re‑discovers" the same solution, preventing system‑level knowledge accumulation.
SkillClaw addresses the problem of continuously evolving agent skills during real usage and turning a single user’s experience into a shared capability for the whole system.
Core Idea: Collective Evolution Loop
SkillClaw proposes a centralized evolution architecture that treats multi‑user interactions as the primary signal for skill improvement.
User Interaction → Session Collection → Skill Evolution → Verification → Sync Deployment → Next Interaction

2.1 From Isolated Sessions to Shared Evidence
Each interaction session is converted into a structured trajectory, preserving the full causal chain:
User Prompt → Agent Action → Environment Feedback → … → Final Response

When different users invoke the same skill in varied contexts, the resulting success/failure patterns constitute a natural "ablation experiment" for that skill’s behavior. Aggregating evidence across users reveals a stable direction for evolution.
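The trajectory structure above can be sketched as a simple record type. The field names here are illustrative assumptions for exposition, not the paper's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Step:
    """One causal link in a session trajectory."""
    action: str    # agent action, e.g. a tool call
    feedback: str  # environment feedback for that action

@dataclass
class Trajectory:
    """Structured record of one interaction session."""
    user_prompt: str
    steps: list[Step] = field(default_factory=list)
    final_response: str = ""
    skill_used: Optional[str] = None  # None → session called no skill
    success: bool = False
```

Keeping the full action/feedback chain, rather than only the final outcome, is what lets the Evolver later localize which step of a skill's workflow caused a failure.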
G(s): all sessions that called skill s
G(∅): sessions that called no skill (used to discover missing reusable processes)
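Forming G(s) and G(∅) is a plain bucketing operation. This sketch assumes each session record carries the name of the skill it invoked (or None when no skill was used):

```python
from collections import defaultdict

def group_sessions(sessions):
    """Bucket session records by the skill they invoked.

    Sessions with skill=None land in the G(∅) bucket, which is later
    mined for reusable sub-processes not covered by any existing skill.
    """
    groups = defaultdict(list)
    for session in sessions:
        groups[session.get("skill")].append(session)  # key None == G(∅)
    return groups

sessions = [
    {"skill": "web-search", "success": True},
    {"skill": "web-search", "success": False},
    {"skill": None, "success": True},
]
groups = group_sessions(sessions)
```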
Agentic Evolver: Open‑Reasoning‑Driven Skill Updates
The heart of SkillClaw is the Agentic Evolver, an LLM agent equipped with a structured harness that updates the shared skill library.
Given a skill s and its session group G(s), the Evolver performs one of three operations:
Refine : fix the skill based on failure patterns to improve robustness.
Create : when a reusable sub‑process is not covered by existing skills, generate a new skill.
Skip : keep the skill unchanged when evidence is insufficient.
The Evolver always analyzes both successful and failed sessions. Successful sessions define the skill’s "invariants" (what must be retained), while failed sessions define the "target" (what needs to be corrected). This design avoids the common pitfall of fixing one bug while introducing several new ones.
Algorithm 1 : Convert user sessions to structured evidence, group by skill, let the Evolver infer patterns and generate candidate updates, then apply conservative editing and verification before merging into the shared library.
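The loop in Algorithm 1 can be sketched as follows. Here `evolver_decide` stands in for the LLM Evolver call and `verify` for the night-time validation gate; both signatures are assumptions for illustration, not the paper's interfaces:

```python
def evolve_skill_library(library, groups, evolver_decide, verify):
    """One evolution round over the shared skill library (sketch of Algorithm 1).

    library:        dict mapping skill name -> skill definition
    groups:         dict mapping skill name (or None for G(∅)) -> sessions
    evolver_decide: stand-in for the LLM Evolver; returns
                    ("refine", new_def), ("create", name, new_def), or ("skip",)
    verify:         callable(old_def, new_def, sessions) -> bool
    """
    for skill, sessions in groups.items():
        # The Evolver always sees both sides: successes define invariants,
        # failures define the target of the fix.
        successes = [s for s in sessions if s["success"]]
        failures = [s for s in sessions if not s["success"]]
        decision = evolver_decide(skill, successes, failures)
        if decision[0] == "refine" and skill in library:
            _, candidate = decision
            # Conservative editing: merge only if the candidate verifies.
            if verify(library[skill], candidate, sessions):
                library[skill] = candidate
        elif decision[0] == "create":
            _, name, candidate = decision
            if name not in library and verify(None, candidate, sessions):
                library[name] = candidate
        # "skip": evidence insufficient, leave the skill unchanged
    return library
```

Note that the G(∅) group can only lead to "create" or "skip", since there is no existing skill to refine.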
Night‑time Validation: Deploy Only Proven Improvements
Candidate evolved skills are not deployed immediately. They first enter a night‑time validation stage:
Select relevant validation tasks from the day’s interaction data.
Execute the old skill s and the new candidate skill s' side‑by‑side in a real environment.
Compare overall task success rate and execution stability.
Accept only if s' demonstrably outperforms s; otherwise reject.
This guarantees monotonic deployment: the deployed skill pool never degrades, and users always interact with the best skill pool that passed the previous night’s validation.
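The acceptance rule above reduces to a simple gate over replayed validation tasks. `run_task` is a hypothetical stand-in for executing a skill against a task in the real environment:

```python
def accept_candidate(old_skill, new_skill, tasks, run_task):
    """Night-time validation gate: deploy s' only if it beats s.

    run_task(skill, task) -> bool is assumed to execute the task in a
    real environment and report success. The strict '>' is what keeps
    deployment monotonic: on a tie, the old skill stays deployed.
    """
    old_rate = sum(run_task(old_skill, t) for t in tasks) / len(tasks)
    new_rate = sum(run_task(new_skill, t) for t in tasks) / len(tasks)
    return new_rate > old_rate
```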
Experiment: Six‑Day Evolution on WildClawBench
5.1 Benchmark
WildClawBench contains 60 complex real‑world tasks across six domains. Each task runs in a full Linux container with a toolchain, accepts multimodal inputs (text, code, image, video), imposes strict error penalties (critical errors receive zero score), and requires 15‑50 interaction steps.
5.2 Experimental Setup
Model: Qwen3‑Max
Users: 8 concurrent users
Period: 6 days (6 day‑night cycles)
Mechanism: daytime user interaction → night‑time evolution + validation → next‑day deployment
5.3 Main Result: Steady Performance Gains
Day‑by‑day user‑side results show monotonic improvement from Day 1 (baseline) to Day 6 (best‑validated skill pool). Key observations per domain:
Social Interaction : improvement already on Day 2, indicating a high‑impact workflow bottleneck that, once fixed, benefits all users.
Search Retrieval : stepwise improvement—first fixing input validation, then adding higher‑level retrieval planning.
Creative Synthesis : largest early jump (+88%); bottleneck lies in environment configuration and file handling rather than content generation.
Safety Alignment : later improvement focusing on execution reliability (Git rollback, directory cloning protocols).
5.4 Night‑time Evolution Details
Evolution trajectories differ across domains:
Social Interaction : only task 03_task6 (cross‑department Slack summary) was accepted on Night 1. The update rewrote a descriptive command into a strict ordered workflow, causing a performance surge.
Search Retrieval : two‑stage evolution—Night 1 accepted validate-file-existence (file‑existence pre‑check); Night 3 accepted best-so-far confirmation (current‑best confirmation).
Creative Synthesis : only Night 1’s validate-tmp-workspace-inputs was accepted, which verifies temporary workspace inputs and environment settings.
5.5 Controlled Validation
On three custom queries, a single‑round evolution yielded an average gain of +42.1%:

| Query            | Baseline | After Evolution | Gain   |
|------------------|----------|-----------------|--------|
| Base Extraction  | 21.7%    | 69.6%           | +47.8% |
| Deadline Parsing | 41.1%    | 48.0%           | +6.9%  |
| Save Report      | 28.3%    | 100.0%          | +71.7% |

Insight: when failures stem from missing or incorrect procedural knowledge, skill evolution is especially effective; tasks that rely on subtle reasoning are less sensitive to procedural updates.
Case Studies: How Evolution Changes Agent Behavior
Case 2 – ICCV 2025 Paper Statistics (Precision Boost)
The original agent used heuristic matching of university names. After evolution, the skill adopted a strict "first‑unit" definition based on the official PDF header and performed directed re‑verification on ambiguous cases, dramatically improving extraction precision.
Precise task definition : replace fuzzy matching with a strict structural rule.
Verification reasoning : explicitly re‑check uncertain cases.
Robust extraction : combine automatic parsing with targeted verification.
Case 4 – Multi‑Condition Mobile Phone Selection (Constraint‑Aware Decision)
The original agent relied on loose search and heuristic matching. The evolved skill introduced a structured, constraint‑aware workflow: each condition is explicitly verified, candidates are jointly evaluated, and when no full match exists the system reports partial matches and decomposes them.
Constraint‑aware reasoning : explicit multi‑condition verification before decision.
Grounded retrieval : prioritize authoritative sources over generic results.
Calibrated decision : acknowledge uncertainty and avoid over‑interpreting partial matches.
https://arxiv.org/pdf/2604.08377
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
Github: https://github.com/AMAP-ML/SkillClaw
