
SkillOS: How Skill Governance Powers Self‑Evolving AI Agents

SkillOS addresses the one‑off nature of current LLM agents by introducing a closed‑loop system where a trainable Skill Curator continuously extracts, updates, and manages reusable skills from execution traces, leading to measurable gains in success rates, efficiency, and cross‑task generalization.

PaperAgent

May 11, 2026


Current LLM‑based agents act as one‑off problem solvers: they start from scratch on each new task and cannot accumulate experience. A self‑evolving agent instead runs a closed loop:

1. Receive task → 2. Retrieve relevant skills → 3. Execute task → 4. Extract experience from trajectory → 5. Update skill repository → back to step 1
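
The loop above can be sketched in a few lines of Python. This is a toy illustration rather than SkillOS code: `SkillRepo`, `retrieve`, `execute`, `curate`, and `run_stream` are hypothetical names, the retriever is plain keyword overlap, and the executor and curator are stubs standing in for the LLM policies.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    cues: set[str]  # retrieval cues (assumption: plain keywords)
    body: str


@dataclass
class SkillRepo:
    skills: dict[str, Skill] = field(default_factory=dict)

    def retrieve(self, task: str, top_k: int = 2) -> list[Skill]:
        """Rank stored skills by keyword overlap with the task text (toy retriever)."""
        words = set(task.lower().split())
        scored = [(len(s.cues & words), s) for s in self.skills.values()]
        scored = [(n, s) for n, s in scored if n > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [s for _, s in scored[:top_k]]


def execute(task: str, skills: list[Skill]) -> tuple[list[str], bool]:
    """Stub executor: 'succeeds' only when at least one skill was retrieved."""
    trajectory = [f"observe: {task}"] + [f"apply: {s.name}" for s in skills]
    return trajectory, bool(skills)


def curate(repo: SkillRepo, task: str, trajectory: list[str], success: bool) -> None:
    """Stub curator: insert one skill per unseen task type, keyed by the first word."""
    key = task.lower().split()[0]
    if key not in repo.skills:
        cues = set(task.lower().split())
        repo.skills[key] = Skill(name=key, cues=cues, body="\n".join(trajectory))


def run_stream(tasks: list[str]) -> list[bool]:
    repo = SkillRepo()
    outcomes = []
    for task in tasks:                               # 1. receive task
        skills = repo.retrieve(task)                 # 2. retrieve relevant skills
        trajectory, success = execute(task, skills)  # 3. execute task
        curate(repo, task, trajectory, success)      # 4-5. extract experience, update repo
        outcomes.append(success)
    return outcomes


print(run_stream(["heat the mug", "cool the mug", "heat the plate"]))
```

The first task runs against an empty repository and fails in this toy setup; later tasks benefit from what the curator stored, which is the accumulation effect the loop is meant to produce.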

The critical bottleneck in this loop is Skill Curation: extracting high‑quality, reusable skills from massive interaction traces.

Existing approaches have three main limitations: manual curation (e.g., Anthropic’s Skills Repository), which does not scale; heuristic, rule‑based memory operations that receive no feedback from downstream performance; and short‑horizon training, which provides only sparse signals for complex operations such as update or delete.

System Architecture: Decoupled Executor and Curator

SkillOS adopts a dual‑module design:

Agent Executor (π_L): executes actions based on current task observations and retrieved skills; its parameters are frozen.

Skill Curator (π_S): evaluates skill quality from execution traces and performs Insert/Update/Delete operations; it is trainable via GRPO.
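
As a rough illustration of the curator's action space, the sketch below applies Insert/Update/Delete operations to an in-memory repository and counts how many were legal. The `CuratorOp` type, the `apply_ops` helper, and the specific legality rules (no duplicate inserts, no updates or deletes of missing skills, no empty content) are assumptions made for this example; the source does not specify them.

```python
from dataclasses import dataclass


@dataclass
class CuratorOp:
    op: str            # "insert", "update", or "delete"
    name: str          # skill name the operation targets
    content: str = ""  # new skill text (unused for delete)


def apply_ops(repo: dict[str, str], ops: list[CuratorOp]) -> tuple[dict[str, str], int]:
    """Apply ops to a copy of the repo; return (new_repo, number_of_legal_ops)."""
    new_repo = dict(repo)
    legal = 0
    for op in ops:
        if op.op == "insert" and op.name not in new_repo and op.content:
            new_repo[op.name] = op.content
        elif op.op == "update" and op.name in new_repo and op.content:
            new_repo[op.name] = op.content
        elif op.op == "delete" and op.name in new_repo:
            del new_repo[op.name]
        else:
            continue  # unknown or inapplicable operation: skip it (assumed rule)
        legal += 1
    return new_repo, legal


repo = {"heat_object": "v1: put the object in the microwave"}
ops = [
    CuratorOp("insert", "verify_state", "Check the object state before acting."),
    CuratorOp("update", "heat_object", "v2: open the microwave first, then heat."),
    CuratorOp("delete", "missing_skill"),
    CuratorOp("rename", "heat_object", "x"),
]
new_repo, legal = apply_ops(repo, ops)
print(sorted(new_repo), legal)
```

Keeping the executor frozen means only these repository edits change between tasks, so any change in downstream success can be attributed to the curator.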

Skills follow Anthropic’s SKILL.md format: a YAML front‑matter containing the skill name and retrieval cues, and a Markdown body with workflow, constraints, and usage notes.
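
To make the format concrete, here is a small sketch that splits a SKILL.md string into front matter and body. The example skill is invented for illustration, the front‑matter keys `name` and `description` are assumed here to carry the skill name and retrieval cues, and `parse_skill` is a hypothetical helper that handles only flat `key: value` lines.

```python
# Invented example skill; not taken from the SkillOS repository.
SKILL_MD = """\
---
name: verify-object-state
description: Use before heating, cooling, or cleaning an object to confirm its current state.
---
# Verify object state

## Workflow
1. Examine the object.
2. Act only if the observed state differs from the goal state.

## Constraints
- Do not repeat an action that has already succeeded.
"""


def parse_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md string into (front_matter, markdown_body).

    Handles only flat `key: value` front matter; a real YAML parser is
    needed for nested or multi-line values.
    """
    if not text.startswith("---\n"):
        raise ValueError("missing front matter")
    header, sep, body = text[4:].partition("\n---\n")
    if not sep:
        raise ValueError("unterminated front matter")
    meta = {}
    for line in header.splitlines():
        if line.strip():
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta, body


meta, body = parse_skill(SKILL_MD)
print(meta["name"])
print(body.splitlines()[0])
```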

Training Strategy: Grouped Task Streams + Composite Rewards

Grouped Task Streams

Training data are organized into related task groups using Gemini‑2.5‑Pro annotations (Topic/Skill/Concept/Strategy/Pitfall) for similarity clustering. Each training step starts from an empty SkillRepo, executes tasks sequentially, and updates the repo after each task, allowing early‑generated skills to be tested by later tasks.
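
One simple way to picture the grouping step is greedy clustering over annotation tags. The sketch below is an assumption, not the paper's procedure: the Jaccard measure, the threshold of 0.3, the `group_tasks` helper, and the tag strings are all invented for illustration.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two tag sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_tasks(annotations: dict[str, set[str]], threshold: float = 0.3) -> list[list[str]]:
    """Greedy clustering: join the first group whose seed task is similar enough."""
    groups: list[list[str]] = []
    for task_id, tags in annotations.items():
        for group in groups:
            seed_tags = annotations[group[0]]
            if jaccard(tags, seed_tags) >= threshold:
                group.append(task_id)
                break
        else:
            groups.append([task_id])  # no similar group found: start a new one
    return groups


# Hypothetical annotations in the Topic/Skill/Strategy/Pitfall style.
annotations = {
    "t1": {"topic:kitchen", "skill:heat", "pitfall:wrong-receptacle"},
    "t2": {"topic:kitchen", "skill:heat", "strategy:check-state"},
    "t3": {"topic:shopping", "skill:filter", "strategy:compare-price"},
    "t4": {"topic:shopping", "skill:filter", "pitfall:wrong-size"},
}
print(group_tasks(annotations))
```

Each resulting group would then be run as one sequential stream from an empty repository, so that a skill written after an early task can be exercised by the related tasks that follow.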

Composite Reward Design

The curator’s objective combines four weighted signals:

Average success rate of subsequent tasks in the same group (delayed executor signal).

Function‑call legality ratio (ensures correct format).

Content quality scored by Qwen3‑32B as an external judge.

Compression rate (S_i) to penalize skill bloat.
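
A minimal sketch of how the four signals might be combined follows. The weight values, the `RewardWeights` and `composite_reward` names, and the assumption that every signal is normalized to [0, 1] (with a higher compression value meaning a more compact repository) are illustrative choices; the source does not give the actual weights or the exact definition of S_i.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the real values are not given in the source.
    success: float = 1.0
    legality: float = 0.2
    quality: float = 0.3
    compression: float = 0.1


def delayed_success(outcomes: list[bool], i: int) -> float:
    """Average success of the tasks that come after position i in the group."""
    later = outcomes[i + 1:]
    return sum(later) / len(later) if later else 0.0


def composite_reward(
    success_rate: float,
    legality_ratio: float,
    judge_score: float,
    compression: float,
    w: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of the four signals, each assumed to lie in [0, 1]."""
    return (
        w.success * success_rate
        + w.legality * legality_ratio
        + w.quality * judge_score
        + w.compression * compression
    )


outcomes = [False, True, True, False]  # per-task success within one group
signal = delayed_success(outcomes, 0)  # credit for the curation step after task 0
print(round(composite_reward(signal, 1.0, 0.8, 0.6), 3))
```

The delayed term is what ties a curation decision to downstream performance: a skill written after task 0 is rewarded according to how the remaining tasks in the group turn out.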

Training uses Group Relative Policy Optimization (GRPO) to reduce variance by comparing rollouts within a group. Configuration: Qwen3‑8B base, 16 H100 GPUs, ~3 days on ALFWorld, ~2.5 days on reasoning tasks.
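
The variance‑reduction idea can be shown with the group‑relative advantage commonly used in GRPO: each rollout's reward is standardized against the other rollouts sampled for the same input. This is a generic sketch; `group_relative_advantages` is a hypothetical helper, and the clipping and KL terms of the full objective, as well as SkillOS's exact normalization, are not covered.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one group of rollouts: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four curator rollouts for the same task stream, scored by a composite reward.
rewards = [1.0, 0.0, 1.0, 0.0]
print([round(a, 3) for a in group_relative_advantages(rewards)])
```

Because advantages are computed relative to the group mean, no separate value network is needed, and a group whose rollouts all receive the same reward contributes zero advantage.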

Experimental Results: Gains in Performance and Efficiency

SkillOS is evaluated on multi‑turn agent tasks (ALFWorld, WebShop) and single‑turn reasoning tasks (AIME24/25, GPQA‑Diamond) with three frozen executors: Qwen3‑8B, Qwen3‑32B, and Gemini‑2.5‑Pro.

ALFWorld

On Qwen3‑8B, SkillOS raises the average success rate from 55.7% (ReasoningBank baseline) to 61.2% while reducing interaction steps by 6%. With Gemini‑2.5‑Pro the gain reaches +9.5% (66.4% → 80.2%).

WebShop & Reasoning

On WebShop the success rate improves to 16.5%, and reasoning accuracy gains 4.2%. Gains are larger on agent tasks because multi‑turn trajectories provide richer reusable patterns.

Cross‑Domain Generalization

Although trained only with Qwen3‑8B, the curator transfers to Qwen3‑32B and Gemini‑2.5‑Pro with stable improvements. Skills learned for reasoning tasks generalize well to ALFWorld and WebShop, while task‑specific skills (e.g., WebShop click flow) transfer less.

In‑Depth Analysis: What the Curator Learned

Ablation Studies

Removing the content‑quality reward drops performance by 2.6 %; removing grouped training causes the largest drop of 3.9 %, confirming the necessity of learning curation within related task streams.

Evolution of Curator Operations

Early training is dominated by Insert actions. Over time, Update operations increase, shifting from expansion to refinement, while Delete remains low but grows, indicating the compression reward curbs skill bloat.

Structural Evolution of the Skill Library

At the level of individual skills, early entries are generic “Tips” and “Guidance”; later ones contain failure‑handling logic, conditional branches, and retry strategies. At the level of the whole repository, the library shifts from task‑specific skills to meta‑strategy skills (state verification, system search, recovery, alternatives), which come to make up more than 50% of the library.

Skill Utilization Attribution

Compared with baselines, SkillOS achieves a 100% skill call rate (vs. 87.9%), a higher proportion of tasks that successfully use skills (61.2% vs. 53.6%), higher coverage (88.6% vs. 72.9%), and fewer skills per task (1.95 vs. 2.24), indicating more precise matching.

[Figure: evolution of the curator's operation distribution]

https://arxiv.org/pdf/2605.06614
SkillOS: Learning Skill Curation for Self-Evolving Agents
