SkillOS: How Skill Governance Powers Self‑Evolving AI Agents
SkillOS addresses the one‑off nature of current LLM agents by introducing a closed‑loop system where a trainable Skill Curator continuously extracts, updates, and manages reusable skills from execution traces, leading to measurable gains in success rates, efficiency, and cross‑task generalization.
Current LLM‑based agents act as one‑off problem solvers; they start from scratch for each new task and cannot accumulate experience. A self‑evolving agent instead runs a closed loop:
1. Receive task → 2. Retrieve relevant Skill → 3. Execute task → 4. Extract experience from trajectory → 5. Update Skill repository → back to step 1
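The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `Skill`, `SkillRepo`, `execute`, `curate`, and `run_loop` are hypothetical names, retrieval is plain keyword overlap, and the executor and curator are trivial stand-ins for the LLM policies.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    cues: set[str]   # retrieval cues (hypothetical: plain keywords)
    body: str


@dataclass
class SkillRepo:
    skills: dict[str, Skill] = field(default_factory=dict)

    def retrieve(self, task: str) -> list[Skill]:
        # Step 2: return skills whose cues overlap the task's words.
        words = set(task.lower().split())
        return [s for s in self.skills.values() if s.cues & words]


def execute(task: str, skills: list[Skill]) -> dict:
    # Step 3: stand-in for the executor. Here a task "succeeds"
    # only if at least one skill was retrieved (toy assumption).
    return {"task": task, "used": [s.name for s in skills], "success": bool(skills)}


def curate(repo: SkillRepo, trajectory: dict) -> None:
    # Steps 4-5: stand-in for the curator. It inserts one skill keyed
    # by the first word of the task if no such skill exists yet.
    key = trajectory["task"].lower().split()[0]
    if key not in repo.skills:
        repo.skills[key] = Skill(key, {key}, f"Notes from: {trajectory['task']}")


def run_loop(tasks: list[str]) -> tuple[SkillRepo, list[dict]]:
    repo, log = SkillRepo(), []
    for task in tasks:                    # Step 1: receive task
        skills = repo.retrieve(task)      # Step 2: retrieve relevant skills
        trajectory = execute(task, skills)  # Step 3: execute
        curate(repo, trajectory)          # Steps 4-5: extract and update
        log.append(trajectory)
    return repo, log
```

Running `run_loop(["heat the egg", "heat the mug"])` shows the intended effect: the first task finds no skill, while the second retrieves the skill written after the first.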
The critical bottleneck in this loop is Skill Curation: extracting high‑quality, reusable skills from massive interaction traces.
Existing approaches suffer from three major limitations: manual curation (e.g., Anthropic’s Skills Repository), which does not scale; heuristic, rule‑based memory operations that receive no feedback from downstream performance; and short‑horizon training, which provides only sparse signals for complex operations such as update or delete.
System Architecture: Decoupled Executor and Curator
SkillOS adopts a dual‑module design:
Agent Executor (π_L): Executes actions based on current task observations and retrieved skills; its parameters are frozen.
Skill Curator (π_S): Evaluates skill quality from execution traces and performs Insert/Update/Delete; it is trainable via GRPO.
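The curator's three operations can be pictured as edits to a name-to-content mapping. The sketch below is an assumption about how such operations might be represented and validated; `CuratorOp` and `apply_op` are hypothetical names, and the source does not specify the actual function-call schema.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class CuratorOp:
    kind: Literal["insert", "update", "delete"]
    name: str
    content: Optional[str] = None  # required for insert and update


def apply_op(repo: dict[str, str], op: CuratorOp) -> bool:
    """Apply one operation in place; return False if it is invalid."""
    if op.kind == "insert":
        # Inserting needs content and must not overwrite an existing skill.
        if op.content is None or op.name in repo:
            return False
        repo[op.name] = op.content
        return True
    if op.kind == "update":
        # Updating needs content and an existing skill to refine.
        if op.content is None or op.name not in repo:
            return False
        repo[op.name] = op.content
        return True
    if op.kind == "delete":
        # Deleting succeeds only if the skill was present.
        return repo.pop(op.name, None) is not None
    return False
```

Returning a boolean for each call makes it easy to compute the share of valid operations in a rollout, which is one way a format-legality signal could be measured.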
Skills follow Anthropic’s SKILL.md format: a YAML front‑matter containing the skill name and retrieval cues, and a Markdown body with workflow, constraints, and usage notes.
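To make the format concrete, here is an illustrative skill file and a small parser. The skill content is invented for illustration, the front matter uses the `name` and `description` fields of Anthropic's skill format (with `description` playing the role of the retrieval cue), and `parse_skill` is a hypothetical helper that only handles flat `key: value` lines rather than full YAML.

```python
SKILL_MD = """\
---
name: verify-object-state
description: Use when a task requires confirming an object's state before acting.
---
# Verify object state

## Workflow
1. Look at the target object.
2. Confirm its state matches the goal before acting.

## Constraints
- Do not assume state from earlier steps.
"""


def parse_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a skill file into front-matter fields and a Markdown body."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing front matter")
    end = lines.index("---", 1)  # closing delimiter of the front matter
    meta = {}
    for line in lines[1:end]:
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body
```

A real pipeline would more likely use a YAML library for the front matter; the hand-rolled split here just keeps the example dependency-free.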
Training Strategy: Grouped Task Streams + Composite Rewards
Grouped Task Streams
Training data are organized into related task groups using Gemini‑2.5‑Pro annotations (Topic/Skill/Concept/Strategy/Pitfall) for similarity clustering. Each training step starts from an empty SkillRepo, executes tasks sequentially, and updates the repo after each task, allowing early‑generated skills to be tested by later tasks.
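A rough sketch of the two ideas in this paragraph follows. It makes strong simplifying assumptions: tasks are grouped by an exact label match instead of similarity clustering over annotations, and the feedback for the curation done after task i is taken to be the mean success of the tasks that follow it in the stream, with 0.0 for the last task. `group_tasks` and `delayed_signal` are hypothetical names.

```python
from collections import defaultdict


def group_tasks(tasks: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (task, label) pairs by label, preserving task order."""
    groups = defaultdict(list)
    for task, label in tasks:
        groups[label].append(task)
    return dict(groups)


def delayed_signal(successes: list[bool]) -> list[float]:
    """For each position i, mean success of the tasks after i in the stream.

    Skills written after task i can only be tested by later tasks, so the
    last position has nothing to learn from and gets 0.0 (an assumption).
    """
    signals = []
    for i in range(len(successes)):
        later = successes[i + 1:]
        signals.append(sum(later) / len(later) if later else 0.0)
    return signals
```

For a stream with outcomes `[False, True, True, False]`, the curation after the first task is credited with the two later successes out of three remaining tasks.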
Composite Reward Design
The curator’s objective combines four weighted signals:
Average success rate of subsequent tasks in the same group (delayed executor signal).
Function‑call legality ratio (ensures correct format).
Content quality scored by Qwen3‑32B as an external judge.
Compression rate (S_i) to penalize skill bloat.
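The four signals can be combined as a weighted sum. The sketch below assumes every signal is already normalized to [0, 1] and that a higher compression score means a more compact skill; the weight values are placeholders, since the source does not state them, and `RewardWeights` and `composite_reward` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Placeholder values; the actual weights are not given in the source.
    task: float = 1.0
    valid: float = 0.5
    quality: float = 0.5
    compression: float = 0.25


def composite_reward(
    later_success: float,   # mean success of subsequent tasks in the group
    valid_ratio: float,     # share of well-formed function calls
    judge_score: float,     # content quality from the external judge
    compression: float,     # compression score for the written skills
    w: RewardWeights = RewardWeights(),
) -> float:
    signals = (later_success, valid_ratio, judge_score, compression)
    if any(not 0.0 <= s <= 1.0 for s in signals):
        raise ValueError("all signals are assumed to lie in [0, 1]")
    return (
        w.task * later_success
        + w.valid * valid_ratio
        + w.quality * judge_score
        + w.compression * compression
    )
```

With the placeholder weights, `composite_reward(0.5, 1.0, 0.8, 0.4)` evaluates to 0.5 + 0.5 + 0.4 + 0.1 = 1.5.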
Training uses Group Relative Policy Optimization (GRPO) to reduce variance by comparing rollouts within a group. Configuration: Qwen3‑8B base, 16 H100 GPUs, ~3 days on ALFWorld, ~2.5 days on reasoning tasks.
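The within-group comparison at the heart of GRPO is the normalization of each rollout's reward against the other rollouts in its group. A minimal sketch, assuming the common formulation with the group mean and population standard deviation (the function name and `eps` value are illustrative choices, not taken from the source):

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the other rollouts in its group.

    A rollout is scored by how far it sits above or below the group
    mean, in units of the group's standard deviation. `eps` avoids
    division by zero when all rollouts receive the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the baseline is the group's own mean, no separate value network is needed, and a group whose rollouts all score the same contributes zero advantage.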
Experimental Results: Gains in Performance and Efficiency
Evaluation covers multi‑turn agent tasks (ALFWorld, WebShop) and single‑turn reasoning tasks (AIME24/25, GPQA‑Diamond), with Qwen3‑8B, Qwen3‑32B, and Gemini‑2.5‑Pro as frozen executors.
ALFWorld
SkillOS raises average success rate from 55.7 % (ReasoningBank baseline) to 61.2 % on Qwen3‑8B, reducing interaction steps by 6 %. With Gemini‑2.5‑Pro the gain reaches +9.5 % (66.4 % → 80.2 %).
WebShop & Reasoning
Success rate improves to 16.5 % on WebShop, and reasoning accuracy rises by 4.2 %. Gains are larger on agent tasks because trajectories provide richer reusable patterns.
Cross‑Domain Generalization
Although trained only with Qwen3‑8B, the curator transfers to Qwen3‑32B and Gemini‑2.5‑Pro with stable improvements. Skills learned for reasoning tasks generalize well to ALFWorld and WebShop, while task‑specific skills (e.g., WebShop click flow) transfer less.
In‑Depth Analysis: What the Curator Learned
Ablation Studies
Removing the content‑quality reward drops performance by 2.6 %; removing grouped training causes the largest drop of 3.9 %, confirming the necessity of learning curation within related task streams.
Evolution of Curator Operations
Early training is dominated by Insert actions. Over time, Update operations increase, shifting from expansion to refinement, while Delete remains low but grows, indicating the compression reward curbs skill bloat.
Structural Evolution of the Skill Library
At the level of individual skills, early entries are generic “Tips” and “Guidance”; later ones contain failure‑handling logic, conditional branches, and retry strategies. At the level of the whole repository, the mix shifts from task‑specific skills to meta‑strategy skills (state verification, system search, recovery, alternatives), which come to make up more than 50 % of the library.
Skill Utilization Attribution
Compared with baselines, SkillOS achieves a 100 % skill call rate (vs 87.9 %), a higher proportion of tasks that successfully use skills (61.2 % vs 53.6 %), higher coverage (88.6 % vs 72.9 %), and fewer skills per task (1.95 vs 2.24), which points to more precise matching.
https://arxiv.org/pdf/2605.06614
SkillOS: Learning Skill Curation for Self-Evolving Agents