SkillOpt: Enabling Self‑Evolving Agent Skills via Text‑Space Optimization

SkillOpt reframes LLM agent skills as trainable external state, applying a deep‑learning‑style optimizer to systematically improve skill documents, and demonstrates across six benchmarks, seven models, and three execution modes that this approach yields consistent, large gains and robust transferability.

PaperAgent

Jun 4, 2026

Background and Motivation

Current LLM agents lack a principled way to create skills: skills are hand‑crafted by experts, generated in a single LLM pass, or patched through loose self‑revision loops. None of these methods offers the systematic, reproducible, and constrained optimization that deep‑learning optimizers provide.

Core Idea: Treat Skill as Trainable "External State"

SkillOpt introduces the notion of text‑space optimization, bringing the toolbox of deep‑learning optimizers to natural‑language skill documents. The authors map deep‑learning concepts to skill‑specific counterparts:

Parameters → skill document

Gradient → trajectory‑derived edit direction

Learning Rate → edit budget

Validation → held‑out selection gate

Momentum/Batch → epoch‑wise slow/meta update

This design separates concerns: the target model (Student) is frozen at deployment and only consumes the optimized skill text, while the optimizer model (Teacher) operates offline, adding no inference overhead.
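A minimal sketch of this separation, assuming the skill is simply prepended to the task prompt (the article does not specify the injection format). The Student is modeled as any frozen prompt-to-completion callable; `build_prompt`, `run_student`, and `echo_model` are hypothetical names used only for illustration.

```python
from typing import Callable

# Hypothetical type: any frozen model exposed as prompt -> completion.
Student = Callable[[str], str]


def build_prompt(skill_text: str, task: str) -> str:
    """Assumed injection format: skill document first, then the task."""
    return f"# Skill\n{skill_text.strip()}\n\n# Task\n{task.strip()}\n"


def run_student(student: Student, skill_text: str, task: str) -> str:
    # The Student itself never changes; only skill_text differs between runs.
    return student(build_prompt(skill_text, task))


def echo_model(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs without an API."""
    return f"saw {len(prompt.splitlines())} prompt lines"


if __name__ == "__main__":
    skill = "Always check the sheet name before reading a cell range."
    print(run_student(echo_model, skill, "Sum column B of the Sales sheet."))
```

Because the skill is plain text, swapping in a different Student only means passing a different callable; nothing about the skill document has to change.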

A Deep‑Learning‑Style Optimization Loop

The full SkillOpt workflow consists of six rigorously defined stages:

Forward Pass – Rollout Evidence: The target model executes a batch of tasks using the current skill, recording full trajectories (tool calls, observations, final answers, validator feedback). The rollout batch size controls evidence noise.

Backward Pass – Minibatch Reflection: The optimizer partitions successful and failed trajectories into reflection minibatches. Small batches expose reusable procedural errors (e.g., "Agent always queries the wrong data source").

Bounded Text Updates: Unlike unconstrained rewriting, edits are limited by a text‑learning‑rate budget (default cosine decay from 4 to 2). The optimizer ranks pooled edits by expected utility and retains only the top‑k.

Validation Gate: Each candidate skill must pass a held‑out selection split; only edits with a strictly higher score than the current selection are accepted (ties are rejected). Accepted candidates become the new current skill, and the best version is saved to best_skill.md.

Rejected‑Edit Buffer: Rejected edits are stored in an epoch‑local buffer that records failure patterns. Future optimizer calls consult this buffer to avoid repeating harmful edits, closing a negative‑feedback loop.

Epoch‑Wise Slow/Meta Update: Fast updates learn from the current batch, while slow/meta updates capture cross‑epoch regularities. After each epoch, the optimizer classifies outcomes as improvement, degradation, persistent failure, or stable success, and writes a protected slow‑update field, delimited by start and end markers, into the skill document. A separate meta skill summarizes effective versus harmful edit patterns for future guidance.
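The stages above can be sketched as a small loop. This is a sketch under stated assumptions, not the authors' implementation: rollout and reflection are collapsed into a hypothetical `propose` callable, `score` stands in for the held‑out selection split, edits are appended lines gated one at a time, the rejected buffer is a plain set of edit texts, and the epoch‑wise slow/meta update is left out. All names (`Edit`, `edit_budget`, `optimize`, the toy functions) are illustrative.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass(frozen=True)
class Edit:
    text: str       # proposed line to append to the skill document
    utility: float  # optimizer's estimate of expected utility


def edit_budget(step: int, total_steps: int, hi: int = 4, lo: int = 2) -> int:
    """Cosine decay of the per-step edit budget from hi down to lo."""
    if total_steps <= 1:
        return hi
    progress = step / (total_steps - 1)
    return round(lo + 0.5 * (hi - lo) * (1 + math.cos(math.pi * progress)))


def optimize(
    skill: str,
    propose: Callable[[str], List[Edit]],  # assumed: rollout + reflection
    score: Callable[[str], float],         # assumed: held-out selection score
    steps: int,
) -> Tuple[str, float, Set[str]]:
    current, current_score = skill, score(skill)
    rejected: Set[str] = set()  # rejected-edit buffer (simplified to texts)
    for step in range(steps):
        k = edit_budget(step, steps)
        pooled = [e for e in propose(current) if e.text not in rejected]
        top_k = sorted(pooled, key=lambda e: e.utility, reverse=True)[:k]
        for edit in top_k:
            candidate = current + "\n" + edit.text
            candidate_score = score(candidate)
            if candidate_score > current_score:  # strict: ties are rejected
                current, current_score = candidate, candidate_score
            else:
                rejected.add(edit.text)
    return current, current_score, rejected


# Toy stand-ins so the loop can be run end to end.
POOL = [
    Edit("bad: guess the answer", 0.9),
    Edit("neutral: be polite", 0.7),
    Edit("good: check the sheet name", 0.5),
]


def toy_propose(skill: str) -> List[Edit]:
    present = set(skill.splitlines())
    return [e for e in POOL if e.text not in present]


def toy_score(skill: str) -> float:
    lines = skill.splitlines()
    good = sum(line.startswith("good") for line in lines)
    bad = sum(line.startswith("bad") for line in lines)
    return float(good - bad)


if __name__ == "__main__":
    final_skill, final_score, buffer = optimize("base skill", toy_propose, toy_score, 3)
    print(final_skill)
    print(final_score, sorted(buffer))
```

With the toy functions, only the edit that strictly raises the score survives; the harmful edit and the neutral (tied) edit land in the rejected buffer and are filtered out of later steps.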

Experimental Results: 52/52 Win Record

SkillOpt was evaluated on six benchmarks (SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMathematicianBench, ALFWorld), seven target models ranging from GPT‑5.5 to Qwen‑3.5‑4B, and three execution modes (Direct Chat, Codex, Claude Code). Key findings include:

GPT‑5.5 Direct Chat: average score rose from 58.8 to 82.3 (+23.5), surpassing the strongest per‑cell baseline (76.9) by +5.4.

Programmatic tasks saw the largest gains: SpreadsheetBench 41.8→80.7 (+38.9), OfficeQA 33.1→72.1 (+39.0), LiveMathematicianBench 37.6→66.9 (+29.3).

Smaller models benefited more: GPT‑5.4‑nano more than doubled on DocVQA (30.8→80.2, about 2.6×) and roughly doubled on ALFWorld (34.3→69.4).

Tool‑harness modes: Codex harness averaged +24.8, Claude Code harness +19.1, both clearly outperforming EvoSkill and other harness competitors.
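The gains quoted above can be re-derived from the before and after scores. The short check below uses only numbers stated in this section; the helper name `gain` and the dictionary labels are ours.

```python
from typing import Dict, Tuple


def gain(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in points, after/before ratio)."""
    return round(after - before, 1), round(after / before, 2)


# (before, after) pairs copied from the findings listed above.
RESULTS: Dict[str, Tuple[float, float]] = {
    "GPT-5.5 Direct Chat (average)": (58.8, 82.3),
    "SpreadsheetBench": (41.8, 80.7),
    "OfficeQA": (33.1, 72.1),
    "LiveMathematicianBench": (37.6, 66.9),
    "GPT-5.4-nano on DocVQA": (30.8, 80.2),
    "GPT-5.4-nano on ALFWorld": (34.3, 69.4),
}

if __name__ == "__main__":
    for name, (before, after) in RESULTS.items():
        points, ratio = gain(before, after)
        print(f"{name}: +{points} points, {ratio}x")
```

Reporting both the point gain and the ratio makes it easier to compare models that start from very different baselines.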

Ablation Studies: Verifying the Deep‑Learning Analogy

Tables 2 and 3 (shown as figures) report controlled‑variable experiments that test whether each component is necessary.

Evidence Volume: Procedural benchmarks (SpreadsheetBench, LiveMathematicianBench) improve steadily with more training data, while SearchQA saturates after 20% of the data.

Batch‑Size Robustness: Reflection minibatch sizes from 1 to 32 keep SearchQA scores within ±1.5 points; the default Bm=8 is near optimal.

Learning‑Rate Schedule: Lt=4 is stable; constant, cosine, and linear schedules all work, indicating that bounded updates matter more than exact hyper‑parameters.

Rejected Buffer: Removing it drops SpreadsheetBench by 4.6 points, confirming its stabilizing role.

Slow/Meta Update: The most critical component; removing meta skill and slow update collapses SpreadsheetBench from 77.5 to 55.0 (‑22.5), showing that cross‑epoch guidance prevents local edits from erasing long‑term procedural knowledge.
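The slow update removed in the last ablation depends on the four per‑epoch outcome labels named in the optimization loop. The article does not define them precisely; the sketch below assumes the natural reading, comparing each task's success at the start and at the end of an epoch. The function names and the task IDs are hypothetical.

```python
from collections import Counter
from typing import Dict


def classify(before: bool, after: bool) -> str:
    """Label one task by its success at epoch start vs. epoch end (assumed rule)."""
    if not before and after:
        return "improvement"
    if before and not after:
        return "degradation"
    if not before and not after:
        return "persistent failure"
    return "stable success"


def summarize_epoch(start: Dict[str, bool], end: Dict[str, bool]) -> Counter:
    """Count outcome labels over all tasks seen at the start of the epoch."""
    return Counter(classify(start[task], end[task]) for task in start)


if __name__ == "__main__":
    start = {"t1": False, "t2": True, "t3": False, "t4": True}
    end = {"t1": True, "t2": False, "t3": False, "t4": True}
    print(summarize_epoch(start, end))
```

A summary like this is the kind of cross‑epoch signal a slow update could write into the protected field, since it separates edits that fixed tasks from edits that broke previously solved ones.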

Transferability: Skills as Portable Assets

Optimized skills are not over‑fitted prompts; they are auditable, reusable text artifacts. Three transfer scenarios were tested:

Cross‑Model Transfer: SpreadsheetBench skill trained on GPT‑5.4 improves GPT‑5.4‑mini by +9.4 (82% of in‑domain gain) and GPT‑5.4‑nano by +3.0.

Cross‑Harness Transfer: Codex‑trained skill applied to Claude Code yields an absolute gain of +59.7 (22.1→81.8), surpassing Claude Code’s own in‑domain reference, indicating the learned rules are procedural knowledge rather than hard‑coded CLI commands.

Cross‑Benchmark Transfer: Skill transferred from OlympiadBench to Omni‑MATH improves three model scales by +1.3 to +3.7, suggesting that the skill encodes reusable mathematical reasoning procedures.

Conclusion

SkillOpt demonstrates that treating agent skills as trainable external state and applying a deep‑learning‑style optimizer creates measurable value beyond simple model distillation. The systematic optimization loop, bounded updates, validation gate, and meta‑skill guidance together yield robust performance gains; the ablations confirm that each component contributes, and the resulting skills transfer across models, harnesses, and tasks.
