Artificial Intelligence 11 min read

Can Agent Skills Be Trained Like Neural Networks? SkillOpt Demonstrates Success

SkillOpt treats an agent’s Skill document as a trainable external state, applying classic deep‑learning tools such as epochs, batch size, learning rate and validation gating, and in experiments across 52 benchmark units it lifts GPT‑5.5 performance by an average of 23.5 points while enabling cross‑model and cross‑environment transfer with no additional inference cost.

SuanNi

May 27, 2026

Can Agent Skills Be Trained Like Neural Networks? SkillOpt Demonstrates Success

Problem and Motivation

Agent skills are typically obtained via three methods—human‑written manuals, one‑shot LLM generation, or self‑evolution from execution traces. All three suffer from the absence of a proper optimizer, leading to brittle execution strategies such as searching the wrong source, formatting incorrectly, or applying weak conclusions despite correct reasoning.

SkillOpt Training Loop

SkillOpt treats the Skill document as a frozen external state of the agent and uses an independent optimizer model to edit it. The training loop mirrors conventional deep‑learning training:

Forward pass : The target model runs a batch of tasks with the current Skill, collecting trajectories and scores.

Backward pass : The optimizer separates successful and failed trajectories, reflects on them in small batches, and generates structured edit operations (add, delete, replace).

Bounded text update : Edits are ranked by expected utility and only the top Lt (the text learning rate) are applied, preserving continuity of Skill versions.

Validation gating : Each candidate Skill is evaluated on a held‑out selection split; only edits that strictly improve the score are accepted.

Reject‑edit buffer : Rejected edits are stored within the epoch, providing negative feedback for subsequent reflections.

Epoch‑level slow/fast updates : Fast updates learn from the current batch, while slow updates aggregate trends across epochs, classifying outcomes into improvement, regression, persistent failure, or stable success.

Evaluation Results

Experiments cover six benchmarks, seven target models, and three execution modes (direct dialogue, Codex harness, Claude Code harness), totaling 52 evaluation units. Using GPT‑5.5 in direct dialogue, the average score rises from 58.8 (no Skill) to 82.3 (+23.5), outperforming the strongest baseline by 5.4 points. Notable gains include SpreadsheetBench (+38.9), OfficeQA (+39.0), LiveMath (+29.3). Smaller models benefit even more: GPT‑5.4‑nano on DocVQA improves by +49.4, and Qwen3.5‑4B on ALFWorld improves by +50.7.

In the Codex harness, SkillOpt raises SpreadsheetBench from 67.5 to 85.0 (+17.5). In the Claude Code harness, average gains over the EvoSkill baseline are +14.0 and +3.2 points respectively.

Transferability

SkillOpt’s optimized Skills were evaluated in three transfer scenarios:

Cross‑model transfer : A Skill optimized for GPT‑5.4 improves GPT‑5.4‑mini (+9.4) and GPT‑5.4‑nano (+3.0); a LiveMath Skill transferred to GPT‑5.4‑mini (+4.5) and GPT‑5.4‑nano (+5.6), surpassing on‑target optimization.

Cross‑environment transfer : A Skill trained in the Codex environment applied to Claude Code raises the score from 22.1 to 81.8 (+59.7); the reverse transfer adds +43.6, despite differing tool APIs.

Cross‑benchmark transfer : A Math Skill tuned on OlympiadBench improves Omni‑MATH by +1.8 to +3.7 across three models, demonstrating reusable procedural knowledge.

Optimizer Strength and Training Cost

A stronger offline optimizer (GPT‑5.5) wins more units than a model‑matched optimizer, which nevertheless recovers 56 %–74 % of the gains. Token cost per point varies widely: SpreadsheetBench requires 0.6 M tokens per point, OfficeQA 1.1 M, LiveMath 3.6 M, while SearchQA and DocVQA need 37.9 M and 46.4 M respectively due to longer multimodal trajectories.

Optimized Skill documents range from 379 to 1,995 tokens (median ≈ 920). Accepted edits per Skill are 1–4 (median 2.5). Large score jumps often stem from a single accepted edit (e.g., LiveMath +29.3, OfficeQA +39.0).

All training costs are incurred once; deployment adds only a static text file, incurring zero additional inference overhead.

References

https://github.com/microsoft/SkillOpt

https://arxiv.org/pdf/2605.23904

https://microsoft.github.io/SkillOpt/

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

LLM benchmark evaluation Agent Skill Deep Learning Optimization SkillOpt cross‑model transfer

Written by

SuanNi

A community for AI developers that aggregates large-model development services, models, and compute power.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.