What Problems Does Microsoft SkillOpt Solve for Self‑Evolving Agent Skills?

SkillOpt treats an agent's skill file as a trainable text document, iteratively editing it with a frozen LLM to boost performance on tasks with known correct answers, achieving large benchmark gains without fine‑tuning model weights while remaining portable and auditable.

High Availability Architecture
High Availability Architecture
High Availability Architecture
What Problems Does Microsoft SkillOpt Solve for Self‑Evolving Agent Skills?

What is SkillOpt?

SkillOpt is a text‑space optimizer that treats a skill document (natural‑language instructions for an AI agent) as a trainable object. It runs a loop analogous to neural‑network training: a frozen target model executes the task using the current skill, an optimizer LLM proposes bounded edits, and a validation gate accepts only edits that improve the score on a held‑out validation set. The loop outputs a compact best_skill.md file.

Why is SkillOpt effective?

Two reasons: (1) it yields large gains without modifying model weights, and (2) the resulting skill files are portable across models and harnesses. In six benchmarks (search QA, spreadsheets, documents, multimodal QA, math, embodied‑agent) covering seven target models and three execution harnesses, SkillOpt achieved the best or tied‑best score in all 52 model‑benchmark‑harness cells. On GPT‑5.5 direct‑chat the average score rose from 58.8 to 82.3 (+23.5), beating the strongest baseline by +5.4 points. Notable single‑benchmark jumps: SpreadsheetBench 41.8 → 80.7, OfficeQA 33.1 → 72.1. A skill trained on the Codex harness transferred to Claude Code harness with a +59.7 point improvement.

Required inputs

SkillOpt expects a split directory with three JSON files:

data/my_split/
├── train/items.json   # examples for the optimizer
├── val/items.json     # held‑out validation set
└── test/items.json    # unseen test set (used only for final evaluation)

Each entry follows the SearchQA schema:

[
  {
    "id": "unique_item_id",
    "question": "Who wrote the novel ...",
    "context": "[DOC] relevant passage ...",
    "answers": ["expected answer"]
  }
]

For most tasks the SearchQA format works out of the box. The paper reports that 20‑40 examples are sufficient for an initial run; full experiments used a few hundred. If exact‑match scoring is insufficient, an LLM‑as‑judge scorer can be supplied, but only when necessary because noisy judges can destabilize the loop.

Installation

Clone the repository and install the package (optional web UI):

git clone https://github.com/microsoft/SkillOpt
pip install -e .[webui]   # optional dashboard

Documentation is available at https://microsoft.github.io/SkillOpt/.

Running a training run

The core command is scripts/train.py. Example using the built‑in SearchQA config:

python scripts/train.py \
    --config configs/searchqa/default.yaml \
    --split_dir /path/to/my_split \
    --optimizer_model gpt-5.5 \
    --target_model gpt-5.5 \
    --num_epochs 4 \
    --batch_size 40 \
    --out_root outputs/my_first_run

Key flags:

--target_model – model that will use the final skill (e.g., OpenAI, Anthropic, or a local deployment).
--optimizer_model – LLM that proposes edits; used only during training, incurring no deployment cost.

Recommended workflow: start with a cheap run (1–2 epochs, batch size equal to the dataset size) to verify that validation scores improve before scaling up.

Output structure

outputs/my_first_run/
├── best_skill.md               # final skill to deploy
├── history.json                # per‑step training history
├── skills/skill_vXXXX.md       # snapshot after each edit
├── steps/step_XXXX/            # edit and evaluation details
├── slow_update/epoch_XX/       # cross‑epoch integration logs
└── meta_skill/epoch_XX/        # optimizer‑side notes (not deployed)

Re‑running the same command resumes from the last completed step.

Evaluation and deployment

After training, evaluate the skill on the held‑out test split:

python scripts/eval_only.py \
    --config configs/searchqa/default.yaml \
    --skill outputs/my_first_run/best_skill.md \
    --split valid_unseen \
    --split_dir /path/to/my_split

Compare the test score against a no‑skill baseline; the difference quantifies the gain. Deployment consists of adding best_skill.md to the agent’s system prompt or loading it as a procedural‑memory file. No additional model weights or runtime calls to the optimizer are required.

Hyperparameters and ablations

Bounded edits (text learning rate Lt) – limits edits per step (default Lt=4, decays to 2). Ablations show any moderate budget outperforms unlimited rewriting.

Validation gate – accepts an edit only if the validation score strictly increases; ties are rejected.

Rejected‑edit buffer – records edits that failed the gate so the optimizer avoids repeating them. Removing the buffer degrades performance.

Slow/meta updates – momentum‑style updates applied at epoch boundaries that preserve durable improvements. Ablations indicate that removing both slow and meta updates causes the largest single‑benchmark drop.

Cost considerations

Each epoch consumes API tokens. Reported token usage ranges from ~0.6 M for simple procedural tasks to ~46 M for large multimodal tasks. The cost is incurred only during training; inference with the final skill adds no token cost.

SkillOpt can improve only tasks with verifiable correct answers. Incorrect or inconsistent examples will steer the optimizer toward wrong solutions.

Long‑term benefits

The optimized skill is a portable, auditable artifact that can be version‑controlled, edited, and transferred across models, scales, and harnesses without retraining. It turns ad‑hoc prompt tweaking into a repeatable engineering asset.

References

Paper: https://arxiv.org/pdf/2605.23904

Repository: https://github.com/microsoft/SkillOpt

Demo site: https://microsoft.github.io/SkillOpt/

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

AI Agentsprompt optimizationbenchmark evaluationSkillOptLLM training alternativetext-based skill learning
High Availability Architecture
Written by

High Availability Architecture

Official account for High Availability Architecture.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.