Artificial Intelligence 9 min read

Training Only the Skill Document While Keeping Model Weights Frozen (SkillOpt)

Microsoft Research introduces SkillOpt, a method that freezes large‑model weights and instead trains a natural‑language skill document as the sole learnable parameter, using a rollout‑reflect‑edit‑gate loop, achieving optimal results across 52 benchmark‑model‑environment combinations and demonstrating strong transferability.

AI Engineering

May 26, 2026

Training Only the Skill Document While Keeping Model Weights Frozen (SkillOpt)

Core Idea: Treat the Skill Document as Trainable Parameters

The large model and its agent remain frozen; the only mutable component is the skill.md file. SkillOpt translates the entire deep‑learning training pipeline into text space: rollout corresponds to forward pass, reflect to backward pass, edit budget to learning rate, and it includes mini‑batch, epoch, momentum, and slow‑update mechanisms.

Training Loop

Rollout : The target model executes tasks using the current skill, recording trajectories with scores.

Reflect : A separate optimizer model examines successful and failed batches to discover reusable patterns.

Edit : Candidate edits (add, delete, replace) are generated under an edit‑budget constraint.

Gate : Edits are accepted only if they improve performance on a held‑out validation set.

SkillOpt pipeline showing rollout, reflection, bounded edits, validation gate, slow update, and meta skill.

Stability Design

Key mechanisms include an edit budget to prevent a single rewrite from erasing good rules, a buffer that stores rejected edits as negative feedback, and slow updates plus a meta‑skill optimizer at the end of each epoch to provide long‑term signals. Deployment uses only the final skill document, incurring no extra inference overhead.

Results: 52/52 Combinations Win

Across six benchmarks (SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMath, ALFWorld), seven target models (GPT‑5.5/5.4/5.4‑mini/5.4‑nano/5.2, Qwen3.5‑4B, Qwen3.6‑35B‑A3B), and three execution environments (direct dialogue, Codex, Claude Code), SkillOpt achieved the best or tied‑best score in every one of the 52 settings, outperforming baselines such as Human skill, one‑shot LLM skill, Trace2Skill, TextGrad, GEPA, and EvoSkill.

Notable gains include: GPT‑5.5 improves by 23.5 points in direct dialogue, 21.8 points with Codex, and 18.6 points with Claude Code; the small GPT‑5.4‑nano model gains 35.1 points on ALFWorld.

Transferability

Skills trained on GPT‑5.4 for LiveMath increase GPT‑5.4‑nano performance by +15.2.

SpreadsheetBench skills learned with Codex raise Claude Code scores by +31.8.

When GPT‑5.4‑nano acts as its own optimizer, SpreadsheetBench improves by +10.4.

The exported best_skill.md is a reusable artifact that is not tied to any specific model or harness.

Additional Details

To avoid over‑fitting, training, validation, and test splits are disjoint; for SearchQA the split follows a 2:1:7 ratio, reserving 70 % of data for final evaluation, and cross‑model, cross‑harness, and cross‑benchmark transfer experiments further validate robustness.

A concurrent paper with the same name was submitted to the Agent Skills 2026 workshop; the authors note the original draft was titled “Skill as LoRA”, treating the skill as a LoRA‑style PEFT module.

Future work aims to package SkillOpt as an easy‑to‑use agent‑learning framework comparable to MMDetection or Detectron in computer vision.

Getting Started

The code is open‑source on GitHub and requires Python 3.10+. Quick start:

git clone https://github.com/microsoft/SkillOpt.git
cd SkillOpt
pip install -e .
# Optional ALFWorld benchmark support
pip install -e ".[alfworld]"
alfworld-download

Configure your API key (Azure OpenAI, native OpenAI, Anthropic Claude, or local vLLM deployments are supported, with Azure OpenAI recommended):

cp .env.example .env
# Edit .env to add your API key, then
source .env

Prepare data under train/, val/, and test/ directories following the JSON schema defined in skillopt/envs/<benchmark>/dataloader.py. Supported benchmarks include SearchQA, ALFWorld, DocVQA, LiveMathematicianBench, SpreadsheetBench, and OfficeQA.

Example training command for SearchQA:

python scripts/train.py \
    --config configs/searchqa/default.yaml \
    --split_dir /path/to/your/searchqa_split \
    --azure_openai_endpoint https://your-resource.openai.azure.com/ \
    --optimizer_model gpt-5.5 \
    --target_model gpt-5.5

The system supports checkpoint resumption; re‑running the same command continues from the last completed step. After training, the best skill document appears as best_skill.md in the output directory.

outputs/<run_name>/
├── config.json
├── history.json
├── runtime_state.json
├── best_skill.md
├── skills/skill_vXXXX.md
├── steps/step_XXXX/
├── slow_update/epoch_XX/
└── meta_skill/epoch_XX/

Evaluation‑only mode:

# Evaluate on test split only
python scripts/eval_only.py \
  --config configs/searchqa/default.yaml \
  --skill outputs/my_run/best_skill.md \
  --split valid_unseen \
  --split_dir /path/to/searchqa_split \
  --azure_openai_endpoint https://your-resource.openai.azure.com/

An optional WebUI can be launched for monitoring:

pip install -e ".[webui]"
python -m skillopt_webui.app --share

The default port is 7860; it can be changed, and a public share link can be created.

Links

Project page: https://microsoft.github.io/SkillOpt/

Paper: https://arxiv.org/abs/2605.23904

Code: https://github.com/microsoft/SkillOpt

Demo video: https://youtu.be/JUBMDTCiM0M

As AI agents shift from assistant to worker roles, the bottleneck moves from knowledge to procedural capability—how to use tools, inspect intermediate states, and recover from failures. Explicitly writing these capabilities as trainable, inspectable, and transferable skill documents may be more engineering‑friendly than embedding them directly in model weights.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Transfer Learning LLM agents parameter-efficient fine-tuning benchmark evaluation skill documents SkillOpt

Written by

AI Engineering

Focused on cutting‑edge product and technology information and practical experience sharing in the AI field (large models, MLOps/LLMOps, AI application development, AI infrastructure).

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.