Why Handwritten SKILL.md Fails: SkillOpt Trains Prompts and Wins All 52 Comparisons

Microsoft's new SkillOpt paper shows that treating a hand‑written SKILL.md file as trainable parameters and iterating it 50+ times outperforms every human‑crafted version across 52 comparisons, delivering up to 24.8‑point gains in Claude Code, GPT‑5.5, and Codex environments.

Code Mala Tang

Microsoft Research recently released a paper titled SkillOpt that challenges the common practice of manually writing SKILL.md files for AI agents. The authors demonstrate that a machine‑trained SKILL.md, obtained through 50+ optimization rounds, beats every human‑authored version in all 52 evaluated cases (six benchmarks, seven target models).

1. The Flaw in Manual Prompt Writing

Most engineers create SKILL.md by intuition: write a draft, run a few examples, add or delete rules when something looks wrong, and repeat until it "feels right." This process lacks an objective function, so developers cannot tell whether a change improves or harms performance.

"Most engineers hand‑craft agent skill documents based on experience and intuition, but the skill document itself should be trained like a parameter," the researchers note.

Consequently, prompt engineering remains a "craft" rather than an engineering discipline: without a measurable objective, there is nothing to optimize against.

2. What SkillOpt Actually Does

The core idea is to treat the SKILL.md file as if it were neural‑network weights and apply gradient‑descent‑style updates, except that the "gradients" are expressed in natural language.

Step 1 – Optimizer reads the SKILL.md

In each iteration the optimizer proposes edits of three kinds: add a new rule, delete a redundant rule, or replace an inaccurate rule.
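The three edit types can be pictured as operations on a list of rules. The sketch below is only an illustration: the `Edit` class, the `apply_edit` helper, and the sample rules are hypothetical names chosen here, not taken from the SkillOpt paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edit:
    """One proposed change to the rule list (hypothetical representation)."""
    op: str                      # "add", "delete", or "replace"
    index: Optional[int] = None  # position of the rule for delete/replace
    text: Optional[str] = None   # new rule text for add/replace


def apply_edit(rules: List[str], edit: Edit) -> List[str]:
    """Return a new rule list with the edit applied; the input is not mutated."""
    new_rules = list(rules)
    if edit.op == "add":
        new_rules.append(edit.text)
    elif edit.op == "delete":
        del new_rules[edit.index]
    elif edit.op == "replace":
        new_rules[edit.index] = edit.text
    else:
        raise ValueError(f"unknown edit op: {edit.op}")
    return new_rules


# Made-up example rules, purely for illustration.
rules = [
    "Always run tests before committing.",
    "Use tabs.",
    "Use tabs for indentation.",
]
rules = apply_edit(rules, Edit("delete", index=1))  # drop the redundant rule
rules = apply_edit(rules, Edit("replace", index=1, text="Use 4 spaces for indentation."))
rules = apply_edit(rules, Edit("add", text="Prefer small, focused commits."))
```

Returning a fresh list instead of mutating in place keeps the previous version available, which makes a later rollback trivial.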

Step 2 – Validation on a held‑out set

After each edit the system runs a suite of held‑out test cases; if the edit does not hold up on this validation set, it is rolled back. Every change is therefore judged by a quantitative metric rather than by intuition.
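A minimal sketch of this accept-or-rollback step, under several assumptions: the acceptance criterion is taken to be "the validation score strictly improves", the candidates are a fixed list rather than being generated from the current best version, and `toy_score` is a keyword-matching stand-in for actually running an agent on held-out cases. None of these names come from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Rules = List[str]

# Hypothetical held-out "cases": each passes if its keyword appears in a rule.
VALIDATION_CASES = ["tests", "indentation", "commits"]


def toy_score(rules: Rules) -> float:
    """Stand-in for a real evaluation run: fraction of cases covered."""
    text = " ".join(rules).lower()
    hits = sum(1 for case in VALIDATION_CASES if case in text)
    return hits / len(VALIDATION_CASES)


def optimize(
    rules: Rules,
    candidates: Sequence[Rules],
    score: Callable[[Rules], float],
) -> Tuple[Rules, float]:
    """Keep a candidate only if it beats the current score; otherwise roll back."""
    best, best_score = list(rules), score(rules)
    for candidate in candidates:
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = list(candidate), candidate_score
        # else: rollback, meaning `best` is simply left unchanged
    return best, best_score


start = ["Always run tests."]
candidates = [
    ["Always run tests.", "Use 4 spaces for indentation."],  # improves: kept
    ["Use 4 spaces for indentation."],                       # regresses: rolled back
    ["Always run tests.", "Use 4 spaces for indentation.", "Keep commits small."],
]
best, best_score = optimize(start, candidates, toy_score)
```

In a real setup the scorer would run the agent on each validation case, and each candidate would be proposed from the current best version rather than drawn from a fixed list.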

Step 3 – Controlling aggressiveness with a "text learning rate"

High learning rate → large changes per round

Low learning rate → small, incremental tweaks

The method even encodes concepts such as batch size and momentum in textual form.
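One way to picture a "text learning rate" is as a cap on how much of the document the optimizer may change in a single round, passed to it as a plain-language instruction. This mapping is an assumption made here for illustration; the article does not say how SkillOpt defines it, and `edit_budget` and `optimizer_instruction` are hypothetical helpers.

```python
import math


def edit_budget(num_rules: int, learning_rate: float) -> int:
    """Translate a learning rate in (0, 1] into a maximum number of rule edits."""
    if not 0.0 < learning_rate <= 1.0:
        raise ValueError("learning_rate must be in (0, 1]")
    # Always allow at least one edit so that a round can make progress.
    return max(1, math.floor(learning_rate * num_rules))


def optimizer_instruction(num_rules: int, learning_rate: float) -> str:
    """Phrase the budget as text an LLM-based optimizer could be given."""
    budget = edit_budget(num_rules, learning_rate)
    return f"You may change at most {budget} of the {num_rules} rules in this round."


high = optimizer_instruction(20, 0.5)   # large changes per round
low = optimizer_instruction(20, 0.05)   # small, incremental tweaks
```

Under the same assumption, a "batch size" would be the number of validation cases the optimizer sees before it proposes an edit.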

3. Empirical Results

Across six benchmarks and seven target models, the machine‑trained SKILL.md outperformed the human‑written versions in all 52 comparisons and also beat prior methods such as GEPA and TextGrad. Specific score improvements include:

+19.1 points in the Claude Code environment

+23.5 points in direct GPT‑5.5 dialogue

+24.8 points in the Codex loop

[Figure: handwritten vs. trained]

4. Why Machines Win

The authors identify three fundamental issues with manual SKILL.md:

Human test cases cover only 5‑10 typical scenarios, while the optimizer evaluates on hundreds of unseen cases each round, achieving far broader coverage.

Engineers are reluctant to delete rules, causing SKILL.md files to bloat; the optimizer freely removes or replaces rules when validation fails.

Developers cannot reliably judge the usefulness of individual rules; the optimizer’s ablation experiments on the validation set directly reveal which rules should stay or be removed.
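The third point, judging individual rules by ablation, can be sketched as a leave-one-out check: remove each rule in turn, re-score, and see whether the score drops. The scorer below is a keyword-matching stand-in for a real validation run, and all names and sample rules are hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical validation cases: each passes if its keyword appears in a rule.
CASES = ["tests", "indentation"]


def passed_cases(rules: List[str]) -> int:
    """Stand-in for a real validation run: number of cases that pass."""
    text = " ".join(rules).lower()
    return sum(1 for case in CASES if case in text)


def ablate(
    rules: List[str],
    score: Callable[[List[str]], int],
) -> List[Tuple[str, int]]:
    """For each rule, report the score change when that rule alone is removed."""
    base = score(rules)
    report = []
    for i, rule in enumerate(rules):
        without = rules[:i] + rules[i + 1:]
        report.append((rule, score(without) - base))
    return report


rules = ["Always run tests.", "Be excellent.", "Use 4 spaces for indentation."]
report = ablate(rules, passed_cases)

# A rule whose removal does not lower the score is a candidate for deletion.
removable = [rule for rule, delta in report if delta >= 0]
```

Leave-one-out ablation ignores interactions between rules, so it serves as a first filter rather than a final verdict.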

[Figure: three problems]

5. Implications for Claude Code and Similar Tools

Short term: No immediate change; SkillOpt remains a research paper without an open‑source release.

Mid term (6‑12 months): Tools that automate similar optimization cycles are likely to appear, allowing users to provide an initial SKILL.md, run dozens of iterations, and receive an optimized version with performance metrics (e.g., accuracy rising from 65% to 84%).

Long term: Manual prompt writing is expected to disappear, analogous to the shift from hand‑written SQL to ORMs or from raw HTML to component frameworks.

6. Practical Steps You Can Take Now

Build your own validation set: Collect 20‑30 real agent cases, run them after each SKILL.md edit, and record pass rates. Even a simple spreadsheet of pass rates gives you an objective signal that intuition cannot.

Learn to delete, not just add: Before adding a rule, ask whether it is truly useful or merely plausible, and look for existing rules that can be removed. Keep the SKILL.md short and sharp.

Explore emerging prompt‑optimization tools: Projects such as TextGrad, DSPy, GEPA, and EvoSkill are already implementing concepts similar to SkillOpt. Early adoption can provide a head‑start.
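The first step above, recording pass rates after each SKILL.md edit, needs very little tooling. The sketch below logs one row per revision in CSV form; the revision labels and pass/fail results are made-up placeholders, and the code writes to an in-memory buffer where a real log would use a file.

```python
import csv
import io
from typing import List


def pass_rate(results: List[bool]) -> float:
    """Fraction of validation cases that passed."""
    if not results:
        raise ValueError("no results to score")
    return sum(results) / len(results)


def log_run(writer, revision: str, results: List[bool]) -> None:
    """Append one row: revision label, case count, passes, pass rate."""
    writer.writerow([revision, len(results), sum(results), f"{pass_rate(results):.2f}"])


buffer = io.StringIO()  # swap for open("skill_runs.csv", "a", newline="") in practice
writer = csv.writer(buffer)
writer.writerow(["revision", "cases", "passed", "pass_rate"])

# Placeholder outcomes for two revisions of a SKILL.md on 20 cases.
log_run(writer, "v1", [True] * 13 + [False] * 7)
log_run(writer, "v2", [True] * 17 + [False] * 3)

print(buffer.getvalue())
```

With 20 to 30 cases a single flipped result moves the pass rate by several points, so small differences between revisions should be read with caution.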

7. Closing Thoughts

The author reflects that writing SKILL.md felt like an "engineer virtue" at first, but now sees it as a transitional phase of AI programming. History suggests that any process that can be automated, such as hand‑written SQL or raw CSS, eventually is. SkillOpt may represent the watershed moment for prompt engineering.

What’s your experience with SKILL.md? Which part gives you the most headache?
