How to Build Reliable Agent Skills: Standards, Construction & Design Patterns

This guide explains the formal specification for Agent Skills, the three‑layer progressive loading mechanism, best‑practice SKILL.md structure, trigger design, a full development lifecycle with automated evaluation agents, and five proven design patterns for creating robust, reusable AI agent capabilities.

ITPUB

Jun 13, 2026

1. Skill Specification Standards

A Skill is a structured behavior package consisting of a SKILL.md file (YAML front‑matter + Markdown body) and optional scripts/, references/, assets/ directories.

Three‑layer progressive loading reduces token usage:

L1 (Directory layer): only name and description (≈50-100 tokens) are loaded at agent start-up.

L2 (Instruction layer): the full SKILL.md body is loaded when the skill is matched (recommended under 5,000 tokens).

L3 (Resource layer): files in scripts/, references/, assets/ are loaded on demand.
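The three layers can be sketched as a lazy loader. This is a minimal illustration, not the loader any particular agent runtime uses: `LazySkill` and `split_front_matter` are hypothetical names, and the front-matter parsing only handles flat `key: value` lines.

```python
from pathlib import Path
from typing import Dict, Optional, Tuple


def split_front_matter(text: str) -> Tuple[Dict[str, str], str]:
    """Split SKILL.md text into (flat front-matter dict, Markdown body)."""
    _, front, body = text.split("---", 2)
    meta: Dict[str, str] = {}
    for line in front.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()


class LazySkill:
    """Hypothetical loader that mirrors the L1/L2/L3 layers."""

    def __init__(self, root: Path):
        self.root = root
        meta, _ = split_front_matter(self._read("SKILL.md"))
        # L1: only name and description are kept at start-up.
        self.name = meta["name"]
        self.description = meta["description"]
        self._body: Optional[str] = None

    def _read(self, relative_path: str) -> str:
        return (self.root / relative_path).read_text(encoding="utf-8")

    def instructions(self) -> str:
        # L2: the body is read only once the skill has been matched.
        if self._body is None:
            _, self._body = split_front_matter(self._read("SKILL.md"))
        return self._body

    def resource(self, relative_path: str) -> str:
        # L3: files under scripts/, references/, assets/ are read on demand.
        return self._read(relative_path)
```

An agent would build one `LazySkill` per folder at start-up, call `instructions()` only for the skill it selects, and call `resource()` only when the instructions point at a specific file.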

Name field (name)

name:
  - 1‑64 characters
  - lower‑case letters, digits, hyphens only
  - cannot start or end with a hyphen
  - no consecutive hyphens
  - must match the folder name

Description field (description)

description:
  - 1‑1024 characters
  - clearly state the skill’s purpose and when to use it
  - include task‑specific keywords for model‑driven activation
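The length and character rules above are mechanical enough to check in code. The sketch below is one possible validator; `validate_front_matter` is a hypothetical helper, and it does not attempt to judge whether the description states a clear purpose.

```python
import re
from typing import List

# Runs of lower-case letters or digits separated by single hyphens.
# This shape rules out leading, trailing, and consecutive hyphens.
NAME_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def validate_front_matter(name: str, description: str, folder_name: str) -> List[str]:
    """Return a list of rule violations; an empty list means the fields pass."""
    errors: List[str] = []
    if not 1 <= len(name) <= 64:
        errors.append("name must be 1-64 characters")
    if not NAME_RE.fullmatch(name):
        errors.append("name allows only lower-case letters, digits, and single inner hyphens")
    if name != folder_name:
        errors.append("name must match the folder name")
    if not 1 <= len(description) <= 1024:
        errors.append("description must be 1-1024 characters")
    return errors
```

For example, `validate_front_matter("pdf--processing", "Use when ...", "pdf--processing")` reports the hyphen rule, because the pattern accepts only single hyphens between alphanumeric runs.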

Minimal example

---
name: skill-name
description: A description of what this skill does and when to use it.
---

Full example with optional fields

---
name: pdf-processing
description: Extract text and tables from PDF files, fill PDF forms, and merge multiple PDFs. Use when working with PDF documents.
license: Apache-2.0
metadata:
  author: example-org
  version: "1.0"
---
# PDF Processing
## When to use this skill
... (instructions, examples, error handling) ...

Recommended directory layout (depth ≤1)

skill-name/
├── SKILL.md
├── scripts/      # optional executable code (Python, Bash, JS)
├── references/   # optional supplemental docs
└── assets/       # static resources (templates, images, data files)
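A small script can create this layout so that new skills start from a consistent skeleton. This is a sketch under the assumption that an empty-bodied SKILL.md is an acceptable starting point; `scaffold_skill` is a hypothetical helper, not part of any official tooling.

```python
from pathlib import Path


def scaffold_skill(parent: Path, name: str, description: str) -> Path:
    """Create skill-name/ with SKILL.md and the three optional directories."""
    root = parent / name
    for sub in ("scripts", "references", "assets"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    # The folder name and the `name` field are kept identical on purpose.
    front_matter = f"---\nname: {name}\ndescription: {description}\n---\n"
    (root / "SKILL.md").write_text(front_matter, encoding="utf-8")
    return root
```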

2. Skill‑Creator Core Ideas

The Skill-Creator (Anthropic's official skill for building other skills) treats skill development like software engineering: it defines training and validation sets, evaluation metrics, CI/CD-style testing, and safeguards against over-fitting.

Six‑stage development lifecycle

Requirement capture: clarify intent, trigger scenarios, output format, objective vs. subjective outcomes.

Skill authoring: create SKILL.md, optional scripts and resources.

Test execution: design 2-3 test cases, run parallel with-skill and without-skill agents (A/B testing), draft quantitative assertions, capture timing data.

Evaluation & review: Grader scores assertions, Analyzer explains why one output is better, Comparator performs blind comparison.

Iterative improvement: incorporate Analyzer feedback, rewrite skill, repeat testing.

Optimization & publishing: run description optimization (auto-tune trigger phrasing), package as .skill file.
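The test-execution stage can be pictured as a small A/B harness. The sketch below is illustrative only: `run_ab` is a hypothetical function, the two agents are stand-in callables, and the runs are sequential for simplicity where the Skill-Creator runs them in parallel.

```python
import time
from typing import Any, Callable, Dict, List

Agent = Callable[[str], str]


def run_ab(cases: List[str], with_skill: Agent, without_skill: Agent) -> List[Dict[str, Any]]:
    """Run every test prompt through both agents and record output and timing."""
    results: List[Dict[str, Any]] = []
    for prompt in cases:
        row: Dict[str, Any] = {"prompt": prompt}
        for label, agent in (("with_skill", with_skill), ("without_skill", without_skill)):
            start = time.perf_counter()
            output = agent(prompt)
            row[label] = {"output": output, "seconds": time.perf_counter() - start}
        results.append(row)
    return results
```

Each row keeps both outputs side by side, which is the shape the later grading and comparison steps need.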

Evaluation agents

Grader: checks whether assertions pass, flags weak or missing evidence, assigns PASS/FAIL.

Comparator: blind-compares outputs of two agents (with vs. without skill) to avoid bias.

Analyzer: after comparison, explains why one output is better, categorises issues (content, structure, compliance), and suggests improvements.
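The Grader and Comparator roles can be reduced to two small functions to show the idea. This is a sketch, not the Skill-Creator's implementation: `grade` and `blind_compare` are hypothetical names, assertions are plain Python predicates, and the judge is any callable that sees two anonymous outputs and answers "A" or "B".

```python
import random
from typing import Callable, Dict, List, Tuple

Assertion = Tuple[str, Callable[[str], bool]]


def grade(output: str, assertions: List[Assertion]) -> Dict[str, str]:
    """Grader: mark each named assertion PASS or FAIL for one output."""
    return {label: ("PASS" if check(output) else "FAIL") for label, check in assertions}


def blind_compare(
    with_skill: str,
    without_skill: str,
    judge: Callable[[str, str], str],
    rng: random.Random,
) -> str:
    """Comparator: shuffle the two outputs so the judge cannot tell which used the skill."""
    pairs = [("with_skill", with_skill), ("without_skill", without_skill)]
    rng.shuffle(pairs)
    (label_a, out_a), (label_b, out_b) = pairs
    verdict = judge(out_a, out_b)  # the judge only ever sees "A" and "B"
    return label_a if verdict == "A" else label_b
```

Shuffling before judging is what makes the comparison blind: the mapping from "A"/"B" back to with-skill or without-skill is resolved only after the verdict is in.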

Data flow JSON files (stored under the skill directory): evals.json, timing.json, metrics.json, grading.json, benchmark.json, comparison.json, analysis.json, history.json.

Practical example: building a Code‑Review skill

Prompt Claude: “I want to create a code‑review skill that analyses a Git diff and produces a severity‑ranked report.”

Claude generates a draft SKILL.md with name, description, and a placeholder script skeleton.

Define test cases (e.g., reviewing a PR that adds authentication, checking a database migration script).

Run parallel agents, collect outputs, grade with Grader, compare with Comparator, view results in Eval Viewer, submit feedback via feedback.json.

Iterate: refine the description, re-run optimization (python -m scripts.run_loop --model claude-haiku …), then package the skill (python -m scripts.package_skill path/to/code-review).

Known limitations (community feedback)

High token consumption during description optimization (reportedly up to 60% of a 5-hour quota). Community members suggest running it with a cheaper model such as claude-haiku.

Complex workflow with many confirmation steps can be cumbersome for simple skills.

Scalability: each test case spawns many sub‑agents, leading to queueing under concurrency limits.

Operational skills that merely wrap existing tool capabilities may never be triggered because the model can perform the task directly.

Skill bloat: iterative improvements can inflate skill size, violating the “keep it minimal” principle.

Steep learning curve: users must understand three‑layer loading, JSON schema hierarchy, and evaluation metrics.

3. Writing‑Skills Core Ideas

Writing‑Skills is a meta‑skill that teaches agents how to create new Skills. It follows a Test‑Driven Development (TDD) loop: RED (baseline failure), GREEN (minimal skill), REFACTOR (fix failures).

Key components

RED phase: run the task without the skill, capture failure modes and rationalisations.

GREEN phase: implement the minimal skill that addresses the specific failure.

REFACTOR phase: add explanations, improve robustness, and iterate.

Description best practice: describe only the trigger conditions; never summarise the internal workflow. When a description summarises the workflow, the agent may act on that summary and skip the body; a concise, trigger-only description (~50 tokens) forces it to read the full skill.
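The RED, GREEN, REFACTOR cycle can be written down as a short control loop. The sketch below is only a way to see the order of operations: `tdd_skill_loop` and its three callables are hypothetical, and in practice each of them stands for an agent run or a human edit rather than a Python function.

```python
from typing import Callable, List, Optional


def tdd_skill_loop(
    run_task: Callable[[Optional[str]], List[str]],
    write_minimal_skill: Callable[[List[str]], str],
    refactor_skill: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the skill text, or None when the baseline already passes."""
    failures = run_task(None)  # RED: run without any skill and record failures
    if not failures:
        return None  # nothing failed, so there is nothing for a skill to fix
    skill = write_minimal_skill(failures)  # GREEN: smallest skill for those failures
    for _ in range(max_rounds):
        failures = run_task(skill)
        if not failures:
            break
        skill = refactor_skill(skill, failures)  # REFACTOR: address what still fails
    return skill
```

Starting from the no-skill baseline keeps the skill honest: it is written only against failures that were actually observed.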

Good vs. bad description example

# Bad – includes workflow
description: Use this skill when executing plans — dispatch sub‑agent per task with code review.

# Good – only trigger conditions
description: Use when executing implementation plans with independent tasks in the current session.

Degrees of freedom for skill instructions

High freedom – multiple valid implementations (e.g., code‑review workflow).

Medium freedom – preferred pattern with allowed variations (e.g., parameterised script templates).

Low freedom – strict, safety‑critical commands (e.g., database migration).

4. Skill Design Patterns (Google)

Google’s ADK team identified five recurring patterns for structuring Skills.

Tool Wrapper

Wraps external reference files (e.g., coding standards) and loads them only when needed.

---
name: api-expert
description: FastAPI best‑practice guide.
---
## Core rules
Load 'references/conventions.md' for the full list.
## Review steps
1. Load conventions
2. Check user code against each rule
3. Report violations with concrete suggestions

Generator

Template‑driven document generator that first asks the user for missing information.

---
name: report-generator
description: Generate a structured markdown report.
---
Step 1: Load style guide.
Step 2: Load report template.
Step 3: Ask user for topic, key findings, audience.
Step 4: Fill template.
Step 5: Return report.
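The Generator steps above can be sketched in Python with the standard-library `string.Template`. The template text, the field names, and the functions `missing_fields` and `generate_report` are all assumptions made for illustration; a real skill would load its template from assets/.

```python
from string import Template
from typing import Dict, List

# Hypothetical stand-in for a template file stored in the skill's assets/ folder.
REPORT_TEMPLATE = Template("# $topic\n\nAudience: $audience\n\n## Key findings\n$key_findings\n")
REQUIRED_FIELDS = ("topic", "key_findings", "audience")


def missing_fields(answers: Dict[str, str]) -> List[str]:
    """List the fields the user still has to supply."""
    return [name for name in REQUIRED_FIELDS if not answers.get(name)]


def generate_report(answers: Dict[str, str]) -> str:
    """Fill the template, refusing to proceed while information is missing."""
    missing = missing_fields(answers)
    if missing:
        raise ValueError("ask the user for: " + ", ".join(missing))
    return REPORT_TEMPLATE.substitute(answers)
```

Checking for missing fields before substitution mirrors Step 3: the generator asks first and fills the template only once every slot has a value.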

Reviewer

Separates “what to check” from “how to check”: the skill loads a checklist, evaluates the user's code against it, and explains why each violation matters.

---
name: code-reviewer
description: Review Python code for quality, style, and common errors.
---
1. Load 'references/review-checklist.md'.
2. Scan user code.
3. For each violation record line, severity, reason, and fix suggestion.
4. Output structured report.
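The split between checklist and checking logic can be illustrated with rules held as data and a generic scan loop. Everything here is a simplified assumption: the `Rule` fields, the two example rules, and regex matching stand in for the checklist file and the model's own judgement.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Rule:
    pattern: str   # what to check (data, could live in references/)
    severity: str
    reason: str    # why the violation matters
    fix: str


# Hypothetical checklist entries for illustration only.
CHECKLIST = [
    Rule(r"\bexcept\s*:", "high", "bare except hides unexpected errors", "catch a specific exception type"),
    Rule(r"\bprint\(", "low", "print calls bypass logging configuration", "use the logging module"),
]


def review(code: str, checklist: List[Rule]) -> List[Dict[str, Union[int, str]]]:
    """How to check: scan each line against every rule and record findings."""
    findings: List[Dict[str, Union[int, str]]] = []
    for lineno, line in enumerate(code.splitlines(), start=1):
        for rule in checklist:
            if re.search(rule.pattern, line):
                findings.append(
                    {"line": lineno, "severity": rule.severity, "reason": rule.reason, "fix": rule.fix}
                )
    return findings
```

Because the rules are plain data, swapping in a different checklist changes what is reviewed without touching the `review` function.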

Inversion

Starts by asking the user a series of clarifying questions before any execution.

---
name: project-planner
description: Collect requirements before planning.
---
Phase 1 – Problem exploration
- What problem does the project solve?
- Who are the users?
- Expected scale?
Phase 2 – Technical constraints
- Deployment environment?
- Preferred tech stack?
- Non‑negotiable requirements?
Phase 3 – Synthesize & output plan
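The Inversion pattern amounts to a gate: planning is refused until every question has an answer. The sketch below encodes the phases above as data; `next_question` and `plan` are hypothetical helpers, and the "plan" is just an echo of the answers.

```python
from typing import Dict, Optional

PHASES = {
    "problem exploration": [
        "What problem does the project solve?",
        "Who are the users?",
        "Expected scale?",
    ],
    "technical constraints": [
        "Deployment environment?",
        "Preferred tech stack?",
        "Non-negotiable requirements?",
    ],
}


def next_question(answers: Dict[str, str]) -> Optional[str]:
    """Return the first unanswered question in phase order, or None when done."""
    for questions in PHASES.values():
        for question in questions:
            if not answers.get(question):
                return question
    return None


def plan(answers: Dict[str, str]) -> str:
    """Phase 3: only synthesize once both question phases are complete."""
    pending = next_question(answers)
    if pending is not None:
        raise RuntimeError(f"cannot plan yet; still need: {pending}")
    return "\n".join(f"- {question} {answer}" for question, answer in answers.items())
```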

Pipeline

A strict multi‑step workflow with mandatory checkpoints; later steps cannot proceed until earlier ones are confirmed.

---
name: doc-pipeline
description: Generate API docs from Python source.
---
1. Parse source, list public APIs, ask user to confirm completeness.
2. Generate missing docstrings, wait for user approval.
3. Assemble documentation using a template.
4. Run quality checks; fix any issues before final output.
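The checkpoint rule, that a later step cannot run until the earlier one is confirmed, can be modelled as a tiny state machine. `CheckpointPipeline` is a hypothetical class for illustration; in a real skill the confirmation comes from the user in conversation rather than from a method call.

```python
from typing import Any, Callable, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class CheckpointPipeline:
    """Run steps in order; each result must be confirmed before the next step."""

    def __init__(self, steps: List[Step]):
        self.steps = steps
        self.outputs: List[Any] = []
        self.confirmed = 0  # number of step outputs the user has approved

    def run_next(self) -> Tuple[str, Any]:
        if self.confirmed < len(self.outputs):
            raise RuntimeError("previous step has not been confirmed")
        if len(self.outputs) == len(self.steps):
            raise RuntimeError("pipeline already finished")
        name, func = self.steps[len(self.outputs)]
        previous = self.outputs[-1] if self.outputs else None
        self.outputs.append(func(previous))  # each step receives the prior output
        return name, self.outputs[-1]

    def confirm(self) -> None:
        self.confirmed = len(self.outputs)
```

Calling `run_next()` twice in a row raises an error, which is the point: the only way forward is through `confirm()`.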

Pattern selection guide

Technical‑specific expertise → Tool Wrapper.

Consistent structured output → Generator.

Automated review → Reviewer.

Unclear requirements → Inversion.

Complex multi‑step tasks → Pipeline.

Pattern combinations (examples)

Pipeline + Reviewer: add a final automated review step to a multi‑stage pipeline.

Generator + Inversion: collect user data first, then fill a template.

Pipeline + Tool Wrapper: load expert knowledge at a specific pipeline stage.

Inversion + Pipeline: gather requirements before entering the execution pipeline.

5. Takeaways & Resources

Key takeaways

Skill ≠ Prompt: a Skill is a reusable, structured behavior package, not a static prompt.

Progressive loading is essential to keep token usage low while preserving full functionality.

Descriptions drive activation; keep them concise and focused on trigger conditions.

Follow the six‑stage CI/CD‑style lifecycle for reliable development.

Use the five design patterns to match the problem domain and desired workflow.
