How to Build a Skills Engineering System for AI Agents from Scratch

When AI agents ignore the rules you wrote, the problem isn’t the prompts but the lack of a systematic Skills Engineering framework; this guide walks you through designing, looping, testing, versioning, and scaling reusable AI Skills so teams can reliably embed AI into their development pipelines.

Frontend AI Walk

Jun 20, 2026

Introduction

You spend hours crafting a Skill for an AI coding assistant, only to see the AI deviate, skip safety checks, hallucinate, or rewrite code outside the scope. The root cause is not the prompt but the absence of a full Skills Engineering system that makes the Skill designable, testable, iterable, composable, and shareable.

What Skills Engineering Is

A simple Skill consists of trigger + process + output constraints. An engineered Skill system adds five dimensions:

Skills Engineering = designable rules + verifiable loop + repeatable iteration + composable chain + shareable assets

Engineering Layers

Temporary Prompt: Written directly in the chat. Issue – not reusable or auditable. Engineering – extract to SKILL.md.

Single Skill: Used locally. Issue – unstable triggers and unclear boundaries. Engineering – add verification loops and test cases.

Skill Collection: Multiple Skills grouped together. Issue – overlapping responsibilities and competition. Engineering – define directory conventions and composition protocols.

Team Engineering System: Shared across the team. Issue – experience is not captured, quality cannot evolve. Engineering – manage with Git, version releases, and Bad Case feedback loops.

Design Principles (Single Responsibility, Progressive Disclosure, Appropriate Freedom)

Anthropic’s guide stresses “keep it minimal”. The three iron rules are:

Rule 1: Single Responsibility

spec-writer   – write specs
task-planner  – break tasks
code-reviewer – review code
release-checklist – release checks
team-weekly-report – generate weekly report

Complex workflows are built from multiple small Skills rather than one giant Skill.

Rule 2: Progressive Disclosure

SKILL.md is the table of contents, not the whole book. Core commands stay in the main file (under 500 lines); detailed references live in sub-files that the AI loads on demand.

code-reviewer/
├── SKILL.md          # core commands (<500 lines)
├── security-checks.md   # optional security checks
├── performance.md       # optional performance checks
└── examples/
    ├── good-review.md   # good example
    └── bad-review.md    # bad example
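The line budget and the split into on-demand sub-files can be checked mechanically. The following is a minimal sketch, not part of any tool's API: the function name `audit_skill_dir` and the shape of its report are hypothetical, and the 500-line limit is simply the guideline quoted above.

```python
from pathlib import Path

MAX_CORE_LINES = 500  # budget taken from the "<500 lines" guideline above


def audit_skill_dir(skill_dir: Path, max_lines: int = MAX_CORE_LINES) -> dict:
    """Report whether SKILL.md fits the line budget and which sub-files exist."""
    core = skill_dir / "SKILL.md"
    if not core.is_file():
        return {"ok": False, "reason": "SKILL.md missing",
                "core_lines": 0, "references": []}

    core_lines = len(core.read_text(encoding="utf-8").splitlines())

    # Every other Markdown file counts as an on-demand reference.
    references = sorted(
        p.relative_to(skill_dir).as_posix()
        for p in skill_dir.rglob("*.md")
        if p != core
    )

    ok = core_lines < max_lines
    reason = "" if ok else f"SKILL.md has {core_lines} lines (limit {max_lines})"
    return {"ok": ok, "reason": reason,
            "core_lines": core_lines, "references": references}
```

A check like this could run in CI on every Skill PR, so that a growing SKILL.md is caught before it turns back into "the whole book".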

Rule 3: Give AI the Right Degree of Freedom

High (free‑form text): suitable when many approaches work, e.g., code review that adapts to context.

Medium (pseudo‑code/template): suitable when a preferred pattern exists, e.g., report generation using a template.

Low (precise script): suitable for fragile operations that must be exact, e.g., database migration steps.

Think of AI as walking a narrow bridge (low freedom) or an open plain (high freedom).

Loop Verification – Adding a “Brake System”

Normal Skills tell the AI *what* to do; Loop‑enhanced Skills also dictate *how* to do it, including pre‑anchor, step‑by‑step execution, self‑validation, and corrective actions.

Example – Code Review Skill:

## Code Review Process
1. Analyze code structure
2. Check potential bugs
3. Suggest improvements
4. Verify compliance

Loop‑enhanced version:

## Code Review Process

### L0: Anchor (must run before any changes)
- [ ] Read all files in the PR, record exports and signatures
- [ ] Confirm lint rules and style guide
- [ ] Identify impact modules

### L1: Execute (ordered steps)
1. Review interface contracts (params, returns, error codes)
2. Review business logic (edge cases, exception paths)
3. Review style (naming, comments, formatting)

### L2: Validation Checklist
| ID | Check Item | Expected | Result | Evidence |
|----|------------|----------|--------|----------|
| V-1 | Interface types match docs | Match | _ | _ |
| V-2 | All catch blocks have error handling | No empty catch | _ | _ |
| V-3 | No new lint violations | Lint passes | _ | _ |

### L3: Fix Rules
- Only fix FAIL items; do not touch PASS items
- If the same item fails twice, pause and hand over to a human

## Exit Conditions
| Scenario | Action |
|----------|--------|
| All PASS | Output review report |
| After 3 rounds with FAIL | Output unresolved issues, hand over |

Key differences:

Pre-anchor: The AI must first read the current project state, which reduces hallucinated functions and signatures.

Self-validation: The AI must provide PASS/FAIL evidence (line numbers or snippets) for each checklist item.

Boundary-preserving fixes: Only failed items are modified; passed items stay untouched.
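The control flow of the loop (validate, fix only FAIL items, hand over on repeated failure or after three rounds) can be written down as ordinary code. This is a sketch under stated assumptions: in practice the AI follows these rules from SKILL.md rather than from a script, and the names `run_validation_loop`, `LoopResult`, `Check`, and `Fixer` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

MAX_ROUNDS = 3          # "After 3 rounds with FAIL" exit condition
MAX_FAILS_PER_ITEM = 2  # "same item fails twice" hand-over rule

Check = Callable[[], bool]    # returns True for PASS, False for FAIL
Fixer = Callable[[str], None]  # attempts to repair one failed check by ID


@dataclass
class LoopResult:
    status: str                      # "pass" or "handover"
    rounds: int
    unresolved: List[str] = field(default_factory=list)


def run_validation_loop(checks: Dict[str, Check], fix: Fixer) -> LoopResult:
    fail_counts = {check_id: 0 for check_id in checks}
    for round_no in range(1, MAX_ROUNDS + 1):
        # L2: run every checklist item and collect the FAIL ones.
        failed = [cid for cid, check in checks.items() if not check()]
        if not failed:
            return LoopResult("pass", round_no)

        for cid in failed:
            fail_counts[cid] += 1
        repeated = any(fail_counts[cid] >= MAX_FAILS_PER_ITEM for cid in failed)

        # Exit conditions: repeated failure or round budget exhausted.
        if repeated or round_no == MAX_ROUNDS:
            return LoopResult("handover", round_no, failed)

        # L3: only FAIL items are fixed; PASS items are never touched.
        for cid in failed:
            fix(cid)
    raise AssertionError("unreachable: the final round always returns")
```

Passing the fixer only the IDs of failed checks is what keeps fixes boundary-preserving: it never receives a passing item to modify.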

End‑to‑End Example: Building a spec-writer Skill

Goal: Transform vague requirements into a structured specification.

Step 1 – Design (apply Rules 1‑3)

---
name: spec-writer
description: "When a user provides vague requirements, product ideas, or feature thoughts, convert them into a structured spec document. Use when the user mentions writing a Spec, requirement clarification, or clarifying needs."
---
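Because the frontmatter decides when the Skill triggers, it is worth linting before review. The sketch below parses only the simple `key: value` form shown above, without a YAML library. The naming pattern and the description length limit are assumptions for illustration; check your tool's documentation for the real constraints. `parse_frontmatter` and `validate_frontmatter` are hypothetical helper names.

```python
import re

# Assumed conventions, not taken from the article: adjust to your tool.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")
MAX_DESCRIPTION_CHARS = 1024


def parse_frontmatter(text: str) -> dict:
    """Parse a leading '---' block of simple 'key: value' lines."""
    lines = text.strip().splitlines()
    if len(lines) < 2 or lines[0].strip() != "---":
        raise ValueError("missing opening '---'")
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        fields[key.strip()] = value.strip().strip('"')
    raise ValueError("missing closing '---'")


def validate_frontmatter(fields: dict) -> list:
    """Return a list of human-readable problems (empty means OK)."""
    problems = []
    name = fields.get("name", "")
    description = fields.get("description", "")
    if not NAME_PATTERN.match(name):
        problems.append(f"name {name!r} is not lowercase-hyphenated")
    if not description:
        problems.append("description is empty")
    elif len(description) > MAX_DESCRIPTION_CHARS:
        problems.append("description is too long")
    return problems
```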

Step 2 – Add Loop

L0 Anchor: Read existing interface docs and data models; list missing information if insufficient.

L1 Execute: Fill template sections – goal, scope, acceptance criteria, open questions.

L2 Validate: Check each acceptance criterion is testable and each open question has an owner.

L3 Fix: Only fix FAIL items; PASS items remain unchanged.

Step 3 – Write Tests & Run Baseline

Run the task without any Skill to see what the AI skips (usually impact analysis and open questions). Then write a minimal Skill that covers only those gaps, and create three test cases: normal input, missing context, and cross-module change.

Step 4 – Commit & Bad‑Case Feedback

Submit a PR with the new Skill. If the AI incorrectly marks an open question as completed, record it as a Bad Case and add a rule in SKILL.md prohibiting that behavior. Re-run the tests and confirm the issue no longer appears.

Full loop: Design → Loop → Test → Commit → Bad-Case Feedback → Retest.

Testing Iteration – Evaluation‑Driven Development

Anthropic recommends writing tests before the Skill.

Baseline

Run the task without any Skill and record which steps are skipped, which context is missing, and which outputs are unsatisfactory.

Minimal Skill

Write only the rules needed to fix the baseline failures; avoid “just in case” bloat.

Test Cases

Create an examples/ directory with three scenarios:

code-reviewer/
├── SKILL.md
├── examples/
│   ├── input-normal.md          # normal PR
│   ├── input-missing-context.md # missing info
│   └── input-cross-module.md   # front‑end + back‑end + DB change
└── tests/
    └── checklist.md            # verification checklist

Bad‑Case Recording

## Bad Case Record
### Input
User provides only a file path, no context.
### Wrong Output
AI skips L0 anchor, fabricates nonexistent dependencies.
### Expected Behavior
At L0, AI should list missing information and ask for clarification.
### Rule Modification
Add to SKILL.md L0: "If unable to read all related files, pause and output a missing‑info list."
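Keeping Bad Cases in a fixed shape makes them easier to collect and to turn into test examples later. As a small sketch, the record above can be modeled as a data class that renders the same four sections; the class name `BadCase` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BadCase:
    user_input: str
    wrong_output: str
    expected_behavior: str
    rule_modification: str

    def to_markdown(self) -> str:
        """Render the record using the section headings of the template."""
        sections = [
            ("Input", self.user_input),
            ("Wrong Output", self.wrong_output),
            ("Expected Behavior", self.expected_behavior),
            ("Rule Modification", self.rule_modification),
        ]
        lines = ["## Bad Case Record"]
        for title, body in sections:
            lines.append(f"### {title}")
            lines.append(body)
        return "\n".join(lines) + "\n"
```

Each rendered record could be saved next to the Skill's examples so that the fix and the case that motivated it are reviewed in the same PR.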

Iterate:

Run test → Capture Bad Case → Locate cause → Modify SKILL.md → Add example → Retest

Version Management – Treat Skills Like Code

All team‑level Skills must live in Git; this is non‑negotiable.

Recommended Directory Layout

ai-skills/
├── engineering/
│   ├── code-reviewer/
│   │   ├── SKILL.md
│   │   ├── security-checks.md
│   │   ├── examples/
│   │   └── tests/
│   ├── spec-writer/
│   ├── task-planner/
│   └── release-checklist/
├── product/
│   ├── requirement-clarifier/
│   └── user-story-writer/
└── operation/
    ├── team-weekly-report/
    └── incident-summary/

Map the structure to tool‑specific directories such as .cursor/rules/ or .claude/skills/.

PR Guidelines

## Background
Fix the "skip security check" issue in code review.
## Changes
- Add security check step in L1
- Add input-security-sensitive.md example
- Update test checklist
## Verification
- [ ] Normal PR flow unchanged
- [ ] PRs touching auth/crypto trigger security check
- [ ] Missing info triggers clarification

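Treat Skill changes like code changes. Pay particular attention to edits of the description field: an over-broad description triggers the Skill on unrelated requests (false positives), while an overly narrow one fails to trigger when the Skill is needed.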
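Descriptions that are too broad often show up as several Skills matching the same request. One rough way to flag candidates during review is to compare the keyword sets of all descriptions. This is a heuristic sketch only: the stopword list, the Jaccard similarity measure, and the 0.5 threshold are assumptions chosen for illustration, and a flagged pair still needs human judgment.

```python
import re
from itertools import combinations

# Small illustrative stopword list; extend it for real descriptions.
STOPWORDS = {"the", "and", "for", "when", "use", "user", "into", "them"}


def keywords(description: str) -> set:
    """Lowercase words longer than two letters, minus stopwords."""
    words = re.findall(r"[a-z]+", description.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def overlapping_pairs(descriptions: dict, threshold: float = 0.5) -> list:
    """Return (skill_a, skill_b, score) for description pairs above threshold."""
    pairs = []
    for (name_a, desc_a), (name_b, desc_b) in combinations(
        sorted(descriptions.items()), 2
    ):
        score = jaccard(keywords(desc_a), keywords(desc_b))
        if score >= threshold:
            pairs.append((name_a, name_b, round(score, 2)))
    return pairs
```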

Skill Composition – Making Multiple Skills Cooperate

When many Skills exist, define clear boundaries and use a top‑level orchestrating command instead of hard‑coded pipelines.

Principle 1: The description must delineate scope to avoid overlap.

Principle 2: Use a higher-level instruction to chain Skills, e.g.:

Please follow these steps:
1. Use spec-writer to clarify requirements and gaps.
2. Use task-planner to split into verifiable tasks.
3. After implementation, run code-reviewer for risk checks.
4. Finally, generate a weekly summary with team-weekly-report.

Principle 3: Simple tasks should not be over-orchestrated; a single Skill can handle a quick edit or button addition.
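The top-level instruction from Principle 2 is plain text, so a team can generate it from an ordered list of stages instead of retyping it. The sketch below only builds the prompt; it does not execute any Skill, which keeps the chain an instruction rather than a hard-coded pipeline. The function name `build_orchestration_prompt` and the check against a set of known Skill names are assumptions for illustration.

```python
from typing import List, Set, Tuple


def build_orchestration_prompt(
    steps: List[Tuple[str, str]], known_skills: Set[str]
) -> str:
    """Render ordered (skill, purpose) pairs as a numbered instruction."""
    unknown = [skill for skill, _ in steps if skill not in known_skills]
    if unknown:
        raise ValueError(f"unknown Skills: {', '.join(unknown)}")

    lines = ["Please follow these steps:"]
    for number, (skill, purpose) in enumerate(steps, start=1):
        lines.append(f"{number}. Use {skill} to {purpose}.")
    return "\n".join(lines)
```

Rejecting unknown Skill names early catches typos and references to Skills that were renamed or removed from the shared repo.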

Example command set from addyosmani/agent-skills:

/spec – requirement specification (first step)
/plan – task breakdown
/build – incremental implementation
/test – prove functionality with tests
/review – code quality check before merge
/code-simplify – reduce complexity
/ship – pre-release checklist

The value lies in giving AI a sense of “stage” rather than a single monolithic action.

Team Rollout – 30‑Day Roadmap

Week 1: Pick three high-frequency, repeatable scenarios (code review, release check, weekly report).

Week 2: Build a minimal Skill for each scenario, including description, usage, L0 anchor, execution flow, L2 checklist, prohibited actions, and at least two test cases.

Week 3: Collect Bad Cases using the prescribed template and add them as examples.

Week 4: Establish review rules (PR required, owner assigned, run ≥2 test cases per change, monthly cleanup, onboarding docs).

Newcomer Onboarding (Three Steps)

Step 1: Use three high‑frequency Skills to feel the benefit.

Step 2: Edit a rule by adding a real Bad Case, learning that Skills are editable contracts.

Step 3: Write a small, single‑purpose Skill for a personal frequent task and contribute it to the shared repo.

Pre‑Release Checklist (10 Items)

Specific description for a concrete scenario.

L0 anchor step present.

L2 validation checklist defined.

AI asks for missing information when needed.

Stable output format defined.

Prohibited actions listed.

At least two input examples, one covering an edge case.

Bad‑Case recording mechanism in place.

Skill stored in Git.

Owner assigned for maintenance.

If all items pass, the Skill is ready for team trial.
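The ten items can also be kept as data so that a review records which ones pass. This is a minimal sketch: the item strings mirror the checklist above, while the function name `review_readiness` and the rule that an unanswered item counts as failing are assumptions.

```python
from typing import Dict, List, Tuple

PRE_RELEASE_ITEMS: Tuple[str, ...] = (
    "Specific description for a concrete scenario",
    "L0 anchor step present",
    "L2 validation checklist defined",
    "AI asks for missing information when needed",
    "Stable output format defined",
    "Prohibited actions listed",
    "At least two input examples, one covering an edge case",
    "Bad-Case recording mechanism in place",
    "Skill stored in Git",
    "Owner assigned for maintenance",
)


def review_readiness(results: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Return (ready, failing_items); items missing from results count as failing."""
    failing = [item for item in PRE_RELEASE_ITEMS if not results.get(item, False)]
    return (not failing, failing)
```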

Layer Summary

Design Layer: Single responsibility, progressive disclosure, freedom matching.

Loop Layer: L0 anchor → L1 execute → L2 validate → L3 fix → exit conditions.

Iteration Layer: Evaluation-driven, minimal Skill, test cases, Bad-Case feedback.

Asset Layer: Git management, PR review, version release, owner governance.

Composition Layer: Clear boundaries, top-level orchestration, avoid over-orchestration for simple tasks.

30-Day Rollout: Scenario selection → minimal Skill → Bad-Case collection → review & release rules.

Final Thought

AI tools evolve rapidly—today it may be Codex or Cursor, tomorrow Claude Code, Gemini CLI, or a newer agent platform. What never changes is the need for teams to turn engineering experience into reusable, verifiable, iterative assets. A Skill is the interface exposing team knowledge to an AI Agent; Skills Engineering is the methodology that designs, tests, iterates, composes, and manages those interfaces.

