
How to Let AI Skills Self‑Improve: L4 Evolution Design Principles from Autoresearch

The article defines L4 evolution as human‑defined boundaries plus automated agent exploration, introduces three core design principles from Karpathy’s Autoresearch—single modification surface, fixed evaluation standards, and human‑set boundaries—and shows how to apply them with a seven‑step skill‑engineering pipeline, tool comparisons, and a ratchet keep/revert mechanism.

Frontend AI Walk

Jun 26, 2026

Definition of L4 Evolution

L4 evolution means humans define a boundary and a validation standard, then an agent explores automatically within that boundary. The agent edits SKILL.md and an Eval decides whether to KEEP the change (update the baseline) or REVERT it.

L3 vs L4 Core Difference

L3 Eval‑driven: a human edits SKILL.md, the Eval judges, and the human may edit again.

L4 Automatic Optimization: the agent edits SKILL.md, the Eval judges, and the human only designs the boundary.

The human role shifts from “editor” to “boundary designer”.

Three Tool Families

autoresearch – program.md + single‑file edit (train.py) + fixed‑time loop. Suited for research/experimental skills with a single clear metric and an overnight time budget.

Darwin – ratchet (keep/revert) + multi‑round experiments. Suited for production‑grade skills that already have an Eval gate.

skill‑creator – rubric scoring + suggestion generation. Suited for skill creation or human‑assisted editing.

Decision tree:

Your skill is?
├─ Research/experimental → autoresearch (fixed time + single metric)
├─ Production continuous iteration → Darwin (ratchet + regression gate)
└─ Needs human‑assisted optimization → skill‑creator (rubric + suggestion)

Core Design Principle 1: Single Modification Surface

Only one file is mutable by the agent.

In autoresearch, train.py is editable; prepare.py, program.md (human‑owned), and the infrastructure stay unchanged.

In skill engineering the editable surface is SKILL.md. All other files – evals.json, skill-issues.jsonl, scripts/*.py, references/*.md – are locked.

Benefits:

Diff is reviewable: limited change scope makes logs clear.

Clear boundary: the agent knows exactly what it may modify.

Safety: core infrastructure remains stable.
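One way to enforce a single modification surface is to inspect the list of files touched in each round and reject anything beyond SKILL.md. The sketch below is a minimal illustration; the function names `out_of_surface` and `surface_ok` are hypothetical, and it assumes the changed paths are given relative to the skill root.

```python
from pathlib import PurePosixPath

# Assumption: the only agent-editable file is SKILL.md at the skill root.
EDITABLE = {"SKILL.md"}


def out_of_surface(changed_paths):
    """Return the changed paths that fall outside the editable surface."""
    normalized = [str(PurePosixPath(p)) for p in changed_paths]
    return sorted(p for p in normalized if p not in EDITABLE)


def surface_ok(changed_paths):
    """True when every changed file is on the editable surface."""
    return not out_of_surface(changed_paths)


# A round that touched a locked file is flagged before any Eval runs.
violations = out_of_surface(["SKILL.md", "evals.json", "scripts/grade.py"])
```

Running the check before the Eval keeps a round that touched locked files from ever being scored.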

Core Design Principle 2: Fixed Validation Standard

The metric and budget must stay constant across rounds.

Metric: val_bpb (bits per byte) in autoresearch; eval pass rate / rubric score in skill engineering.

Time/Budget: fixed 5‑minute training in autoresearch; token or time limit per experiment in skill engineering.

Comparability: scoring every round against the same baseline file (e.g., baseline-2026-06-01.json) keeps results comparable across rounds.

validation:
  eval: evals.json
  pass_rate_threshold: 80%
  rubric_score_min: 9.0
  baseline: baseline-2026-06-01.json  # fixed baseline
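A fixed standard is easier to trust when something checks that the eval definition really did stay constant. The sketch below pins a SHA‑256 fingerprint of the eval bytes before the first round and compares it before each later round. The helper names are hypothetical, and the byte string is only a stand‑in for the contents of evals.json.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """SHA-256 hex digest of the eval definition bytes."""
    return hashlib.sha256(data).hexdigest()


def assert_unchanged(pinned: str, current: bytes) -> None:
    """Raise if the eval definition differs from the pinned fingerprint."""
    if fingerprint(current) != pinned:
        raise RuntimeError("evals.json changed since the run started; scores are not comparable")


# Usage: pin once before round 1, check before every later round.
evals_bytes = b'{"evals": [{"id": "demo-001"}]}'  # stand-in for the contents of evals.json
pinned = fingerprint(evals_bytes)
assert_unchanged(pinned, evals_bytes)  # passes silently when nothing changed
```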

Core Design Principle 3: Permission Boundary

Define what the agent may change.

Allowed in SKILL.md: trigger description, step order, checklist, template content, examples – all low‑risk, language‑only changes that Eval can verify.

Not allowed: business logic, external commitments, permission/security rules, and the validation definition (evals.json) – these require domain knowledge or carry high risk.

✅ Auto‑editable:
  - description, steps, checklist, template, examples
❌ Not auto‑editable:
  - business logic, external commitments, security rules
  - evals.json, scripts/, references/
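Within SKILL.md itself, the boundary can be checked at section level. The sketch below assumes, purely for illustration, that SKILL.md is organised under `## ` headings whose names match the allowed list; `split_sections` and `forbidden_changes` are hypothetical helpers, not part of any tool named here.

```python
# Assumption: section names mirror the auto-editable list above.
ALLOWED_SECTIONS = {"description", "steps", "checklist", "template", "examples"}


def split_sections(markdown: str) -> dict:
    """Split markdown into {heading: body} using '## ' headings."""
    current = "preamble"
    sections = {current: []}
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip().lower()
            sections[current] = []
        else:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


def forbidden_changes(old: str, new: str) -> list:
    """Names of sections that changed but are not in ALLOWED_SECTIONS."""
    before, after = split_sections(old), split_sections(new)
    changed = {n for n in before.keys() | after.keys() if before.get(n) != after.get(n)}
    return sorted(changed - ALLOWED_SECTIONS)
```

An edit that only rewrites the steps returns an empty list, while an edit that touches a security section is reported by name and can be reverted before evaluation.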

Program.md as the Research Process

In autoresearch the human iterates program.md (research orchestration) while the agent iterates train.py (experiment object). The same pattern maps to skill engineering:

program.md ↔ overnight-config.yaml (human‑edited experiment config)

train.py ↔ SKILL.md (agent‑editable skill content)

You program a set of rules for the agent to explore, instead of writing a static skill for the agent to execute.

Practical Overnight Experiment (WeChat Article Review)

# overnight-skill-experiment.yaml
skill: wechat-article-review
budget:
  tokens: 500K
  time: 8 hours
experiments: 50

validation:
  eval: evals.json
  baseline: baseline-2026-06-01.json
  pass_rate_threshold: 80%
  rubric_score_min: 9.0

modification_scope:
  allowed: [description, steps, checklists]
  forbidden: [business_logic, safety_rules, evals_definition]

ratchet:
  keep_if: pass_rate >= baseline
  revert_if: pass_rate < baseline
  update_baseline_on_keep: true

output:
  log: skill-experiment-log.md
  best_skill: SKILL.md.best
  baseline_updates: baseline-history.jsonl
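The budget block above implies a stop condition: the run ends as soon as any one of the three limits is hit. A minimal sketch of that guard follows, assuming the limits from the config (500K tokens, 8 hours, 50 experiments). The `Budget` class and the 20,000‑token cost per experiment are illustrative assumptions.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Budget:
    max_tokens: int = 500_000        # tokens: 500K
    max_seconds: float = 8 * 3600    # time: 8 hours
    max_experiments: int = 50        # experiments: 50
    tokens_used: int = 0
    experiments_run: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def record(self, tokens: int) -> None:
        """Account for one finished experiment."""
        self.tokens_used += tokens
        self.experiments_run += 1

    def exhausted(self) -> bool:
        """True once any of the three limits has been reached."""
        return (
            self.tokens_used >= self.max_tokens
            or self.experiments_run >= self.max_experiments
            or time.monotonic() - self.started_at >= self.max_seconds
        )


budget = Budget()
while not budget.exhausted():
    budget.record(tokens=20_000)  # stand-in for one agent edit plus one Eval run
```

With these illustrative numbers the token limit is reached first, after 25 experiments, so the experiment cap of 50 never comes into play.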

Sample log excerpt:

# skill-experiment-log.md
## Experiment 001
- Modification: adjust trigger word, add "WeChat article" keyword
- Result: pass_rate 78% → revert (below baseline 80%)
- Action: revert to baseline

## Experiment 002
- Modification: reorder steps to "check title → check structure → check language"
- Result: pass_rate 82% → keep (above baseline)
- Action: update baseline to 82%

## Experiment 048
- Modification: streamline trigger, merge steps, optimise checklist
- Result: pass_rate 92% → keep (best)
- Action: save best version as SKILL.md.best

Ratchet Mechanism (Keep / Revert)

# Autoresearch ratchet
Experiment → evaluate val_bpb
├─ improvement (val_bpb < baseline) → keep (update baseline)
└─ regression (val_bpb ≥ baseline) → revert (discard change)

Skill‑engineering mapping:

# Skill ratchet
Agent edits SKILL.md → run Eval
├─ pass_rate ≥ baseline → keep (update baseline)
└─ pass_rate < baseline → revert (keep old baseline)

ratchet:
  keep_if: pass_rate >= baseline
  revert_if: pass_rate < baseline
  update_baseline_on_keep: true
  max_baseline_history: 100  # retain history for rollback

Safety guarantees:

Baseline anchor: each improvement has a safe rollback point.

No regression: a change that scores below the current baseline is never accepted.

Traceability: baseline-history.jsonl retains all versions.
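The skill ratchet can be sketched in a few lines. This is an illustration of the `keep_if: pass_rate >= baseline` rule, not the implementation used by Darwin or autoresearch; the function names are hypothetical, and the scores echo the sample log with one extra regression added.

```python
def ratchet(baseline: float, candidate: float) -> tuple:
    """Return (decision, new_baseline) using keep_if: pass_rate >= baseline."""
    if candidate >= baseline:
        return "keep", candidate
    return "revert", baseline


def run_ratchet(start: float, scores: list) -> tuple:
    """Apply the ratchet to a sequence of pass rates; return (final_baseline, history)."""
    baseline, history = start, []
    for score in scores:
        decision, baseline = ratchet(baseline, score)
        history.append((score, decision, baseline))
    return baseline, history


final, history = run_ratchet(0.80, [0.78, 0.82, 0.81, 0.92])
# history holds one (score, decision, baseline_after) tuple per experiment
```

Note that the skill ratchet keeps a tie (pass_rate equal to the baseline), whereas the autoresearch ratchet shown earlier reverts when val_bpb equals the baseline. If ties should not churn SKILL.md, changing `>=` to `>` is the only edit needed.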

Summary of L4 Evolution

L4 evolution = programming a set of exploration rules for the agent.

Three Design Principles

Single modification surface: only SKILL.md is mutable.

Fixed validation standard: evals.json and the scoring rules remain unchanged; the baseline moves only when the ratchet keeps a change.

Human‑set boundary: business logic, permissions, and infrastructure stay immutable.

L3 Eval‑driven: Human edits Skill → Eval decides → Human edits again
L4 Automatic: Human designs boundary → Agent edits Skill → Eval decides → keep/revert

Key change: humans become “boundary designers” rather than direct editors.

Seven‑Step Closed Loop (Skill‑Engineering Scaffold)

① Collect feedback: after real use, append a JSON line to skill-issues.jsonl
② Triage: triage_issues.py ranks issues by severity × frequency × eval conversion
③ Issue → Eval: convert_issue_to_eval.py structures the issue as a test case
④ Baseline: grade_evals.py snapshots the current state
⑤ Single‑hypothesis mutation: write evolution-hypothesis.json, change SKILL.md in one dimension
⑥ Verify: grade_evals.py → check_regression.py → KEEP or REVERT
⑦ Release: update CHANGELOG, bump version, mark_issue_resolved.py

Step 1 – Collect Feedback

Append a line to skill-issues.jsonl after each task, e.g.:

{"date":"2026-06-22","skill":"frontend-dev-prompt-craft","task_type":"API","symptom":"PRD path and project request path mismatch","expected":"output‑contract must record both PRD and project paths and label Mock/Integration","severity":"high","source":"session_retro","converted_to_eval":false,"eval_id":null,"status":"open"}
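Appending the record can be done with a small helper. The sketch below is an assumption about how such a helper might look, not part of the scaffold: `append_issue` and the `REQUIRED` field list are hypothetical, and the tracking fields default to the values shown in the example line.

```python
import json
from pathlib import Path

# Assumption: these fields must be present for triage to work.
REQUIRED = ("date", "skill", "symptom", "expected", "severity")


def append_issue(log_path, issue: dict) -> dict:
    """Append one issue as a single JSON line, filling the tracking defaults."""
    missing = [key for key in REQUIRED if key not in issue]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Defaults first, so values supplied by the caller take precedence.
    record = {"converted_to_eval": False, "eval_id": None, "status": "open", **issue}
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

Opening the file in append mode keeps earlier lines untouched, which is what makes the JSONL file usable as a running log.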

Step 2 – Triage

Run:

./plugins/frontend-team-toolkit/skill-engineering/bin/run-evolution-cycle.sh \
  --skill frontend-dev-prompt-craft --phase triage

The script scores issues by severity × repeat × eval‑converted. The highest‑scoring open issue is selected for the next evolution round.
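The actual weights inside triage_issues.py are not given here, so the following is only a sketch of what a severity × repeat × eval‑converted score could look like. The severity weights, the use of identical symptoms as the repeat count, and the 0.5 factor for already‑converted issues are all assumptions.

```python
from collections import Counter

SEVERITY_WEIGHT = {"high": 3, "medium": 2, "low": 1}  # hypothetical weights


def triage(issues: list) -> list:
    """Return open issues sorted by severity x repeat x eval-converted factor, highest first."""
    open_issues = [i for i in issues if i.get("status") == "open"]
    repeats = Counter(i["symptom"] for i in open_issues)

    def score(issue):
        # Assumption: issues not yet covered by an eval count for more.
        converted_factor = 0.5 if issue.get("converted_to_eval") else 1.0
        return SEVERITY_WEIGHT[issue["severity"]] * repeats[issue["symptom"]] * converted_factor

    return sorted(open_issues, key=score, reverse=True)


issues = [
    {"symptom": "a", "severity": "low", "status": "open"},
    {"symptom": "b", "severity": "high", "status": "open"},
    {"symptom": "b", "severity": "high", "status": "open"},
    {"symptom": "c", "severity": "high", "status": "resolved"},
]
ranked = triage(issues)  # the first element is the candidate for the next round
```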

Step 3 – Issue → Eval

Convert an issue line to a structured eval case in evals/evals.json. Field mapping (excerpt):

symptom → prompt (minimal reproducible user request)

expected → expected[] (each expected behavior as a separate string)

severity → risk (direct mapping)

Prohibited actions: merging multiple issues into one eval, using vague expectations, or editing the baseline before the mutation.
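The field mapping can be expressed as a small conversion function. This is a sketch, not the behavior of convert_issue_to_eval.py: the `id` field, the `issue_to_eval` name, and splitting `expected` on semicolons are assumptions, and the copied symptom still needs a human to rewrite it as a minimal reproducible request.

```python
def issue_to_eval(issue: dict, eval_id: str) -> dict:
    """Build exactly one eval case from exactly one issue."""
    expected = issue["expected"]
    if isinstance(expected, str):
        # Assumed delimiter: ';' separates individual expected behaviors.
        expected = [part.strip() for part in expected.split(";") if part.strip()]
    if not expected:
        raise ValueError("an eval needs at least one concrete expectation")
    return {
        "id": eval_id,
        "prompt": issue["symptom"],   # placeholder until rewritten as a minimal request
        "expected": list(expected),   # one string per expected behavior
        "risk": issue["severity"],    # direct mapping
    }
```

Taking a single issue as input, rather than a list, mirrors the rule against merging multiple issues into one eval.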

Step 4 – Baseline

When fixtures exist (CI‑friendly):

python3 plugins/frontend-team-toolkit/skill-engineering/scripts/grade_evals.py \
  --skill frontend-dev-prompt-craft \
  --skills-base plugins/frontend-team-toolkit/skills \
  --mode all \
  --append-results

When real agents are needed (no fixture):

python3 plugins/frontend-team-toolkit/skill-engineering/scripts/run_evals.py \
  --mode release --skill frontend-dev-prompt-craft \
  --skill-base-path plugins/frontend-team-toolkit/skills \
  --output /tmp/results.tsv

Step 5 – Single‑Hypothesis Mutation

Write evolution-hypothesis.json describing the one‑dimensional change, e.g.:

{
  "skill":"frontend-dev-prompt-craft",
  "target_eval":"frontend-dev-prompt-craft-007",
  "problem":"PRD path and project request path are not both recorded",
  "proposed_change":"validate-output.sh --chain add PRD path / project path / userType check",
  "dimension":"output-contract",
  "rollback_condition":"eval-001~006 regression fail",
  "issue_ref":"skill-issues.jsonl:L10,L11,L14",
  "status":"verified"
}

Editable dimensions include trigger, workflow, output‑contract, template, anti‑pattern.
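A hypothesis file can be checked before the mutation runs, so that a multi‑dimension change is caught early. The validator below is a hypothetical sketch: the required‑key list is inferred from the example above, and the dimension names are written with plain ASCII hyphens as in the JSON.

```python
EDITABLE_DIMENSIONS = {"trigger", "workflow", "output-contract", "template", "anti-pattern"}
# Assumption: keys taken from the example hypothesis file.
REQUIRED_KEYS = {"skill", "target_eval", "problem", "proposed_change", "dimension", "rollback_condition"}


def validate_hypothesis(hypothesis: dict) -> list:
    """Return a list of problems; an empty list means the hypothesis is well-formed."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(hypothesis))]
    dimension = hypothesis.get("dimension")
    if isinstance(dimension, (list, tuple, set)):
        problems.append("dimension must be a single value, not a collection")
    elif dimension is not None and dimension not in EDITABLE_DIMENSIONS:
        problems.append(f"dimension not editable: {dimension}")
    return problems
```

Returning a list of problems instead of raising on the first one lets a reviewer see every defect in the hypothesis at once.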

Step 6 – Verify (Spot → Regression → Ratchet)

Run full fixture regression and apply the ratchet gate:

./plugins/frontend-team-toolkit/skill-engineering/bin/run-evolution-cycle.sh \
  --skill frontend-dev-prompt-craft --phase verify --apply-results

python3 plugins/frontend-team-toolkit/skill-engineering/scripts/check_regression.py \
  --results plugins/frontend-team-toolkit/skills/frontend-dev-prompt-craft/results.tsv \
  --risk high --block true

Decision table:

Target eval PASS + all regression green → KEEP (bump version, write CHANGELOG).

Any high‑risk regression fails → REVERT (roll back mutation).

New capability eval below threshold → WARN (record but do not block KEEP).
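The decision table reads naturally as a function. The sketch below is an interpretation rather than the logic of check_regression.py: the table does not say what happens when the target eval fails without any regression, so treating that case as REVERT is an assumption, as are the function and parameter names.

```python
def decide(target_pass: bool, high_risk_regressions: int,
           new_capability_below_threshold: bool = False) -> tuple:
    """Return (decision, warnings) for one verification run."""
    warnings = []
    if new_capability_below_threshold:
        # WARN is recorded but never blocks a KEEP.
        warnings.append("new capability eval below threshold")
    if high_risk_regressions > 0:
        return "REVERT", warnings
    if not target_pass:
        # Assumption: a failed target eval means the mutation did not help.
        return "REVERT", warnings
    return "KEEP", warnings
```

For example, `decide(True, 0, new_capability_below_threshold=True)` yields a KEEP with one recorded warning, while any non‑zero count of high‑risk regressions yields a REVERT.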

Step 7 – Release & Monitoring

Typical actions after a KEEP:

Update CHANGELOG.md and .skill-meta.json (version, baseline).

Mark related issues as fixed via mark_issue_resolved.py.

Optionally write LEARNINGS.md to capture lessons.

Run evolution_report.py to view trend.

./plugins/frontend-team-toolkit/skill-engineering/bin/run-evolution-cycle.sh \
  --skill frontend-dev-prompt-craft --phase report

Real‑World Timeline (frontend‑dev‑prompt‑craft)

v0.1.0 → v0.1.1: Observation + First Eval

Issue: 2026‑06‑22 “order inheritance” generated 8 retro issues.

Action: added workflow steps for repository scanning, added checkpoint mapping for dual‑track paths.

Added eval‑007 for craft‑loop use case.

Baseline: eval‑001~006 already PASS (14/14).

v0.1.1 → v0.1.2: First Self‑Evolution (KEEP)

Selected high‑severity issue L10 (API path dual‑track).

Dimension: output‑contract.

Change: validate-output.sh --chain to check PRD path / project path / userType.

Marked 4 issues (L9‑L11‑L14) as fixed.

Verification: eval‑007 fixture PASS → KEEP.

v0.1.2 → v0.1.3: Full Fixture Completion

Problem: eval‑001~006 lacked fixtures, preventing full CI automation.

Action: added 6 golden fixtures, extended validate-output.sh with profile flags.

All 7 evals now bound to fixtures + validation script.

Verification: 7/7 PASS, maturity upgraded to beta.

Takeaways

L4 ≠ fully automatic skill writing. Humans design boundaries; the agent only edits SKILL.md and Eval decides KEEP/REVERT.

Three principles: single modification surface, fixed validation standard, human‑set permission boundary.

Seven‑step closed loop: issue → triage → convert → baseline → single‑hypothesis mutation → verify → release.

Mnemonic: “Problem observable, Eval anchored, Mutation single‑hypothesis, Regression ratchet.”

Based on karpathy/autoresearch, Darwin skill ratchet, Anthropic skill‑creator, and the skill‑engineering scaffold v1.0 trial records.
