Automatically Evolve Claude Code Skills: Open‑Source System That Strengthens AI Tools Over Time
The darwin-skill project introduces a ratchet-based optimization loop: it scores each Skill on eight dimensions, generates improvement proposals, commits changes, re-scores, and retains only upgrades, pausing for human confirmation between phases. This makes maintaining dozens of AI agent Skills scalable.
Pain Points
Agent Skill ecosystems have grown rapidly, with tools such as Claude Code, Codex, OpenClaw, Trae, and CodeBuddy supporting the SKILL.md format. Maintaining a small number of Skills is feasible, but managing 60+ Skills becomes difficult. Traditional Skill review checks only format, step numbering, and path accessibility, yet a perfectly formatted Skill may still produce poor results.
How Darwin‑skill Addresses the Problem
Inspired by Andrej Karpathy’s autoresearch, the system moves the autonomous experiment loop from model training to Skill optimization. The core mechanism is a ratchet: scores can only increase; each iteration either improves the Skill or cleanly rolls back, preventing gradual degradation.
Process:
1. Identify the lowest-scoring dimension.
2. Generate an improvement plan for that dimension.
3. Edit SKILL.md and commit via git.
4. A sub-agent re-scores the updated Skill.
5. If the new score exceeds the old score, keep the change; otherwise, revert the commit.
After each Skill is optimized, the system pauses, shows the diff and score change, and waits for user confirmation before proceeding to the next Skill.
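The steps above can be sketched as pure decision logic. The helper names below (`score`, `propose`, `apply_edit`) are hypothetical stand-ins for the project's git-backed steps, not its actual API:

```python
def ratchet_step(skill_text, score, propose, apply_edit):
    """One iteration: target the weakest dimension, keep the edit only if the total rises.

    score(text)        -> dict of dimension name -> points (stand-in for the sub-agent)
    propose(text, dim) -> an improvement plan for that dimension
    apply_edit(t, p)   -> the edited SKILL.md text (stand-in for edit + git commit)
    """
    before = score(skill_text)                    # baseline from the scoring sub-agent
    weakest = min(before, key=before.get)         # lowest-scoring dimension
    candidate = apply_edit(skill_text, propose(skill_text, weakest))
    after = score(candidate)                      # independent re-score of the edit
    if sum(after.values()) > sum(before.values()):
        return candidate, after                   # ratchet clicks forward: keep it
    return skill_text, before                     # regression: roll back to the baseline
```

A real run would replace `apply_edit` with a git commit and the rollback branch with a revert; the keep-or-revert comparison is the whole ratchet.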
Eight‑Dimension Evaluation System
The total score of 100 is split into two major blocks:
Structure (60 points): assessed via static analysis, covering format compliance, path validity, and step completeness.
Effectiveness (40 points): requires empirical testing; a Skill that looks good but performs poorly receives zero here. The empirical-performance dimension carries the highest weight (25 points).
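A weight table for such a rubric might look like the sketch below. Only the 60/40 split and the 25-point empirical-performance weight come from the write-up; the remaining dimension names and their point splits are illustrative placeholders, not the project's actual rubric:

```python
# Hypothetical 100-point rubric. The 60/40 block split and the 25-point
# empirical-performance weight are from the article; everything else is a placeholder.
WEIGHTS = {
    "structure": {                     # static analysis, 60 points total
        "format_compliance": 20,       # placeholder split
        "path_validity": 20,
        "step_completeness": 20,
    },
    "effectiveness": {                 # empirical testing, 40 points total
        "empirical_performance": 25,   # highest-weighted single dimension
        "output_quality": 15,          # placeholder
    },
}

def total_score(scores):
    """Sum earned points, capping each dimension at its maximum weight."""
    return sum(
        min(scores.get(dim, 0), w)
        for block in WEIGHTS.values()
        for dim, w in block.items()
    )
```

Under these placeholder weights, a Skill that passes every static check but scores zero on empirical performance tops out at 75, which is the rubric's point: formatting alone cannot carry a Skill.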
Five Core Principles
Single Editable Asset: modify only one SKILL.md at a time, keeping variables controllable and improvements attributable.
Dual Evaluation: combine structural scoring (static analysis) with effect verification (run tests and check output).
Ratchet Mechanism: retain only improvements; automatically roll back regressions so scores never decrease.
Independent Scoring: use a sub-agent for scoring to avoid self-bias.
Human in the Loop: pause after each Skill optimization for user confirmation before continuing.
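The ratchet principle maps naturally onto plain git: commit the candidate edit, re-score, and reset if the score regressed. A minimal sketch under that assumption (the real tool's git usage may differ, and `scorer` is a hypothetical callback):

```python
import subprocess

def commit_and_score(repo, skill_path, scorer, best):
    """Commit the edited SKILL.md, re-score, and hard-reset if it regressed."""
    subprocess.run(["git", "-C", repo, "add", skill_path], check=True)
    subprocess.run(["git", "-C", repo, "commit", "-qm", "candidate edit"], check=True)
    new = scorer(repo)                   # independent re-score of the committed state
    if new > best:
        return new                       # keep the commit: the ratchet advances
    # Regression: drop the candidate commit so the baseline stays locked.
    subprocess.run(["git", "-C", repo, "reset", "--hard", "HEAD~1"], check=True)
    return best
```

`reset --hard HEAD~1` discards the bad commit entirely; `git revert` would be the history-preserving alternative. Either way the working tree returns to the best-known version.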
Five Stages of the Optimization Loop
The system runs autonomously within each stage but pauses between stages for human confirmation:
Phase 1: Assess current state and establish a baseline score.
Phase 2: Generate and execute an improvement plan.
Phase 3: Verify the effect of the improvement.
Phase 4: Ratchet decision – keep or revert the change.
Phase 5: User confirmation, then move to the next Skill.
Ratchet Mechanism Example
In a second round, a score of 75 fell below the current best of 78, triggering an automatic revert. The effective baseline remains locked at 78, and subsequent improvements build from that point. Scores can only ascend; regressions are fully eliminated.
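Stated as a one-line invariant (illustrative, not the project's code):

```python
def ratchet(best, candidate):
    """The locked baseline only moves up; a lower candidate is discarded."""
    return max(best, candidate)
```

Round two from the example: `ratchet(78, 75)` leaves the baseline at 78, while a later round scoring above 78 would move it up.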
How to Use
Installation command: `npx skills add alchaincyf/darwin-skill`. After installation, invoke any Skill-compatible agent tool with prompts such as “optimize all skills” or “optimize a specific skill.”
Conclusion
Design philosophy: create Skills like Nüwa (the creator goddess of Chinese myth), then let Darwin evolve them. By retaining only improvements, time works in your favor.
GitHub: https://github.com/alchaincyf/darwin-skill
Geek Labs
Daily shares of interesting GitHub open-source projects. AI tools, automation gems, technical tutorials, open-source inspiration.