EvoSkill: Turning AI Failures into 12% Accuracy Gains with Automated Skill Evolution

The EvoSkill framework, introduced by researchers from Sentient and Virginia Tech, equips large language model agents with a text‑feedback loop that automatically discovers, refines, and validates reusable Skills, boosting task‑specific accuracy by up to 12.1 percentage points and enabling cross‑domain transfer without altering the underlying model parameters.


Background and Motivation

Current AI programming assistants such as Claude Code, OpenHands, and Codex rely on code as an intermediate representation, allowing agents to act as general problem solvers. However, this flexibility does not automatically provide the domain expertise required for highly specialized tasks, and most Skills are manually authored, which is time‑consuming and limits scalability.

Missing Professional Skills

Developers typically enhance systems with manually crafted Skills—structured workflows, operation guides, and auxiliary code—that must be written by experts with deep business knowledge. As the number of target applications grows, manual Skill creation becomes a bottleneck.

EvoSkill Framework

EvoSkill addresses this bottleneck by shifting the focus from low‑level prompts or code to higher‑level, structured, reusable Skills. The framework employs a text‑feedback mechanism where three specialized agents collaborate:

Executor Agent: Starts from a blank slate without any pre‑existing Skills and attempts to complete tasks.

Proposer Agent: Analyzes the Executor’s execution traces, compares predictions with ground‑truth answers, pinpoints failure causes, and proposes new or revised Skills.

Skill‑Builder Agent: Implements the Proposer’s suggestions, generating concrete Skill folders that include metadata, formatted guides, and optional Python or TypeScript helper scripts.
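The Skill artifacts the Skill‑Builder produces can be sketched as a small data structure. The field names and `SKILL.md` layout below are illustrative assumptions, not the framework's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    """One generated Skill folder: metadata, a formatted guide, optional helpers."""
    name: str
    description: str                 # metadata an Executor could use to select Skills
    guide_md: str                    # the formatted, human-readable guide
    helper_scripts: dict[str, str] = field(default_factory=dict)  # filename -> source

    def to_files(self) -> dict[str, str]:
        """Flatten the Skill into the files its folder would contain."""
        files = {"SKILL.md": f"# {self.name}\n\n{self.description}\n\n{self.guide_md}"}
        files.update(self.helper_scripts)
        return files

# Hypothetical Skill mirroring the data-extraction-verification Skill discussed later.
skill = Skill(
    name="data-extraction-verification",
    description="Re-check parsed table cells against the source before use.",
    guide_md="1. Extract the table.\n2. Cross-check row totals.\n3. Flag mismatches.",
    helper_scripts={"verify.py": "def crossfoot(rows): ...\n"},
)
files = skill.to_files()
```

The point of the flat `to_files` view is that a Skill is just versionable text: it can be diffed, validated, and merged across runs without touching model weights.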

Each new Skill undergoes strict isolation testing; only those that improve performance on a validation set are retained. The system maintains a Pareto‑optimal pool of elite programs, replacing the weakest members only when a candidate surpasses them.
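The elite-pool bookkeeping can be sketched as follows, simplified to a single scalar validation accuracy per candidate (the actual framework maintains Pareto optimality, which involves multiple objectives; the names here are hypothetical):

```python
import heapq

def update_pool(pool, candidate, pool_size=5):
    """Maintain a bounded elite pool of (accuracy, skill_set_name) pairs.

    A candidate enters only if the pool has room or the candidate strictly
    beats the current weakest member, which is then evicted. The list is a
    min-heap, so pool[0] is always the weakest surviving member.
    """
    if len(pool) < pool_size:
        heapq.heappush(pool, candidate)
    elif candidate[0] > pool[0][0]:       # strictly better than the weakest
        heapq.heapreplace(pool, candidate)
    return pool

# Toy run: candidates that fail validation never displace elite members.
pool = []
for cand in [(0.606, "baseline"), (0.658, "round-3"), (0.61, "round-1"),
             (0.62, "round-2"), (0.63, "round-4"), (0.679, "merged"),
             (0.58, "rejected")]:
    update_pool(pool, cand, pool_size=5)
```

The strict inequality matters: a candidate that merely ties the weakest member is rejected, which keeps the pool from churning on noise.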

Failure‑Driven Evolution

The core loop treats failure as a learning signal. The framework selects tasks the current agents cannot solve, conducts deep error analysis, and updates the Skill knowledge base while keeping the underlying large language model frozen.
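One round of this loop can be sketched as below. All four callables are hypothetical stand-ins for the paper's agents: `executor(task)` returns a prediction, `proposer(failures)` returns a Skill proposal, `builder(proposal)` materializes it, and `validate(skill)` scores a held-out validation set:

```python
def evolve_round(tasks, executor, proposer, builder, validate):
    """One round of failure-driven Skill evolution; model weights stay frozen."""
    # 1. Run the current agent and collect the tasks it gets wrong.
    failures = []
    for task in tasks:
        prediction = executor(task)
        if prediction != task["answer"]:
            failures.append((task, prediction))
    if not failures:
        return None                      # nothing to learn from this batch

    # 2. Analyze the failure traces and propose a new or revised Skill.
    proposal = proposer(failures)

    # 3. Materialize the proposal as a concrete Skill folder.
    skill = builder(proposal)

    # 4. Retain the Skill only if it beats the current setup on validation.
    return skill if validate(skill) > validate(None) else None
```

Returning `None` both when there are no failures and when the candidate fails validation keeps the knowledge base monotone: it only ever grows with Skills that demonstrably help.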

Experimental Evaluation

OfficeQA Benchmark: A complex document‑reasoning dataset built from U.S. Treasury reports (≈89,000 pages). Using Claude Code with Opus 4.5, the baseline accuracy without any Skills was 60.6%. After several rounds of EvoSkill evolution on only 10% of the data, accuracy rose to 65.8%. Merging independently discovered Skills into a unified Skill library pushed the best accuracy to 67.9%, a 7.3‑point gain.

SealQA Benchmark: An open‑web QA set requiring robust search and verification. EvoSkill lifted accuracy from 26.6% to 38.7%, a 12.1‑point absolute increase.

Key discovered Skills included a data‑extraction verification Skill that mitigates table‑parsing errors and a quantitative analysis Skill that enforces pre‑calculation data checks, preventing systematic failures.
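As one illustration of what the data‑extraction verification Skill's helper might do (a hypothetical cross‑footing check, not the paper's actual code), parsed table columns can be validated against the document's stated totals before any downstream calculation:

```python
def crossfoot_check(rows, totals, tol=1e-6):
    """Verify that each parsed column sums to the document's stated total.

    rows:   list of numeric row tuples extracted from a table
    totals: the table's own 'Total' row, as parsed
    Returns the indices of columns whose sums disagree with the stated
    totals, signaling a likely table-parsing error before it propagates.
    """
    column_sums = [sum(column) for column in zip(*rows)]
    return [i for i, (s, t) in enumerate(zip(column_sums, totals))
            if abs(s - t) > tol]

# Toy example: column 1 of the parsed rows disagrees with the stated total,
# so the Skill would flag it and trigger re-extraction instead of computing.
rows = [(100.0, 40.0), (250.0, 60.0)]
bad_columns = crossfoot_check(rows, totals=(350.0, 90.0))
```

Gating calculation on checks like this is exactly the "pre‑calculation data check" behavior the quantitative analysis Skill enforces.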

Cross‑Domain Transfer

The “search‑persistence” protocol Skill, originally evolved on SealQA, was directly applied to the BrowseComp web‑browsing QA benchmark. Despite the different task nature, accuracy improved from 43.5% to 48.8% (+5.3 points). This demonstrates that EvoSkill‑generated modular Skills can generalize beyond their original domains.

Conclusion

By moving optimization from opaque prompts and code to structured, reusable Skills, EvoSkill provides a scalable path for continuous AI improvement. The framework’s evolutionary loop, grounded in failure analysis and rigorous validation, creates modular expertise that can be transferred to novel tasks, effectively turning each mistake into a lifelong learning module for AI agents.

Tags: AI · large language models · cross-domain transfer · Evolutionary Algorithms · Automated Learning · Skill Evolution
Written by SuanNi

A community for AI developers that aggregates large-model development services, models, and compute power.