Why Overly Detailed AI Skills Hurt Performance: The Golden Rule for Large Model Experience Reuse

A Tsinghua and EvoMap study of 4,590 controlled experiments across 45 scientific tasks shows that feeding large language models a detailed 2,500-token Skill document degrades pass rates, while a compact 230-token strategy gene lifts performance by up to 3 percentage points.


Experience Overload

Researchers evaluated whether detailed experience manuals (Skill documents) improve large language model (LLM) performance on scientific code‑solving tasks. The baseline (no guidance) average pass rate across 45 benchmark scenarios was 51.0%.

Adding a full 2,500-token Skill document reduced the average pass rate to 49.9%. On the Gemini 3.1 Pro Preview model, the pass rate fell from 60.1% to 50.7%.

Stripping the Fat

Dissecting the Skill document showed that only the workflow sections contributed positive gains; the overview and background paragraphs had a strongly negative impact. Most of the content was inert, supplying no usable control signal.

When the Skill was aggressively trimmed to a 230-token budget (matching the size of a strategy gene), performance recovered, confirming that the degradation stemmed from packaging overload rather than from the experience content itself.
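The article does not describe how the trimming was done. As a rough sketch, assuming a tiktoken-style tokenizer as the token counter and an invented section-priority order (the paper's own procedure may differ), enforcing a 230-token budget could look like this:

```python
# Illustrative sketch only: the paper's trimming procedure is not published.
# Uses the tiktoken library as a stand-in token counter; the priority order
# below is an assumption based on the article's section-level findings.
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

def token_count(text: str) -> int:
    return len(ENC.encode(text))

def trim_to_budget(sections: dict[str, str], budget: int = 230) -> str:
    """Keep the most useful sections first until the token budget is spent."""
    # Workflow/strategy content helped in the paper's dissection;
    # overview and background hurt, so they come last.
    priority = ["warnings", "strategy_steps", "summary", "keywords",
                "workflow", "overview", "background"]
    kept, used = [], 0
    for name in priority:
        body = sections.get(name, "")
        cost = token_count(body)
        if body and used + cost <= budget:
            kept.append(body)
            used += cost
    return "\n\n".join(kept)
```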

Strategy Gene Revolution

The team introduced a “strategy gene” (Gene) and a Gene Evolution Protocol (GEP). A gene encodes experience as high‑density control signals: concise keywords, brief summary, core strategy steps, and explicit warnings.
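The paper's exact schema is not reproduced in the article; a minimal sketch of a gene built around the four components it names (field names and rendering format are assumptions, not the authors' schema) might look like:

```python
from dataclasses import dataclass

@dataclass
class StrategyGene:
    """Minimal sketch of a strategy gene. The four fields follow the
    components named in the article; names and layout are assumptions,
    not the authors' schema."""
    keywords: list[str]        # concise retrieval keywords
    summary: str               # brief task summary
    strategy_steps: list[str]  # core, actionable strategy steps
    warnings: list[str]        # explicit failure warnings

    def render(self) -> str:
        """Render the gene as a compact prompt block (target: ~230 tokens)."""
        return "\n".join([
            "KEYWORDS: " + ", ".join(self.keywords),
            "SUMMARY: " + self.summary,
            "STRATEGY:",
            *[f"  {i}. {step}" for i, step in enumerate(self.strategy_steps, 1)],
            "WARNINGS:",
            *[f"  - {w}" for w in self.warnings],
        ])
```

In practice, the rendered block would be prepended to the task prompt in place of the full Skill document.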

Results with the 230-token gene:

Dual-model average: 54.0% (+3.0 pp over the 51.0% baseline)

Gemini 3.1 Pro: 59.9%

Gemini 3.1 Flash Lite: 48.2% (up from 41.8%)

Ablation studies demonstrated that merely reducing token count was insufficient; performance peaked only after reorganizing experience into actionable strategy steps.
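To make the ablation concrete, here is an invented pair of packagings carrying the same information (neither string comes from the paper); only the second, step-structured form matches the organization that peaked in the ablation:

```python
# Invented illustration: identical advice, two packagings. The paper's
# ablation found that only step-structured organization recovered the
# full gain; these example strings are not from the paper itself.

prose_form = (
    "When solving stiff differential equations it is generally advisable "
    "to consider implicit methods, since explicit integrators can become "
    "unstable unless very small step sizes are used."
)

step_form = (
    "STRATEGY:\n"
    "  1. Check stiffness before choosing an integrator.\n"
    "  2. Use an implicit method (e.g., BDF) for stiff systems.\n"
    "WARNINGS:\n"
    "  - Explicit integrators diverge on stiff systems at normal step sizes."
)
```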

Robustness and Composability

Robustness tests:

Replacing the algorithm or domain information inside a gene caused pass rates to drop to 48.8%–49.4%.

Over-constraining the gene did little harm: an over-constrained variant still achieved 55.9%.

Combining genes did not yield linear gains: two complementary genes fell to 44.9%, while two conflicting genes still scored 53.2%.

Evolutionary Carrier for Testing

To assess long‑term evolution, the researchers deployed a gene‑driven system on the CritPt benchmark using the OpenClaw runtime and EvoMap’s Evolver engine. The system continuously integrated failure histories, performed strict validation, and solidified successful updates.
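The Evolver's internals are not shown in the article. A hedged sketch of the loop it describes (integrate failure histories, validate strictly, solidify only improvements, keep patches reversible), with every function name hypothetical:

```python
import copy

def evolve(gene, tasks, run_task, diagnose, propose_patch, max_rounds=10):
    """Hypothetical sketch of the evolution loop described in the article.

    run_task(gene, task) -> bool    # did this task pass?
    diagnose(failures) -> str       # summarize the failure history
    propose_patch(gene, diagnosis) -> new gene  # minimal candidate edit
    All names are assumptions, not the EvoMap Evolver API.
    """
    def pass_rate(g):
        return sum(run_task(g, t) for t in tasks) / len(tasks)

    best, best_score = gene, pass_rate(gene)
    for _ in range(max_rounds):
        failures = [t for t in tasks if not run_task(best, t)]
        if not failures:
            break
        candidate = propose_patch(copy.deepcopy(best), diagnose(failures))
        score = pass_rate(candidate)      # strict validation gate
        if score > best_score:            # solidify the successful update
            best, best_score = candidate, score
        # a failed patch is simply discarded, keeping edits reversible
    return best, best_score
```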

Early evolution (February 2026) doubled the Pro model’s accuracy from 9.1% to 18.57% by iteratively diagnosing errors and applying minimal reversible patches. Subsequent iterations improved performance to 27.14% accuracy on a 70‑task suite.

Additional Findings

Attempts to expand a gene back into a full Skill erased most of the gain, yielding pass rates of only 52.0% and 51.5%.

Pure failure‑warning signals achieved the highest single‑component score of 54.4%.

Structured, compact formats consistently outperformed unstructured or overly verbose representations.

References:

arXiv: https://arxiv.org/abs/2604.15097

Code repositories: https://github.com/EvoMap/evolver and https://github.com/EvoMap/critpt-openclaw-reproducible-70

Tags: prompt engineering, large language models, AI evaluation, EvoMap, experience reuse, strategy gene