Is Your Skill Document Slowing Down the Model? Strategy‑Based Genes Are the Better Solution

The article analyses why large, document‑style Skill packages often degrade large‑model performance under limited inference budgets, introduces the compact, control‑dense Gene representation and the Gene Evolution Protocol (GEP), and shows through thousands of controlled experiments and CritPt benchmarks that Genes consistently outperform Skills, especially when token budget is tight.

Machine Heart

Problem

Practitioners often create long Skill documents that include background, workflow, pitfalls, API notes, examples, and other human‑readable material. Experiments show that injecting a full Skill package (≈2,500 tokens) into a model can reduce performance compared with a no‑guidance baseline, indicating that completeness dilutes the control signal.

Gene definition

A Gene is a compact, verifiable experience object designed for high control density under a limited token budget. It consists of four signal types:

keywords

summary

strategy (ordered actionable steps)

AVOID warnings (constraints to avoid)

Each Gene includes a SHA‑256 content‑addressed hash (asset_id) for immutable identification, matching, replacement, mutation, and audit.
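The four signal types plus a content-addressed identifier can be captured in a small data structure. A minimal sketch, assuming a JSON-canonicalized SHA-256 over the fields (the field names and serialization choice are illustrative; the exact schema in the Evolver repository may differ):

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Gene:
    """Compact experience object with the article's four signal types."""
    keywords: tuple[str, ...]   # domain keywords for matching
    summary: str                # one-line intent
    strategy: tuple[str, ...]   # ordered actionable steps
    avoid: tuple[str, ...]      # AVOID warnings (constraints)

    @property
    def asset_id(self) -> str:
        # Content-addressed SHA-256 over a canonical JSON serialization:
        # identical content always yields the same immutable identifier.
        payload = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

gene = Gene(
    keywords=("uv-vis", "peak detection", "FWHM", "unit conversion"),
    summary="Detect peaks and compute wavelength-domain peak properties correctly",
    strategy=(
        "Detect peaks with prominence-based criteria",
        "Convert min_distance into sample-index units before peak detection",
    ),
    avoid=("Reporting FWHM without converting peak_widths outputs back to wavelength units",),
)
print(gene.asset_id[:12])  # short prefix of the content hash
```

Because the hash is derived from content, any mutation of a field produces a new asset_id, which is what makes matching, replacement, and audit trails straightforward.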

Domain keywords: uv‑vis, peak detection, FWHM, unit conversion
Summary: Detect peaks and compute wavelength‑domain peak properties correctly
Strategy:
  1. Detect peaks with prominence‑based criteria
  2. Convert min_distance into sample‑index units before peak detection
AVOID: Reporting FWHM before converting peak_widths outputs back to wavelength units
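The example Gene's strategy maps directly onto SciPy's peak-detection API. A minimal sketch of the three points it encodes; the synthetic spectrum and all numeric parameters are illustrative assumptions, not values from the article:

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths

# Synthetic UV-Vis spectrum on a uniform wavelength grid (assumed format).
wavelengths = np.linspace(200.0, 800.0, 1201)  # nm, 0.5 nm step
spectrum = (np.exp(-0.5 * ((wavelengths - 450.0) / 8.0) ** 2)
            + 0.6 * np.exp(-0.5 * ((wavelengths - 520.0) / 12.0) ** 2))

d_lambda = wavelengths[1] - wavelengths[0]     # nm per sample

# Strategy step 2: convert a minimum peak separation given in wavelength
# units into sample-index units BEFORE calling find_peaks (its `distance`
# argument is in samples, not nanometres).
min_distance_nm = 20.0
min_distance_samples = int(round(min_distance_nm / d_lambda))

# Strategy step 1: prominence-based peak detection.
peaks, _ = find_peaks(spectrum, prominence=0.1, distance=min_distance_samples)

# AVOID warning: peak_widths returns widths in samples; convert back to
# wavelength units before reporting FWHM.
widths_samples, *_ = peak_widths(spectrum, peaks, rel_height=0.5)
fwhm_nm = widths_samples * d_lambda

for wl, fw in zip(wavelengths[peaks], fwhm_nm):
    print(f"peak at {wl:.1f} nm, FWHM ~ {fw:.1f} nm")
```

The unit-conversion steps are exactly the failure modes the Gene guards against: passing a wavelength-domain distance straight into `find_peaks`, or reporting sample-domain widths as if they were nanometres.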

Gene Evolution Protocol (GEP)

GEP strings together three artifacts in a six‑stage loop:

1. Distill past failures, successes, and repair paths into a Gene.

2. When a new task arrives, scan the task context, match the most relevant Gene, and inject it as a system instruction.

3. Execute the task.

4. Write an immutable Event recording the outcome.

5. Validate the Gene against the Event.

6. Mutate and solidify the Gene pool without updating base model parameters.

The three artifacts are:

Gene: the control signal.

Capsule: a validated execution path plus audit record.

Event: an immutable evolution log.

Experimental setup

All experiments used two fixed Gemini 3.1 models (Pro Preview and Flash Lite Preview) with temperature 0.05, max output 16,384 tokens, and sandbox‑based checkpoint‑pass rate as the metric. The study covered 45 scientific‑code scenarios (4,590 controlled runs) and the public CritPt benchmark.

Results: Skill vs Gene

Injected as a ~2,500‑token Skill package, the underlying experience scored 1.1 pp below the no‑guidance baseline.

Injected as a ~230‑token Gene, the same experience lifted performance by 3.0 pp on average.

Skill helped the weaker Flash model (41.8 % → 49.0 %) but severely regressed the stronger Pro model (60.1 % → 50.7 %).

Component ablation on the Skill showed that only the workflow segment contributed positively, while the overview segment produced the largest negative impact.

Progressive ablations of Gene fields demonstrated that the strategy layer is the primary driver of gains; removing it reduces Gene performance to baseline, whereas keywords + summary alone provide no benefit.

Robustness experiments

A “stale_paradigm” Gene that used an outdated algorithm achieved 56.6 % accuracy, outperforming a clean Gene (54.0 %). Replacing the algorithm with an incorrect one dropped performance to 48.8 %, and swapping to an unrelated domain dropped it to 49.4 %. This indicates that Gene effectiveness depends on preserving the correct control framework rather than on the freshness of the specific algorithm.

Failure encoding experiments

Embedding raw failure logs or full strategies into the model degraded performance.

Encoding failures as concise AVOID warnings consistently yielded the highest gains (e.g., “AVOID: passing min_distance in wavelength units to peak detection without first converting to sample‑index units”).

Combining a Gene with failure warnings reduced performance (54.0 % → 52.0 %), showing that even the compact Gene should not be overloaded with raw failure data.

CritPt benchmark results

Using the Evolver system (base model + Gene pool + evolution engine) on the CritPt benchmark produced the following improvements:

Model A: 9.1 % → 18.57 % (+9.47 pp) on 2026‑02‑16.

Model B: 17.7 % → 27.14 % (+9.44 pp) on 2026‑03‑26.

Token‑related cost dropped from roughly $100 to under $1, achieved without any parameter updates, SFT, or RL.

Implications

Genes demonstrate that the shape of the experience object—not its length or raw information density—is what determines its utility for agents operating under inference budget constraints. By structuring experience as a protocol‑level object (keywords, summary, strategy, AVOID) with immutable hashing, agents can efficiently match, replace, and evolve control signals across tasks and across multiple agents.

Key resources:

Paper: "From Procedural Skills to Strategy Genes: Towards Experience‑Driven Test‑Time Evolution" (arXiv:2604.15097)

Evolver engine: https://github.com/EvoMap/evolver

CritPt reproducibility repository: https://github.com/EvoMap/critpt-openclaw-reproducible-70
