How Memento‑Skills Enables Self‑Evolving LLMs Without Fine‑Tuning

Introducing Memento‑Skills, a framework that keeps LLM parameters frozen while an external skill library is iteratively read, written, and refined, achieving up to 116% relative accuracy gains on the GAIA and HLE benchmarks and demonstrating scalable self‑evolution without costly model fine‑tuning.

The Memento‑Skills system proposes a radical shift for large language models (LLMs): instead of fine‑tuning billions of parameters, the model remains frozen and continuously improves by reading from and writing to an external skill library. This read‑write loop enables the agent to self‑evolve, doubling accuracy on challenging benchmarks.

Farewell to Expensive Fine‑Tuning

Traditional LLM deployment keeps model weights fixed after pre‑training, relying solely on prompts and limited context for adaptation. Scaling compute yields diminishing returns, and simple prompt‑based adjustments cannot prevent repeated mistakes. Memento‑Skills replaces this paradigm with a State‑Reflexive Decision Process (SRDP) that grows a dynamic skill memory.

Each skill combines declarative specifications, prompts, and executable code, forming a personal “file cabinet” the model can query and modify. The system’s heartbeat is a closed‑loop process: for a new task, the agent retrieves the most relevant skill via a skill router, executes it step‑by‑step, and then writes back reflections and updates based on feedback.
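
To make the loop concrete, here is a minimal Python sketch of a skill record and one read‑execute‑write pass. The field names and the `router.select` / `executor.run` interfaces are illustrative assumptions, not the system's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    """One entry in the external skill library: spec + prompt + code."""
    name: str
    spec: str        # declarative specification of what the skill does
    prompt: str      # instruction template injected into the LLM's context
    code: str        # executable implementation, e.g. a Python tool snippet
    reflections: list[str] = field(default_factory=list)  # feedback written back after runs

def solve(task: str, library: list[Skill], router, executor) -> str:
    """One pass of the read-execute-write loop."""
    skill = router.select(task, library)           # read: retrieve the most relevant skill
    result, feedback = executor.run(skill, task)   # execute the skill step by step
    skill.reflections.append(feedback)             # write: persist reflections for refinement
    return result
```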

If execution fails, a fault‑attribution selector examines the full trajectory, pinpoints the offending skill, and the skill rewriter proposes file‑level updates or entirely new strategies while preserving the original skill’s generality. All modifications pass automated unit tests and a comprehensive integration test before being deployed, with immediate rollback on failure.
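
The test‑gated deployment step might look roughly like the sketch below: a rewritten skill replaces the old version only if every unit test and the integration test pass; otherwise the backup is restored. The function and test signatures here are hypothetical.

```python
import copy

def deploy_update(library: dict, name: str, new_skill, unit_tests, integration_test) -> bool:
    """Swap in a rewritten skill only if all tests pass; roll back immediately otherwise."""
    backup = copy.deepcopy(library[name])          # keep the old version for rollback
    library[name] = new_skill
    try:
        if all(test(new_skill) for test in unit_tests) and integration_test(library):
            return True                            # every check passed: keep the update
    except Exception:
        pass                                       # a crash during testing counts as failure
    library[name] = backup                         # rollback: restore the previous version
    return False
```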

Behavior Alignment for Precise Tool Matching

As the skill library expands, naive semantic routing based on surface similarity leads to mismatches (e.g., routing a refund request to a password‑reset skill). Pure end‑to‑end reinforcement learning suffers from an enormous exploration space. The team therefore adopted a single‑step offline RL framework, generating dense positive queries and hard negative samples using the LLM as a multi‑dimensional simulator.
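
The data‑generation step could be sketched as follows, with a placeholder `llm` callable standing in for the multi‑dimensional simulator; the team's actual prompts and simulation dimensions are not reproduced here.

```python
def build_router_dataset(skills, llm, n_pos=8, n_neg=8):
    """Generate (query, skill, label) triples for router training.

    `llm` is a placeholder callable that takes a prompt and returns
    a list of generated query strings.
    """
    dataset = []
    for skill in skills:
        positives = llm(
            f"Write {n_pos} diverse user queries this skill should handle:\n{skill.spec}"
        )
        hard_negatives = llm(
            f"Write {n_neg} queries that sound similar but should NOT be "
            f"routed to this skill:\n{skill.spec}"
        )
        dataset += [(q, skill.name, 1) for q in positives]       # rewarding matches
        dataset += [(q, skill.name, 0) for q in hard_negatives]  # semantically close misses
    return dataset
```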

InfoNCE loss guides the router to boost probabilities of rewarding skills while suppressing semantically similar but useless ones. Simple temperature tuning lets engineers balance exploitation versus exploration of the skill space.
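
For reference, the standard InfoNCE objective over one query, its rewarding skill, and a batch of hard negatives fits in a few lines of PyTorch, with the temperature parameter acting as the exploitation/exploration knob described above. This is the textbook formulation, not the team's exact training code.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb, pos_emb, neg_embs, temperature=0.07):
    """InfoNCE loss for one query.

    query_emb: (d,)   embedding of the user query
    pos_emb:   (d,)   embedding of the rewarding skill
    neg_embs:  (k, d) embeddings of semantically similar but useless skills

    A lower temperature sharpens the softmax (more exploitation of the
    best match); a higher temperature flattens it (more exploration).
    """
    q = F.normalize(query_emb, dim=-1)
    cands = F.normalize(torch.cat([pos_emb.unsqueeze(0), neg_embs]), dim=-1)
    logits = cands @ q / temperature       # scaled cosine similarities, shape (k+1,)
    target = torch.tensor([0])             # the positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)
```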

On a 140‑query recall benchmark, the behavior‑aligned router achieved a top‑1 recall of 0.60, nearly double the 0.32 BM25 baseline. In end‑to‑end execution, success rates rose from 50% to 80%.
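
Top‑1 recall here simply measures how often the router's highest‑ranked skill matches the labeled gold skill. A minimal evaluation sketch, assuming the same hypothetical `router.select` interface as the earlier snippets:

```python
def top1_recall(labeled_queries, router, library):
    """Fraction of (query, gold_skill_name) pairs the router ranks correctly at top-1."""
    hits = sum(router.select(q, library).name == gold for q, gold in labeled_queries)
    return hits / len(labeled_queries)
```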

Cross‑Domain Transfer and Knowledge Clustering

The system was evaluated on two demanding benchmarks: GAIA (General AI Assistants) and HLE (Humanity's Last Exam). GAIA requires multi‑step reasoning, web browsing, and file operations; HLE spans eight academic domains, including math, physics, and the humanities.

During GAIA training, the skill base grew from 5 atomic search‑like skills to 41 specialized skills. For HLE, the knowledge base expanded to 235 skill modules. Overall training success climbed from 65.1% to 91.6%.

On the HLE benchmark, the system achieved an overall accuracy of 38.7%, more than double the 17.9% of a read‑only control group. Notably, accuracy in the humanities and biology surged to 66.7% and 60.7% respectively, illustrating effective skill transfer across domains.

Three Independent Knobs Behind Convergence

Performance curves on both benchmarks show classic diminishing returns: rapid early gains that plateau later. Early skill additions act like new villages on a frontier, providing large immediate benefits. Continuous skill‑refinement fills knowledge gaps, but as the skill “city” becomes dense, marginal gains shrink.

The team identified three orthogonal knobs for future improvement: upgrading the underlying LLM for stronger reasoning, increasing the number of reasoning rounds to expand skill coverage, and enhancing the vector embedding architecture to reduce retrieval errors.

In summary, Memento‑Skills demonstrates that a frozen LLM can achieve substantial self‑improvement through an external, iteratively refined skill library, offering a cost‑effective alternative to traditional fine‑tuning while scaling across diverse tasks and domains.
