How Memento-Skills Enables Continuous Learning for Frozen LLM Agents
This article examines the limitations of frozen LLM agents (fixed parameters, loss of state, and costly fine-tuning) and introduces the Memento-Skills framework, which attaches an external, evolvable skill memory to achieve deployment-time learning. It covers the framework's architecture, its independent optimization knobs, and its experimental gains.
Problem Statement
When large‑model agents are deployed with frozen parameters, three fundamental limitations arise:
Parameter fixation: model weights \(\theta\) cannot be updated after deployment, so any adaptation must rely solely on the input prompt.
State loss: each task is processed with a limited context window; the agent cannot retain knowledge of past successes or failures.
Fine-tuning paradox: updating the model requires dedicated data, high compute cost, and easily leads to over-fitting, making continuous improvement impractical.
Together, these constraints trap agents at the level of single, stateless executions.
Memento‑Skills Framework
The framework introduces a deployment‑time learning paradigm that leaves the frozen LLM unchanged while adding an external, mutable skill memory M. Experience is stored as engineered skills rather than as weight updates, enabling zero‑parameter, low‑cost continual improvement.
https://arxiv.org/pdf/2603.18743
Skill as Memory
A Skill is a structured artifact that contains everything needed to execute and evaluate a specific capability. Each skill consists of:
SKILL.md: a markdown specification describing intent, scope, and tool constraints.
State prompts: dynamic instructions injected into the LLM context based on the current task state.
Auxiliary scripts: executable Python code, tool-calling logic, and validation rules.
Additional metadata: boundary conditions, utility scores, and historical repair outcomes.
Skills are searchable, rewriteable, replaceable, and verifiable, forming the physical carrier of the agent’s evolving capabilities.
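The skill structure above can be sketched as a small data container. This is an illustrative sketch only; the class, field, and method names are assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    """Illustrative container for one skill artifact (all field names are assumptions)."""
    name: str
    spec_md: str                    # SKILL.md: intent, scope, tool constraints
    state_prompts: dict[str, str]   # task state -> instruction injected into the context
    scripts: dict[str, str]         # filename -> executable Python source
    utility: float = 0.0            # running utility score updated by the judge
    repair_log: list[str] = field(default_factory=list)  # historical repair outcomes

    def prompt_for(self, state: str) -> str:
        """Pick the prompt matching the current task state, falling back to the spec."""
        return self.state_prompts.get(state, self.spec_md)

skill = Skill(
    name="csv_summary",
    spec_md="Summarize a CSV file; only pandas is allowed.",
    state_prompts={"retry": "Previous attempt failed; re-check column names."},
    scripts={"run.py": "print('summary')"},
)
print(skill.prompt_for("retry"))  # state-specific instruction wins over the spec
```

Because every field is plain data, a skill can be searched, rewritten, replaced, or verified without touching model weights.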
Read‑Write Reflective Learning Loop
The system operates in a closed five‑step loop:
Observe: ingest a new task and merge it with the current system state.
Read: the skill router retrieves the most suitable skill from the skill store.
Act: the frozen LLM executes the retrieved code and prompts.
Feedback: a judge module evaluates the execution trace and determines success or failure.
Write: on success, the skill's utility score is increased; on failure, the skill is sent to the optimization pipeline.
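The five steps above amount to a simple control loop. The sketch below wires it together with toy stand-ins for the router, LLM, judge, and repair pipeline; all names and signatures are assumptions made for illustration.

```python
def reflective_loop(task, store, route, execute, judge, repair):
    """One pass of the Observe -> Read -> Act -> Feedback -> Write loop (illustrative)."""
    observation = {"task": task, "skills": store}         # Observe: task + current state
    skill = route(observation)                             # Read: pick a skill
    trace = execute(skill, observation)                    # Act: frozen LLM runs it
    if judge(trace):                                       # Feedback: success or failure
        skill["utility"] = skill.get("utility", 0) + 1     # Write: reinforce on success
    else:
        repair(skill, trace)                               # Write: send failures to repair
    return trace

# toy wiring to show the control flow end to end
store = [{"name": "search", "utility": 0}]
trace = reflective_loop(
    "find paper",
    store,
    route=lambda obs: obs["skills"][0],
    execute=lambda s, obs: {"skill": s["name"], "ok": True},
    judge=lambda t: t["ok"],
    repair=lambda s, t: None,
)
print(store[0]["utility"])  # a successful run bumps the skill's utility
```

Note that only the skill store mutates between iterations; the model itself stays frozen.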
Failure‑Driven Skill Optimization
If a task fails, the system performs a three‑step repair:
Diagnosis: pinpoint the responsible skill (failure attribution).
Rewrite / Variation: for minor issues, edit prompts or code; for major failures, discover or synthesize a new skill.
Verification: run automated unit-test gates in a sandbox; only passing modifications are merged into the global skill store.
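The repair path can be sketched as a guarded function: branch on failure severity, then gate the candidate on tests before merging. Every helper name here is an assumption, not the paper's API.

```python
def repair(skill, trace, is_minor, edit_skill, synthesize_skill, unit_tests):
    """Three-step failure repair: diagnose, rewrite or synthesize, then gate on tests.
    Illustrative sketch; diagnosis is assumed done (the failing skill is passed in)."""
    if is_minor(trace):
        candidate = edit_skill(skill, trace)        # Rewrite: patch prompts or code
    else:
        candidate = synthesize_skill(skill, trace)  # Variation: build a new skill
    # Verification gate: merge only if every sandboxed unit test passes
    if all(test(candidate) for test in unit_tests):
        return candidate   # accepted into the global skill store
    return skill           # candidate rejected; keep the old skill

# toy run: a minor failure gets a prompt patch that passes its one test
fixed = repair(
    {"prompt": "old"},
    trace={"error": "typo"},
    is_minor=lambda t: t["error"] == "typo",
    edit_skill=lambda s, t: {"prompt": "patched"},
    synthesize_skill=lambda s, t: {"prompt": "new"},
    unit_tests=[lambda s: s["prompt"] == "patched"],
)
print(fixed["prompt"])  # -> patched
```

The key design choice is that failed candidates never reach the shared store, so a bad repair cannot degrade other tasks.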
Behavioral Routing Model
Instead of pure semantic similarity, the router predicts the execution success probability of a skill. Training involves:
Synthesizing roughly 3,000 seed skills with LLM-generated queries, creating positive and hard-negative samples.
Offline reinforcement learning using an InfoNCE contrastive loss that maximizes the predicted success rate.
Producing a behavior‑aligned embedding model (named Memento‑Qwen) whose vector space reflects execution effectiveness rather than textual similarity.
A temperature parameter τ balances exploration (large τ) and exploitation (small τ).
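The temperature-controlled selection can be sketched as softmax sampling over skill scores. The scores here are plain dot products standing in for the router's predicted success probabilities; the function and field names are assumptions for illustration.

```python
import math
import random

def route(query_vec, skills, tau=0.5, rng=random.Random(0)):
    """Sample a skill with probability proportional to exp(score / tau).
    Larger tau flattens the distribution (exploration); smaller tau is near-greedy."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = [dot(query_vec, s["vec"]) for s in skills]
    m = max(scores)
    weights = [math.exp((sc - m) / tau) for sc in scores]  # max-shifted for stability
    r = rng.random() * sum(weights)
    for skill, w in zip(skills, weights):
        r -= w
        if r <= 0:
            return skill
    return skills[-1]

skills = [{"name": "web", "vec": [1.0, 0.0]}, {"name": "math", "vec": [0.0, 1.0]}]
picked = route([0.9, 0.1], skills, tau=0.1)  # small tau -> exploit the best match
print(picked["name"])  # -> web
```

With a behavior-aligned embedding, the same mechanism ranks skills by expected execution success rather than textual similarity.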
Experimental Results
Evaluation was performed on two benchmarks using Gemini‑3.1‑Flash as the base model:
GAIA (real-world multi-task): score improved from 52.3 to 66.0 (+13.7 points, +26.2%); success rate rose from 65.1% to 91.6%.
HLE (high-difficulty academic tasks): score rose from 17.9 to 38.7 (+20.8 points, +116.2%); success rate increased from 30.8% to 54.5%.
The skill library grew from 5 atomic skills to 41 (GAIA) and 235 (HLE), forming domain‑specific clusters in the embedding space, indicating genuine knowledge consolidation.
Three Independent Optimization Knobs
Stronger LLM: improve the base model's generalization, reducing intrinsic error \(\epsilon_{LLM}\).
Denser Skill Library: increase coverage radius \(r_{M}\) so the agent can rely more on mature skills.
Better Routing Embedding: lower retrieval error \(\delta_{M}\) for more accurate skill matching.
Each knob can be tuned separately without retraining the entire system.
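The independence of the knobs can be made concrete as three config fields that are swapped one at a time. This is a sketch of the idea only; the class, field names, and values are assumptions, not the framework's configuration format.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AgentConfig:
    """Three independent optimization knobs (illustrative names)."""
    base_model: str    # stronger LLM lowers intrinsic error eps_LLM
    skill_count: int   # denser library raises coverage radius r_M
    embed_model: str   # better embedding lowers retrieval error delta_M

cfg = AgentConfig(base_model="base-llm", skill_count=41, embed_model="embed-v1")
# swap one knob without touching the others or retraining anything
upgraded = replace(cfg, embed_model="embed-v2")
print(upgraded.base_model, upgraded.skill_count, upgraded.embed_model)
```

Because each knob maps to a separate component (model, store, router), improving one never forces a change to the other two.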
System Architecture
The architecture consists of three layers:
Entry layer: CLI/GUI interfaces for task intake.
Middle layer: context compression, draft management, and secure tool sandboxing.
Skill core: SkillStore, MultiRecall, UvSandbox, and SkillGateway enable hot-plugging of skills without modifying core code.
This decouples agent capabilities from the core system, ensuring maintainability while supporting continuous evolution.
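The hot-plugging idea can be sketched as a runtime registry: skills register handlers by name, and the core only dispatches. The class and method names below are illustrative, not the framework's real SkillGateway API.

```python
class SkillGateway:
    """Minimal hot-plug registry sketch: skills attach at runtime, core code never changes."""

    def __init__(self):
        self._skills = {}

    def register(self, name, handler):
        """Add or replace a skill handler without restarting the system."""
        self._skills[name] = handler

    def dispatch(self, name, payload):
        """Route a request to the registered skill, failing loudly if none exists."""
        if name not in self._skills:
            raise KeyError(f"no skill registered for {name!r}")
        return self._skills[name](payload)

gw = SkillGateway()
gw.register("echo", lambda p: p.upper())
print(gw.dispatch("echo", "hot-plugged"))  # -> HOT-PLUGGED
gw.register("echo", lambda p: p[::-1])     # hot-swap the same skill in place
print(gw.dispatch("echo", "abc"))          # -> cba
```

Dispatching through a single gateway is what lets the skill store evolve while the entry and middle layers stay fixed.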
Key Takeaways
Memento‑Skills demonstrates that frozen LLM agents can achieve sustained capability growth through an external skill memory, zero‑parameter updates, and a white‑box optimization pipeline. The approach shifts lifelong learning from weight fine‑tuning to engineering‑level experience storage, offering a practical path for production‑grade agents.
