Can Frozen LLMs Keep Learning? Inside Memento‑Skills' Deployment‑Time Learning
This article analyses the Memento‑Skills paper and its open‑source implementation: a frozen large language model improves continuously by treating skills as external memory, combining a five‑step Observe‑Read‑Act‑Feedback‑Write loop, learned routing, and a modular architecture to deliver significant gains on the GAIA and HLE benchmarks.
Background
The paper Memento‑Skills: Let Agents Design Agents investigates whether a frozen large language model (LLM) can continue to improve by externalizing learning into a mutable skill memory M instead of updating model weights θ.
Deployment‑time Learning
Three learning paradigms are compared:
Pre‑training updates θ with massive token corpora.
Fine‑tuning updates θ on task‑specific data, incurring high cost and forgetting.
Deployment‑time learning keeps θ frozen and grows an external skill library with near‑zero compute cost.
Five‑step Closed Loop
The system operates in a repeatable cycle:
Observe : receive a task and form the current state.
Read : the skill router retrieves the most relevant skill from the library.
Act : the frozen LLM executes the task using the retrieved skill.
Feedback : a judge evaluates success or failure.
Write : failure attribution, targeted rewrite, or skill discovery updates the skill artefact.
The Write step is the system’s “heartbeat”, turning reflection into auditable file changes.
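The five-step cycle can be sketched as a single function. Everything here — the callable interfaces, the dict-based state, the function names — is illustrative scaffolding, not the paper's or repository's actual code:

```python
def deployment_loop(task, library, retrieve, act, judge, write):
    """One pass of the Observe-Read-Act-Feedback-Write cycle (sketch).

    `retrieve`, `act`, `judge`, and `write` stand in for the skill
    router, the frozen LLM, the judge, and the write-back mechanism.
    """
    state = {"task": task}            # Observe: form the current state
    skill = retrieve(state, library)  # Read: pick the most relevant skill
    result = act(state, skill)        # Act: frozen LLM executes with the skill
    success = judge(task, result)     # Feedback: judge success or failure
    if not success:                   # Write: only failures trigger updates,
        write(library, skill, result) # keeping changes targeted and auditable
    return result, success
```

Note that the model weights never appear in the update path: only `library` is mutable, which is the whole point of deployment-time learning.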
Skill Artefacts
A skill is stored as a self‑contained artefact containing: SKILL.md – descriptive specification.
Stateful prompts bound to the current context.
Helper scripts with executable code and tool‑call logic.
Boundary conditions, utility scores, and historical patches.
Thus, skill = memory, not just a static prompt.
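A minimal sketch of such an artefact as a Python dataclass — the field names and the patch-recording method are assumptions for illustration, not the repository's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class SkillArtefact:
    """Illustrative in-memory view of one self-contained skill."""
    spec_md: str                 # SKILL.md: descriptive specification
    prompts: list[str]           # stateful prompts bound to current context
    helpers: dict[str, str]      # helper scripts: filename -> executable code
    boundaries: list[str]        # boundary conditions the skill honours
    utility: float = 0.0         # running utility score
    patches: list[str] = field(default_factory=list)  # historical patches

    def record_patch(self, diff: str, delta: float) -> None:
        """Append an audit entry and adjust the utility score."""
        self.patches.append(diff)
        self.utility += delta
```

Because the patch history lives inside the artefact, every rewrite in the Write step leaves an auditable trail alongside the skill itself.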
Intelligent Routing
Routing is framed as a decision problem. The authors generate synthetic queries for ~3k seed skills, train a single‑step offline RL model with an InfoNCE loss, and obtain a behavior‑aligned embedding model (Memento‑Qwen). A temperature parameter τ balances exploitation (small τ) and exploration (large τ).
Evaluation metrics include:
Recall@K for offline retrieval.
Route‑hit rate and judge‑success rate for end‑to‑end performance.
Results:
Recall@1 improves from 0.54 to 0.60.
Judge success rate rises from 0.50 to 0.80.
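The temperature-controlled retrieval described above can be illustrated with a plain softmax sampler over similarity scores. The embeddings, the dot-product scoring, and the `route` function are all assumptions for this sketch, not the paper's implementation:

```python
import math
import random

def route(query_emb, skill_embs, tau=0.5, seed=None):
    """Sample a skill index from a softmax over similarity scores.

    Small tau sharpens the distribution (exploit the top match);
    large tau flattens it (explore alternatives).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = [dot(query_emb, e) for e in skill_embs]
    m = max(s / tau for s in scores)                  # subtract max for stability
    weights = [math.exp(s / tau - m) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]

    rng = random.Random(seed)
    r, acc = rng.random(), 0.0                        # inverse-CDF sampling
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i, probs
    return len(probs) - 1, probs
```

With a very small τ the sampler is effectively greedy; raising τ lets lower-ranked skills occasionally win, which is how the library gets usage signal on more than just the top match.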
Benchmark Gains
On the GAIA benchmark, the score rises from 52.3 to 66.0 (≈26 % relative, +13.7 pp). On the harder HLE benchmark, it jumps from 17.9 to 38.7 (≈116 % relative, +20.8 pp). Skill libraries grow from 5 seed skills to 41 (GAIA) and 235 (HLE), forming clear clusters in the embedding space.
Theoretical Decomposition
The performance gap to an optimal strategy can be split into three independent knobs:
Stronger LLM – reduces intrinsic error ε_LLM.
Denser Skill Library – shrinks the memory radius r_M, so less on‑the‑fly generalization is required.
Better Router Embedding – lowers routing error δ_M.
Each knob can be improved separately, making the system truly modular.
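One way to write such an additive decomposition is the bound below; the constant C and the exact functional form are assumptions of this sketch, and the paper may state the result differently:

```latex
% Sketch: total error vs. an optimal strategy splits into three terms,
% one per knob. C is an assumed smoothness constant relating the
% memory radius r_M to generalization error.
\mathcal{E}(\text{system}) \;\le\;
\underbrace{\varepsilon_{\mathrm{LLM}}}_{\text{intrinsic model error}}
\;+\;
\underbrace{C \, r_{M}}_{\text{memory coverage gap}}
\;+\;
\underbrace{\delta_{M}}_{\text{routing error}}
```

The modularity claim follows directly: each term is controlled by a different component (model, library, router), so improving one does not require touching the others.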
System Architecture
The architecture is layered:
Entry Layer : CLI / GUI.
Middle Layer : context compression, draft management, tool sandbox with safety policies.
Skill System Core : SkillStore → MultiRecall → UvSandbox → SkillGateway, where SkillGateway provides hot‑plug skill loading, avoiding hard‑coded if‑else branches.
This design prevents the system from devolving into a monolithic rule base as the skill set expands.
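The hot-plug idea behind SkillGateway can be sketched as a runtime registry; this is an illustrative minimal version, not the repository's actual SkillGateway interface:

```python
from typing import Callable, Dict

class SkillGateway:
    """Registry-based dispatch: skills register themselves at runtime,
    so adding or removing one never touches a central if-else chain."""

    def __init__(self) -> None:
        self._registry: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, handler: Callable[[str], str]) -> None:
        self._registry[name] = handler     # hot-plug: load without restart

    def unregister(self, name: str) -> None:
        self._registry.pop(name, None)     # hot-unplug, idempotent

    def dispatch(self, name: str, task: str) -> str:
        if name not in self._registry:
            raise KeyError(f"no skill registered for {name!r}")
        return self._registry[name](task)
```

Because dispatch is a dictionary lookup, the gateway's cost and complexity stay flat as the library grows from 5 skills to hundreds.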
Write‑back Mechanism
When a task fails, the system performs:
Failure Attribution : pinpoint the responsible skill.
Targeted Rewrite : modify prompts or code.
Skill Discovery : synthesize a new skill if the existing one is no longer effective.
Unit‑test Gate : only merge changes that pass automated sandbox tests; otherwise roll back.
This ensures continuous learning remains clean and auditable.
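The gate-and-rollback behaviour can be sketched in a few lines, with `rewrite` and `run_tests` as hypothetical callables standing in for the targeted rewrite and the sandboxed unit tests:

```python
import copy

def gated_write_back(library, skill_name, rewrite, run_tests):
    """Apply a rewrite only if the sandbox tests pass; otherwise roll back.

    `rewrite` mutates one skill entry in place; `run_tests` returns
    True on success. Names are illustrative, not the repository's API.
    """
    snapshot = copy.deepcopy(library[skill_name])  # keep a rollback point
    rewrite(library[skill_name])                   # targeted rewrite in place
    if run_tests(library[skill_name]):             # unit-test gate
        return True                                # merge: keep the change
    library[skill_name] = snapshot                 # roll back on failure
    return False
```

The snapshot-then-test discipline is what keeps the library clean: a bad rewrite can never persist, and every merged change has passed the same gate.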
References
Paper: Memento‑Skills: Let Agents Design Agents – https://arxiv.org/pdf/2603.18743
Repository: https://github.com/Memento-Teams/Memento-Skills