Can Frozen LLMs Keep Learning? Inside Memento‑Skills' Deployment‑Time Learning

This article analyses the Memento‑Skills paper and its open‑source implementation, showing how a frozen large language model can keep improving by treating skills as external memory. A five‑step Observe‑Read‑Act‑Feedback‑Write loop, learned routing, and a modular architecture together yield significant gains on the GAIA and HLE benchmarks.


Background

The paper Memento‑Skills: Let Agents Design Agents investigates whether a frozen large language model (LLM) can continue to improve by externalizing learning into a mutable skill memory M instead of updating model weights θ.

Deployment‑time Learning

Three learning paradigms are compared:

Pre‑training updates θ with massive token corpora.

Fine‑tuning updates θ on task‑specific data, incurring high compute cost and risking catastrophic forgetting.

Deployment‑time learning keeps θ frozen and grows an external skill library with near‑zero compute cost.

Five‑step Closed Loop

The system operates in a repeatable cycle:

Observe : receive a task and form the current state.

Read : the skill router retrieves the most relevant skill from the library.

Act : the frozen LLM executes the task using the retrieved skill.

Feedback : a judge evaluates success or failure.

Write : failure attribution, targeted rewrite, or skill discovery updates the skill artefact.

The Write step is the system’s “heartbeat”, turning reflection into auditable file changes.
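The cycle above can be sketched in plain Python. All names here (SkillLibrary, deployment_time_loop, the toy keyword router) are illustrative stand‑ins, not the paper's actual API:

```python
class SkillLibrary:
    """Toy in-memory skill store standing in for the external skill memory M."""

    def __init__(self, skills):
        self.skills = dict(skills)   # skill name -> prompt/spec text
        self.patches = []            # audit trail of Write-step changes

    def read(self, state):
        # Read: naive routing by keyword overlap between skill name and task.
        task = state["task"]
        return max(self.skills, key=lambda name: sum(w in task for w in name.split("_")))

    def write(self, skill, state):
        # Write: record a targeted rewrite as an auditable patch; θ stays frozen.
        self.patches.append(f"rewrote {skill} after failing on: {state['task']}")
        self.skills[skill] += "\n# patched"


def deployment_time_loop(task, library, llm_act, judge):
    state = {"task": task}                 # Observe: form the current state
    skill = library.read(state)            # Read: router picks a skill
    result = llm_act(state, skill)         # Act: frozen LLM executes with the skill
    success = judge(task, result)          # Feedback: judge the outcome
    if not success:
        library.write(skill, state)        # Write: update the skill artefact
    return result, success
```

A failing task leaves a patch in `library.patches` while the model itself is untouched, which is exactly the "auditable file changes" property the article highlights.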

Skill Artefacts

A skill is stored as a self‑contained artefact containing:

SKILL.md – descriptive specification.

Stateful prompts bound to the current context.

Helper scripts with executable code and tool‑call logic.

Boundary conditions, utility scores, and historical patches.

Thus, skill = memory, not just a static prompt.
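As a sketch, such an artefact can be modelled as a small record whose fields mirror the list above. The schema is hypothetical, not the repository's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class SkillArtefact:
    """Hypothetical schema mirroring the artefact fields described above."""
    name: str
    skill_md: str                                    # SKILL.md descriptive specification
    prompts: dict = field(default_factory=dict)      # stateful prompts bound to context
    helpers: list = field(default_factory=list)      # executable helper scripts / tool calls
    boundaries: list = field(default_factory=list)   # boundary conditions
    utility: float = 0.0                             # running utility score
    patches: list = field(default_factory=list)      # historical patches (audit trail)

    def apply_patch(self, description, new_prompt=None):
        # A Write-step edit mutates the artefact file, never the model weights.
        self.patches.append(description)
        if new_prompt is not None:
            self.prompts["current"] = new_prompt
```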

Intelligent Routing

Routing is framed as a decision problem. The authors generate synthetic queries for ~3k seed skills, train a single‑step offline RL model with an InfoNCE loss, and obtain a behavior‑aligned embedding model (Memento‑Qwen). A temperature parameter τ balances exploitation (small τ) and exploration (large τ).
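The role of τ can be illustrated with a plain softmax router over precomputed similarity scores. The trained Memento‑Qwen embedding is replaced here by hard‑coded similarities, so only the exploration/exploitation trade‑off is shown:

```python
import math
import random

def route(similarities, tau, rng=None):
    """Sample a skill index from softmax(similarity / tau)."""
    rng = rng or random.Random(0)
    logits = [s / tau for s in similarities]
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Small tau sharpens the distribution toward the best match (exploitation);
    # large tau flattens it, giving weaker matches a chance (exploration).
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs
```

With similarities `[0.9, 0.5, 0.1]`, a small τ concentrates nearly all probability on the top skill, while a large τ makes the three skills almost equally likely.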

Evaluation metrics include:

Recall@K for offline retrieval.

Route‑hit rate and judge‑success rate for end‑to‑end performance.
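Recall@K is straightforward to compute from a retrieval log. The sketch below uses toy data rather than the paper's evaluation set:

```python
def recall_at_k(ranked_lists, gold, k):
    """Fraction of queries whose gold skill appears in the top-K routed candidates."""
    hits = sum(g in ranked[:k] for ranked, g in zip(ranked_lists, gold))
    return hits / len(gold)
```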

Results:

Recall@1 improves from 0.54 to 0.60.

Judge success rate rises from 0.50 to 0.80.

Benchmark Gains

On the GAIA benchmark, the score rises from 52.3 to 66.0 (≈26 % relative, +13.7 pp). On the harder HLE benchmark, it jumps from 17.9 to 38.7 (≈116 % relative, +20.8 pp). Skill libraries grow from 5 seed skills to 41 (GAIA) and 235 (HLE), forming clear clusters in the embedding space.

Theoretical Decomposition

The performance gap to an optimal strategy can be split into three independent knobs:

Stronger LLM – reduces intrinsic error ε_LLM.

Denser Skill Library – reduces the memory radius r_M, so queries land closer to stored skills and less behaviour must be generalized on the fly.

Better Router Embedding – lowers routing error δ_M.

Each knob can be improved separately, making the system truly modular.
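Informally, the decomposition can be written as a three‑term bound. The exact statement and constants are in the paper; the form below is only a sketch, with L a sensitivity constant introduced here for illustration:

```latex
\mathrm{Err}(\text{system}) \;\lesssim\;
  \underbrace{\varepsilon_{\mathrm{LLM}}}_{\text{intrinsic model error}}
  \;+\;
  \underbrace{L \cdot r_{M}}_{\text{memory coverage gap}}
  \;+\;
  \underbrace{\delta_{M}}_{\text{routing error}}
```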

System Architecture

The architecture is layered:

Entry Layer : CLI / GUI.

Middle Layer : context compression, draft management, tool sandbox with safety policies.

Skill System Core : SkillStore → MultiRecall → UvSandbox → SkillGateway, where SkillGateway provides hot‑plug skill loading, avoiding hard‑coded if‑else branches.

This design prevents the system from devolving into a monolithic rule base as the skill set expands.
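The hot‑plug idea behind SkillGateway can be sketched as a registry that dispatches by name, so adding a skill never touches existing control flow. Class and method names here are illustrative, not the repository's API:

```python
class SkillGateway:
    """Toy hot-plug gateway: dispatch is a table lookup, not an if/else chain."""

    def __init__(self):
        self._registry = {}

    def register(self, name):
        # Decorator that plugs a skill in at runtime.
        def decorator(fn):
            self._registry[name] = fn
            return fn
        return decorator

    def dispatch(self, name, *args, **kwargs):
        if name not in self._registry:
            raise KeyError(f"no skill registered under {name!r}")
        return self._registry[name](*args, **kwargs)
```

Registering a new skill is a single decorator call, so the dispatch path stays constant as the library grows from 5 seed skills to hundreds.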

Write‑back Mechanism

When a task fails, the system performs:

Failure Attribution : pinpoint the responsible skill.

Targeted Rewrite : modify prompts or code.

Skill Discovery : synthesize a new skill if the existing one is no longer effective.

Unit‑test Gate : only merge changes that pass automated sandbox tests; otherwise roll back.

This ensures continuous learning remains clean and auditable.
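The gate itself reduces to "patch a copy, test it, and only then replace the original". The sketch below assumes a skill is a plain dict and tests are predicates, a simplification of the sandboxed unit tests:

```python
import copy

def gated_merge(skill, patch, tests):
    """Apply `patch` to a copy of `skill`; merge only if every test passes."""
    candidate = copy.deepcopy(skill)
    patch(candidate)                              # targeted rewrite on the copy
    if all(test(candidate) for test in tests):
        return candidate, True                    # merge the patched artefact
    return skill, False                           # roll back to the old version
```

Because the patch is applied to a deep copy, a failing test suite leaves the stored skill byte‑for‑byte unchanged, which is what keeps the audit trail clean.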

References

Paper: Memento‑Skills: Let Agents Design Agents – https://arxiv.org/pdf/2603.18743

Repository: https://github.com/Memento-Teams/Memento-Skills

Tags: LLM · Agent · AI Architecture · continuous learning · deployment-time learning · skill memory
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
