How XSKILL Lets Multimodal AI Agents Learn Without Updating Parameters

XSKILL introduces a dual‑stream framework that separates task‑level skills, stored as Markdown, from action‑level experiences, stored as JSON. By extracting, summarizing, and reusing knowledge from past trajectories, it enables multimodal large language model agents to improve continuously without modifying model parameters, yielding significant gains on visual tool, multimodal search, and integrated benchmarks.

Dual‑Stream Design of Skills and Experience

XSKILL unifies task‑level skills and action‑level experiences in a two‑stream architecture. Skills are kept as Markdown documents that provide structured workflows and reusable tool templates for specific tasks. Experiences are stored as JSON records containing trigger conditions, recommended actions, and semantic embeddings, offering concise guidance for local decisions such as tool selection, exploration strategy, and error recovery. The combination allows skills to guarantee robust execution while experiences steer strategic choices.
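As a concrete illustration, here is a minimal Python sketch of what the two stores might contain. The schema and every field name are assumptions for illustration only; the paper's actual formats may differ.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one action-level experience entry (the JSON stream).
# All field names are illustrative assumptions, not the paper's actual format.
@dataclass
class Experience:
    trigger: str                  # condition under which the advice applies
    action: str                   # recommended local action
    embedding: list[float] = field(default_factory=list)  # vector for retrieval

# Task-level skills are plain Markdown: a structured workflow plus tool templates.
SKILL_MD = """\
# Skill: chart-question answering
## Workflow
1. Crop the chart region from the screenshot.
2. Run the code interpreter to read off axis values.
3. Cross-check the answer against the visible legend.
## Tool template
`code_interpreter(code=...)`
"""

exp = Experience(
    trigger="task asks for a numeric value hidden in a dense chart",
    action="prefer the code interpreter over direct visual estimation",
)
```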

[Figure: Illustration of the dual‑stream knowledge architecture]

Knowledge Accumulation and Reasoning Phases

The system operates in two stages. During the accumulation phase, each training task is executed multiple times, generating diverse trajectories. Visually grounded extraction and summarization then distills these trajectories into skill fragments and experience entries. A cross‑trajectory critique mechanism further refines the knowledge into generalized form by comparing successful and failed attempts.
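A rough sketch of that loop might look as follows, assuming trajectory summarization and cross‑trajectory critique are delegated to LLM calls. Every helper name here is a hypothetical placeholder, not the paper's API.

```python
def accumulate(task, agent, llm, n_rollouts=4):
    """Hypothetical sketch of the accumulation phase. `agent` and `llm`
    are assumed interfaces; every helper name is a placeholder."""
    # Execute the training task several times to get diverse trajectories.
    trajectories = [agent.run(task) for _ in range(n_rollouts)]

    # Extract and summarize each trajectory into candidate knowledge.
    skills, experiences = [], []
    for traj in trajectories:
        skills.append(llm.summarize_skill(traj))           # Markdown fragment
        experiences.extend(llm.extract_experiences(traj))  # JSON entries

    # Cross-trajectory critique: contrast successes with failures to keep
    # only knowledge that generalizes across attempts.
    wins = [t for t in trajectories if t.success]
    losses = [t for t in trajectories if not t.success]
    return llm.critique_and_refine(skills, experiences, wins, losses)
```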

In the reasoning phase, the agent dynamically retrieves relevant knowledge for a test task. The process includes task decomposition into sub‑queries, context‑aware visual adaptation, and non‑canonical injection of retrieved knowledge. An experience rewriter adapts conditions to the current visual state, while a skill adapter prunes irrelevant sections and integrates rewritten experiences into workflow steps. Executed skills and experiences are logged, forming a usage history that feeds back into the accumulation stage for continual refinement.
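The reasoning phase could be sketched along similar lines, assuming embedding‑similarity retrieval (the article only says experiences carry semantic embeddings); `llm`, `embed`, and all method names are again illustrative, not the paper's verified procedure.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_and_adapt(task, skills, experiences, llm, embed, top_k=3):
    """Hypothetical sketch of the reasoning phase."""
    # 1. Decompose the test task into focused sub-queries.
    sub_queries = llm.decompose(task)

    # 2. Rank stored experiences against each sub-query by embedding similarity.
    hits = []
    for q in sub_queries:
        qv = embed(q)
        ranked = sorted(experiences,
                        key=lambda e: cosine(qv, e.embedding), reverse=True)
        hits.extend(ranked[:top_k])

    # 3. Experience rewriter: adapt trigger conditions to the current visual state.
    rewritten = [llm.rewrite(e, task.current_screenshot) for e in hits]

    # 4. Skill adapter: prune irrelevant sections of the retrieved skill and
    #    weave the rewritten experiences into its workflow steps.
    skill = llm.select_skill(task, skills)
    return llm.adapt_skill(skill, rewritten)
```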

[Figure: Diagram of the accumulation and reasoning stages]

Experimental Validation and In‑Depth Analysis

The framework was evaluated on five benchmarks spanning three domains: VisualToolBench and TIR‑Bench for visual tool usage, MMSearch‑Plus and MMBrowseComp for multimodal search, and AgentVista as an integrated challenge. For each benchmark, 100 tasks formed a training set for experience accumulation, with the remaining tasks used for evaluation. Baselines included Agent Workflow Memory, Dynamic Cheat Sheet, and Agent‑KB. Four backbone models—Gemini‑2.5‑Pro, Gemini‑3‑Flash, GPT‑5‑mini, and o4‑mini—were tested.

Two success‑rate metrics served as the primary measures: the average success rate over four independent runs, and the proportion of tasks solved in at least one run. XSKILL consistently outperformed baselines by 2.58–6.71 percentage points across models, with notable gains on complex visual reasoning tasks. For example, on TIR‑Bench using Gemini‑3‑Flash, XSKILL achieved a 47.75% average success rate, surpassing the strongest baseline (Agent‑KB) by 11.13 points. Knowledge transfer experiments showed that even without parameter updates, XSKILL improved GPT‑5‑mini and o4‑mini by 2.58–4.16 points, confirming cross‑model effectiveness.
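For concreteness, both metrics reduce to simple arithmetic over a tasks‑by‑runs outcome matrix; the data below is a toy example, not from the paper.

```python
import numpy as np

# Toy outcome matrix: each row is a task, each column one of four runs;
# 1 = success, 0 = failure. Values are illustrative only.
outcomes = np.array([
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
])

avg_success = outcomes.mean()              # average success rate over all runs
any_success = outcomes.max(axis=1).mean()  # share of tasks solved at least once

print(f"avg: {avg_success:.2%}, at-least-one: {any_success:.2%}")
# avg: 58.33%, at-least-one: 66.67%
```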

Ablation studies revealed that removing the skill stream or the experience stream reduced performance by 3.04 and 3.85 points respectively, underscoring their complementary roles. The accumulation components contributed more to overall performance than the reasoning components. Detailed error analysis showed that skills dramatically lowered execution errors (the overall error rate dropped from 29.9% to 15.3%, syntax errors from 20.3% to 11.4%, and tool‑name errors were nearly eliminated), while experiences improved tool selection, raising code‑interpreter usage on VisualToolBench from 66.63% to 76.97%.

[Figure: Ablation results chart]

Conclusions and Future Directions

XSKILL demonstrates that multimodal AI agents can achieve continuous improvement by externalizing knowledge into structured, interpretable skill and experience repositories, without any parameter updates. This approach enhances decision transparency, auditability, and cross‑model knowledge transfer. However, the growing knowledge base introduces risks that require human oversight, bias auditing of skill and experience documents, and access‑control policies for knowledge migration.

Overall, the work points to a broader trend: AI agents evolving from stateless systems toward lifelong learning entities that, like humans, accumulate and refine problem‑solving knowledge over time.

Tags: multimodal AI, agent framework, continuous learning, benchmark evaluation, skill‑experience dual stream
Written by SuanNi

A community for AI developers that aggregates large-model development services, models, and compute power.