How xMemory Cuts Tokens by 30% While Boosting Agent QA Scores Over 10 Points

The paper introduces xMemory, a hierarchical "split‑aggregate‑retrieve" framework that reduces token usage by up to 30% and improves QA performance by more than 10 points in long‑range agent conversations, outperforming traditional RAG across multiple LLMs.


Why Traditional RAG Struggles with Agent Memory

Standard Retrieval‑Augmented Generation (RAG) assumes a large, heterogeneous document collection where individual paragraphs differ widely and dropping a single sentence has little effect. Agent memory, by contrast, consists of a single coherent dialogue flow with high redundancy across turns; removing one sentence can break the reasoning chain.

Consequences

Top‑k similarity retrieval returns many near‑duplicate, repetitive sentences.

Pruning after retrieval compresses timelines and breaks coreference chains, causing multi‑hop reasoning to collapse.

Diagram showing RAG limitations in agent memory

xMemory: Split‑Aggregate‑Retrieve Framework

Three‑Step Process

Split – Build a 4‑level hierarchy: Original message → Episode → Semantic → Theme, with each level more abstract than the one below.

Aggregate – Apply a sparse‑semantic objective that automatically splits overly large themes and merges tiny ones, preventing collapse into a single undifferentiated cluster.

Retrieve – Perform top‑down selection with an uncertainty gate: first choose relevant themes/semantics, then expand only the necessary parts of the original messages so that tokens are spent on the most pertinent content. A toy sketch of the full flow follows.
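A minimal, self‑contained sketch of how the three steps could compose. The fixed‑window segmentation, most‑frequent‑word theming, and word‑overlap scoring are toy stand‑ins for the paper's LLM‑driven components; all function names here are illustrative, not the paper's API.

```python
from collections import defaultdict

def split(messages, episode_len=4):
    # Split: cut the dialogue into episode blocks (the paper uses an LLM
    # boundary-detection prompt rather than a fixed window).
    return [messages[i:i + episode_len] for i in range(0, len(messages), episode_len)]

def aggregate(episodes):
    # Aggregate: group episodes under themes, here keyed by each block's
    # most frequent word as a crude stand-in for semantic clustering.
    themes = defaultdict(list)
    for ep in episodes:
        words = " ".join(ep).lower().split()
        key = max(set(words), key=words.count) if words else "misc"
        themes[key].append(ep)
    return themes

def retrieve(themes, query, budget=2):
    # Retrieve: rank themes by word overlap with the query, then expand
    # only the top themes back into their original episodes.
    q = set(query.lower().split())
    def overlap(eps):
        text = " ".join(msg for ep in eps for msg in ep).lower()
        return len(q & set(text.split()))
    best = sorted(themes.values(), key=overlap, reverse=True)[:budget]
    return [ep for eps in best for ep in eps]
```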

Four‑Level Memory Tree

Original – Raw dialogue preserving timestamps and coreference chains.

Episode – Consecutive message blocks automatically segmented by a boundary‑detection prompt.

Semantic – Reusable factual units, e.g., “User moved to Seattle in Jan 2025”.

Theme – High‑level concepts such as “career planning” or “family relationships”.

Each node stores a complete evidence unit, eliminating mechanical slicing.
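A minimal sketch of the four‑level tree as plain Python dataclasses; the field names and parent‑child wiring are assumptions for illustration, not the paper's schema.

```python
from dataclasses import dataclass, field
from enum import Enum

class Level(Enum):
    ORIGINAL = 0   # raw message, timestamps and coreference intact
    EPISODE = 1    # consecutive message block
    SEMANTIC = 2   # reusable factual unit
    THEME = 3      # high-level concept

@dataclass
class MemoryNode:
    level: Level
    text: str                       # a complete evidence unit, never a slice
    timestamp: float | None = None  # kept on ORIGINAL nodes
    children: list["MemoryNode"] = field(default_factory=list)

# Example: one branch of the tree, from theme down to a raw message.
msg = MemoryNode(Level.ORIGINAL, "I started at the Seattle office on Jan 6, 2025.", 1736121600.0)
episode = MemoryNode(Level.EPISODE, "User discusses the Seattle relocation.", children=[msg])
fact = MemoryNode(Level.SEMANTIC, "User moved to Seattle in Jan 2025", children=[episode])
theme = MemoryNode(Level.THEME, "career planning", children=[fact])
```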

Sparse‑Semantic Objective

f(P) = SparsityScore(P) + SemScore(P)

SparsityScore encourages balanced theme sizes to avoid oversized candidate sets. SemScore keeps similar semantics close together and pushes different themes apart, preventing semantic islands.
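A toy rendering of the objective, assuming themes are sets of embedding vectors, sparsity is measured as size‑distribution entropy, and semantic quality as intra‑theme cohesion minus inter‑theme centroid similarity; the paper's exact terms may be weighted or normalized differently.

```python
import numpy as np

def sparsity_score(partition):
    # Entropy of the theme-size distribution: maximal when sizes are
    # balanced, penalizing one oversized catch-all theme.
    sizes = np.array([len(t) for t in partition], dtype=float)
    p = sizes / sizes.sum()
    return float(-(p * np.log(p)).sum())

def sem_score(partition):
    # Reward tight themes (high mean intra-theme cosine similarity) and
    # well-separated themes (low similarity between theme centroids).
    intra, centroids = [], []
    for theme in partition:
        X = np.stack(theme)
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        intra.append((X @ X.T).mean())
        centroids.append(X.mean(axis=0))
    C = np.stack(centroids)
    C = C / np.linalg.norm(C, axis=1, keepdims=True)
    n = len(C)
    inter = ((C @ C.T).sum() - n) / max(n * (n - 1), 1)
    return float(np.mean(intra) - inter)

def f(partition):
    # partition: list of themes, each a list of 1-D embedding vectors
    return sparsity_score(partition) + sem_score(partition)
```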

In online incremental mode, new semantics are attached to the most recent theme; when a size threshold is crossed, the system automatically splits or merges themes. This dynamic re‑allocation affects 44.9% of nodes, making the memory increasingly organized as it grows.
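A sketch of that incremental step under assumed size thresholds; the paper re‑optimizes against the sparse‑semantic objective rather than splitting at the midpoint as done here.

```python
MAX_THEME, MIN_THEME = 16, 2  # illustrative thresholds, not from the paper

def insert_semantic(themes, unit):
    # Fast path: attach the new semantic unit to the most recent theme.
    themes[-1].append(unit)

    # Split: an oversized theme is cut in half (a stand-in for
    # re-optimizing the sparse-semantic objective).
    if len(themes[-1]) > MAX_THEME:
        big = themes.pop()
        mid = len(big) // 2
        themes += [big[:mid], big[mid:]]

    # Merge: fold undersized themes into a neighbor so tiny themes
    # do not fragment retrieval.
    merged = []
    for theme in themes:
        if merged and len(theme) < MIN_THEME:
            merged[-1] += theme
        else:
            merged.append(theme)
    themes[:] = merged
```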

Top‑Down Retrieval

Stage 1: Skeleton Selection – A submodular greedy algorithm selects representative nodes at the theme and semantic layers that maximize coverage and relevance, de‑duplicating by construction.
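Stage 1 follows the standard submodular greedy pattern; the gain function below (new‑token coverage weighted by query relevance) is an assumed stand‑in for the paper's objective.

```python
def select_skeleton(nodes, relevance, k=5):
    # nodes: one token set per theme/semantic node
    # relevance: query-affinity score per node (e.g., cosine similarity)
    covered, chosen = set(), []
    for _ in range(min(k, len(nodes))):
        def gain(i):
            return len(nodes[i] - covered) * relevance[i]
        remaining = [i for i in range(len(nodes)) if i not in chosen]
        best = max(remaining, key=gain)
        if gain(best) == 0:   # nothing new left to cover: stop early
            break
        chosen.append(best)
        covered |= nodes[best]  # shrinking marginal gains = built-in de-duplication
    return chosen
```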

Stage 2: Uncertainty Expansion – Only Episodes or original messages that significantly reduce the LLM’s predictive entropy are retained; any redundant sentence is discarded.
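Stage 2 can be sketched as an entropy‑drop filter; llm_entropy is a hypothetical hook (e.g., mean token entropy of a draft answer conditioned on the context), and the 0.05 threshold is an arbitrary illustration.

```python
def expand_with_uncertainty(skeleton, candidates, query, llm_entropy, min_drop=0.05):
    # skeleton: evidence blocks already chosen in Stage 1
    # candidates: Episode / original-message expansions, in tree order
    context = list(skeleton)
    h = llm_entropy(query, context)
    for block in candidates:
        h_new = llm_entropy(query, context + [block])
        if h - h_new >= min_drop:   # block genuinely reduces uncertainty
            context.append(block)
            h = h_new
        # otherwise the block is redundant and is discarded
    return context
```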

Experimental Evaluation

Token Reduction and Score Gains

LoCoMo (BLEU/F1): baseline 36.65/48.17 → xMemory 38.71/50.00 (+2.1 BLEU, +1.8 F1), token count ↓28%.

PerLTQA (BLEU/F1/R‑L): baseline 33.44/41.79/38.43 → xMemory 36.79/46.23/41.25 (+3.4 BLEU, +4.4 F1, +2.8 R‑L), token count ↓38%.

The trend holds across three LLMs (Qwen‑3‑8B, Llama‑3.1‑8B, GPT‑5‑nano); smaller models exhibit larger relative improvements.

Ablation Study

Hierarchy only: BLEU ↓2.7, token usage ↑53% – better than vanilla RAG but still redundant.

+ Representative selection: BLEU ↓1.9, token usage ↑34% – high‑level de‑duplication reduces redundancy.

+ Uncertainty gate: BLEU ↓1.2, token usage ↑39% – fine‑grained pruning yields denser evidence.

Full xMemory: best BLEU, lowest token usage – the components complement each other, with the whole exceeding the sum of its parts.

Evidence Density

The proportion of retrieval blocks that hit two or more answer tokens doubles compared with naive pruning, which collapses multi‑hit blocks to single hits.
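A sketch of how this evidence‑density statistic could be computed, assuming whitespace tokenization and the two‑token hit threshold described above.

```python
def evidence_density(blocks, answer, min_hits=2):
    # Fraction of retrieved blocks containing >= min_hits answer tokens.
    ans = set(answer.lower().split())
    hits = [len(ans & set(block.lower().split())) for block in blocks]
    return sum(h >= min_hits for h in hits) / len(blocks) if blocks else 0.0
```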

Comparison of retrieval block usage between RAG and xMemory

In the same long dialogue, RAG needs 20 blocks to cover the answer, whereas xMemory succeeds with only 5; naive pruning reduces tokens but also removes crucial information.

Practical Recommendations

Avoid blind top‑k tuning – agent memory is highly self‑correlated; larger k adds redundancy. Hierarchical greedy selection yields immediate gains.

Prune with uncertainty gating – evidence chains are tightly coupled; hard pruning can cut essential premises. The uncertainty gate reduces tokens while improving accuracy.

Keep the structure dynamic – users may correct facts at any time; allowing themes to split/merge prevents the memory from becoming rigid.

The authors have released MIT‑licensed code (GitHub repository linked in the paper) and plan extensions for multimodal and federated‑privacy memory.

Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation
https://arxiv.org/pdf/2602.02007
Tags: LLM, RAG, Information Retrieval, Agent Memory, Hierarchical Retrieval, xMemory
Written by PaperAgent

Daily updates analyzing cutting-edge AI research papers.