How xMemory Cuts Tokens by 30% While Boosting Agent QA Scores Over 10 Points
The paper introduces xMemory, a hierarchical "split‑aggregate‑retrieve" framework that reduces token usage by up to 30% and improves QA performance by more than 10 points in long‑range agent conversations, outperforming traditional RAG across multiple LLMs.
Why Traditional RAG Struggles with Agent Memory
Standard Retrieval‑Augmented Generation (RAG) assumes a large, heterogeneous document collection where individual paragraphs differ widely and dropping a single sentence has little effect. Agent memory, by contrast, consists of a single coherent dialogue flow with high redundancy across turns; removing one sentence can break the reasoning chain.
Consequences
Top‑k similarity retrieval returns many near‑duplicate, repetitive sentences.
Post‑hoc pruning compresses timelines and breaks coreference chains, causing multi‑hop reasoning to collapse.
xMemory: Split‑Aggregate‑Retrieve Framework
Three‑Step Process
Split – Build a 4‑level hierarchy: Original message → Episode → Semantic → Theme, growing more abstract at each level.
Aggregate – Apply a sparse‑semantic objective that automatically splits overly large themes and merges tiny ones, preventing everything from collapsing into one undifferentiated theme.
Retrieve – Perform top‑down selection with an uncertainty gate: first choose relevant themes/semantics, then expand only the necessary parts of the original message so that tokens are spent on the most pertinent content.
Four‑Level Memory Tree
Original – Raw dialogue preserving timestamps and coreference chains.
Episode – Consecutive message blocks automatically segmented by a boundary‑detection prompt.
Semantic – Reusable factual units, e.g., “User moved to Seattle in Jan 2025”.
Theme – High‑level concepts such as “career planning” or “family relationships”.
Each node stores a complete evidence unit, eliminating mechanical slicing.
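The four‑level tree above can be sketched as a simple recursive node type. This is a hedged illustration: the MemoryNode class, its fields, and the descend helper are assumptions for exposition, not the released implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MemoryNode:
    level: str                          # "original" | "episode" | "semantic" | "theme"
    text: str                           # a complete evidence unit, never a mechanical slice
    timestamp: Optional[str] = None     # preserved at the Original level
    children: list["MemoryNode"] = field(default_factory=list)

    def add_child(self, node: "MemoryNode") -> None:
        self.children.append(node)

    def descend(self, level: str) -> list["MemoryNode"]:
        """Collect all descendant nodes at the given level (top-down traversal)."""
        if self.level == level:
            return [self]
        out: list["MemoryNode"] = []
        for c in self.children:
            out.extend(c.descend(level))
        return out

# Example: a theme aggregating one semantic fact grounded in a raw message.
theme = MemoryNode("theme", "career planning")
sem = MemoryNode("semantic", "User moved to Seattle in Jan 2025")
sem.add_child(MemoryNode("original", "I just moved to Seattle!", timestamp="2025-01-12"))
theme.add_child(sem)
```

Top‑down retrieval then amounts to walking from themes toward originals, expanding only the branches that matter.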
Sparse‑Semantic Objective
f(P) = SparsityScore + SemScore

SparsityScore encourages balanced theme sizes to avoid oversized candidate sets; SemScore keeps similar semantics close together and pushes different themes apart, preventing semantic islands.
In online incremental mode, new semantics are attached to the most recent theme; when a size threshold is crossed the system automatically splits or merges themes. This dynamic re‑allocation affects 44.9% of nodes, making the memory increasingly organized as it grows.
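The objective can be sketched numerically. The paper states only the two‑term form f(P) = SparsityScore + SemScore, so the concrete definitions below (normalized size entropy for sparsity; intra‑theme cohesion minus inter‑theme centroid similarity for semantics) and the function names are illustrative assumptions.

```python
import numpy as np

# A partition P is a list of themes; each theme is an (n_i, d) array of
# unit-normalized semantic embeddings. Definitions are illustrative, not
# the paper's exact formulas.

def sparsity_score(theme_sizes):
    """Higher when theme sizes are balanced (normalized size entropy)."""
    if len(theme_sizes) < 2:
        return 0.0
    p = np.array(theme_sizes, dtype=float)
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum() / np.log(len(p)))

def sem_score(themes):
    """Intra-theme cohesion minus mean inter-theme centroid similarity."""
    centroids = [t.mean(axis=0) for t in themes]
    units = [c / (np.linalg.norm(c) + 1e-12) for c in centroids]
    intra = float(np.mean([np.mean(t @ u) for t, u in zip(themes, units)]))
    pairs = [(i, j) for i in range(len(units)) for j in range(i + 1, len(units))]
    inter = float(np.mean([units[i] @ units[j] for i, j in pairs])) if pairs else 0.0
    return intra - inter

def objective(partition):
    return sparsity_score([len(t) for t in partition]) + sem_score(partition)

# Two well-separated themes score higher than a mixed-up partition.
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
good = [np.stack([a, a]), np.stack([b, b])]
mixed = [np.stack([a, b]), np.stack([a, b])]
```

Under this sketch, the incremental split/merge rule would accept a split or merge whenever it increases f(P).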
Top‑Down Retrieval
Stage 1: Skeleton Selection – A submodular greedy algorithm selects representative nodes on the theme‑semantic layer that maximize coverage and relevance, inherently de‑duplicating.
Stage 2: Uncertainty Expansion – Only Episodes or original messages that significantly reduce the LLM’s predictive entropy are retained; any redundant sentence is discarded.
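The two stages can be sketched as follows. The marginal‑gain greedy loop is a standard submodular‑coverage heuristic, and the entropy_fn callback stands in for an LLM predictive‑entropy probe; the function names, the coverage definition, and the min_drop threshold are assumptions, not the authors' API.

```python
import numpy as np

def select_skeleton(query, nodes, k=3):
    """Stage 1: greedy max-coverage selection over unit node embeddings.

    The marginal gain of a node is its relevance to the query minus its
    overlap with what is already covered, so near-duplicates are skipped.
    """
    chosen, covered = [], np.zeros_like(query)
    for _ in range(min(k, len(nodes))):
        def gain(i):
            return nodes[i] @ query - nodes[i] @ covered
        best = max((i for i in range(len(nodes)) if i not in chosen), key=gain)
        if gain(best) <= 0:   # nothing left that adds new coverage
            break
        chosen.append(best)
        covered += nodes[best]
    return chosen

def uncertainty_expand(candidates, entropy_fn, base_entropy, min_drop=0.1):
    """Stage 2: keep only expansions that cut predictive entropy by >= min_drop."""
    return [c for c in candidates if base_entropy - entropy_fn(c) >= min_drop]
```

For example, with a query embedding and a node list containing an exact duplicate, the greedy loop selects each distinct direction once and then stops, which is the de‑duplication behavior described above.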
Experimental Evaluation
Token Reduction and Score Gains
LoCoMo (BLEU/F1): baseline 36.65/48.17 → xMemory 38.71/50.00 (+2.1 BLEU, +1.8 F1), token count ↓28%.
PerLTQA (BLEU/F1/R‑L): baseline 33.44/41.79/38.43 → xMemory 36.79/46.23/41.25 (+3.4 BLEU, +4.4 F1, +2.8 R‑L), token count ↓38%.
The trend holds across three LLMs (Qwen‑3‑8B, Llama‑3.1‑8B, GPT‑5‑nano); smaller models exhibit larger relative improvements.
Ablation Study
Hierarchy only: BLEU ↓2.7, token usage ↑53% – better than vanilla RAG but still redundant.
+ Representative selection: BLEU ↓1.9, token usage ↑34% – high‑level de‑duplication reduces redundancy.
+ Uncertainty gate: BLEU ↓1.2, token usage ↑39% – fine‑grained pruning yields denser evidence.
Full xMemory: best BLEU, lowest token usage – the components complement each other (1 + 1 > 2).
Evidence Density
The proportion of retrieval blocks that hit two or more answer tokens doubles compared with naive pruning, which collapses multi‑hit blocks to single hits.
In the same long dialogue, RAG needs 20 blocks to cover the answer, whereas xMemory succeeds with only 5; naive pruning reduces tokens but also removes crucial information.
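The evidence‑density metric described above (the share of retrieved blocks containing at least two answer tokens) can be sketched in a few lines; the whitespace tokenization is a simplifying assumption.

```python
def evidence_density(blocks, answer, min_hits=2):
    """Fraction of retrieved blocks containing >= min_hits answer tokens.

    Tokenization is a plain lowercase whitespace split, a deliberate
    simplification for illustration.
    """
    ans_tokens = set(answer.lower().split())
    def hits(block):
        return sum(1 for t in block.lower().split() if t in ans_tokens)
    dense = sum(1 for b in blocks if hits(b) >= min_hits)
    return dense / len(blocks) if blocks else 0.0
```

A retriever that returns many single‑hit blocks scores low on this metric even if its top‑k recall looks fine, which is exactly the failure mode attributed to naive pruning.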
Practical Recommendations
Avoid blind top‑k tuning – agent memory is highly self‑correlated; larger k adds redundancy. Hierarchical greedy selection yields immediate gains.
Prune with uncertainty gating – evidence chains are tightly coupled; hard pruning can cut essential premises. The uncertainty gate reduces tokens while improving accuracy.
Keep the structure dynamic – users may correct facts at any time; allowing themes to split/merge prevents the memory from becoming rigid.
The authors have released MIT‑licensed code (GitHub repository linked in the paper) and plan extensions for multimodal and federated‑privacy memory.
Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation
https://arxiv.org/pdf/2602.02007
