8×8 Matrix Gives LLMs Long‑Dialogue Memory with Just 0.12% Extra Parameters (δ‑mem)
δ‑mem introduces a compact 8×8 online state matrix that gives large language models effective long‑term memory without expanding the context window or altering the Transformer backbone. It achieves gains of up to 1.31× on memory‑intensive tasks while adding only 0.12% extra parameters.
Problem
Large language models (LLMs) suffer from a memory bottleneck when used as long‑term personalized assistants or agents. Expanding the context window incurs quadratic attention cost and leads to "context rot", where longer contexts degrade performance.
Existing memory paradigms
Text Memory (TMM): methods such as MemGPT, MemoryBank, Mem0, and RAG store history as text fragments and inject them via the input context. Flexible, but limited by context length and retrieval noise.
External‑Channel Memory (OMM): methods such as Memorizing Transformers, LongMem, and MLP Memory add a separate module that interacts with the backbone through retrieval or encoding. This adds engineering complexity and inference overhead.
Parameterized Memory (PMM): methods such as LoRA, Prefix‑Tuning, ROME, and MEMIT encode memory in adapters or prefix vectors that are frozen after training. The memory is therefore static and cannot adapt to new information dynamically.
δ‑mem design
δ‑mem adds an 8×8 online associative‑memory matrix S to a frozen Transformer. The matrix participates directly in attention via a low‑rank correction, eliminating the need for larger context windows, architecture changes, or full‑parameter fine‑tuning.
Per‑token workflow
For each token position δ‑mem executes three ordered operations:
Read: query the previous state S_{t‑1} with the current key to obtain a prediction.
Steer: linearly project the read signal to generate low‑rank adjustments for the query and output branches of attention.
Write: update the state with a Delta rule that writes only the residual (prediction error) along the key direction, preserving already learned associations.
Key technology 1: Delta‑rule online state
The state update follows a Delta rule. The model queries the old memory S_{t‑1} with the current key to obtain a prediction, then writes back only the residual (the difference between the target value and that prediction) along the key direction. This selective write lets the memory evolve stably without overwriting previously learned associations.
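The article does not reproduce the paper's exact update formula, so the following is only a minimal sketch of a textbook Delta‑rule write, S_t = S_{t‑1} + β (v_t − S_{t‑1} k_t) k_tᵀ with a unit‑norm key. The names delta_write, beta, k and v are illustrative assumptions, not identifiers from the δ‑mem code.

```python
import numpy as np

def delta_write(S, k, v, beta=1.0):
    """One textbook Delta-rule step (assumed form, not taken from the paper):
    S_new = S + beta * (v - S k) k^T, with the key normalised to unit length."""
    k = k / np.linalg.norm(k)
    prediction = S @ k              # read the old memory with the current key
    residual = v - prediction       # prediction error
    return S + beta * np.outer(residual, k)  # write only along the key direction

d = 8                                # matches the 8x8 state size
S = np.zeros((d, d))
k1, k2 = np.eye(d)[0], np.eye(d)[1]  # two orthogonal keys
v1, v2 = np.arange(d, dtype=float), np.ones(d)

S = delta_write(S, k1, v1)
S = delta_write(S, k2, v2)           # second write leaves the k1 association intact
recalled_v1 = S @ k1
recalled_v2 = S @ k2

S_again = delta_write(S, k1, v1)     # residual is zero, so nothing changes
```

With orthogonal keys and beta = 1, each read returns exactly the value that was written, and rewriting an already stored pair is a no‑op because the residual vanishes. Keys that overlap would interfere partially, which is one reason the capacity of a small state is limited.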
Key technology 2: Low‑rank attention correction
The read signal is linearly projected to produce corrections for both the query and output branches of the attention module. Unlike static adapters such as LoRA, the corrections vary with the evolving state, providing dynamic guidance.
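As a rough illustration of how a state‑dependent correction differs from a static adapter, the sketch below projects the read signal into additive corrections for a query vector and an attention output. The projection matrices W_key, U_q and U_o, the dimension d_model = 64, and the function name steer are all hypothetical; the paper's actual parameterisation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_model = 8, 64   # 8 matches the 8x8 state; d_model is illustrative

# Hypothetical learned projections (random here, trained in practice).
W_key = rng.normal(scale=0.1, size=(d_state, d_model))  # hidden state -> memory key
U_q = rng.normal(scale=0.1, size=(d_model, d_state))    # read signal -> query correction
U_o = rng.normal(scale=0.1, size=(d_model, d_state))    # read signal -> output correction

def steer(S, h, q, o):
    """Read the state with a key derived from h, then add projected corrections."""
    k = W_key @ h
    k = k / (np.linalg.norm(k) + 1e-8)
    r = S @ k                         # read signal, 8-dimensional
    return q + U_q @ r, o + U_o @ r   # corrected query and output vectors

h = rng.normal(size=d_model)          # token hidden state
q = rng.normal(size=d_model)          # frozen backbone's query vector
o = rng.normal(size=d_model)          # frozen backbone's attention output

S_empty = np.zeros((d_state, d_state))
S_later = rng.normal(size=(d_state, d_state))  # stands in for a state after some writes

q_empty, o_empty = steer(S_empty, h, q, o)
q_later, o_later = steer(S_later, h, q, o)
```

With an empty state the backbone's vectors pass through unchanged, while a populated state shifts them. Ignoring the key normalisation, the query correction is U_q S W_key applied to h, a map of rank at most 8, which is what keeps the added parameter count small.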
Key technology 3: Write granularities
Three write strategies are evaluated:
Segment‑level write (SSW)
Token‑level write (TSW)
Multi‑state parallel write (MSW)
The optimal granularity depends on model size; segment‑level excels for larger models, while multi‑state parallel benefits smaller models.
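The article names the three strategies without describing their mechanics, so the sketch below is one plausible reading rather than the paper's method. It assumes token‑level write applies a Delta‑rule step per token, segment‑level write mean‑pools keys and values over a segment and writes once, and multi‑state write keeps several independent state matrices that each receive their own key/value stream. All function names and the pooling choice are assumptions.

```python
import numpy as np

def delta_write(S, k, v, beta=1.0):
    """Textbook Delta-rule step with a unit-normalised key (assumed form)."""
    k = k / (np.linalg.norm(k) + 1e-8)
    return S + beta * np.outer(v - S @ k, k)

def token_level_write(S, keys, values):
    """Assumed TSW: one write per token."""
    for k, v in zip(keys, values):
        S = delta_write(S, k, v)
    return S

def segment_level_write(S, keys, values, segment_len):
    """Assumed SSW: mean-pool each segment, then write once per segment."""
    for start in range(0, len(keys), segment_len):
        k_seg = keys[start:start + segment_len].mean(axis=0)
        v_seg = values[start:start + segment_len].mean(axis=0)
        S = delta_write(S, k_seg, v_seg)
    return S

def multi_state_write(states, keys_per_state, values_per_state):
    """Assumed MSW: several independent states, each fed its own stream."""
    return [token_level_write(S, K, V)
            for S, K, V in zip(states, keys_per_state, values_per_state)]

rng = np.random.default_rng(0)
d, n_tokens = 8, 12
keys = rng.normal(size=(n_tokens, d))
values = rng.normal(size=(n_tokens, d))
S0 = np.zeros((d, d))

S_token = token_level_write(S0, keys, values)
S_segment = segment_level_write(S0, keys, values, segment_len=4)  # 3 writes instead of 12
S_multi = multi_state_write([S0, S0], [keys, keys[::-1]], [values, values[::-1]])
```

Under these assumptions, segment‑level write trades fine‑grained detail for fewer, smoother updates, while multiple states spread associations across several matrices. Setting segment_len=1 reduces the segment‑level variant to the token‑level one.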
Experimental evaluation
Benchmarks were run on Qwen3‑4B‑Instruct against representative baselines from the three paradigms (BM25 RAG, LLMLingua‑2, MemoryBank, Context2LoRA, MemGen, MLP Memory). Evaluation suites included general‑ability tests (IFEval, HotpotQA, GPQA‑Diamond) and memory‑intensive tasks (LoCoMo, MemoryAgentBench).
δ‑mem improved overall scores by 1.10× over the backbone and 1.15× over the strongest non‑δ‑mem baseline. Notable gains:
MemoryAgentBench: 29.54 → 38.85 (1.31×)
LoCoMo: 40.79 → 49.12 (1.20×)
HotpotQA EM/F1: 42.35/56.00 → 49.41/63.66
Cross‑backbone tests on 3B, 4B, and 8B models showed consistent improvements (e.g., Qwen3‑4B‑Instruct 46.79 → 51.66). The write strategy interacted with model capacity: segment‑level write gave the largest boost for the 8B model, while multi‑state parallel write yielded the biggest gain for the 3B model.
Ablation studies
When the original context was removed and only the compressed 8×8 state was injected, performance rose substantially over the context‑free baseline, indicating that the matrix stores usable evidence:
HotpotQA EM: 0.08% → 6.48%; F1: 8.27% → 15.20%
MemoryAgentBench: 3.49% → 8.05%
Head‑injection ablations showed that the output branch alone achieved 47.05, while the query + output combination reached 47.97. Full four‑branch injection averaged marginally higher (48.05), for little additional benefit.
Insertion‑depth experiments indicated that injecting the correction into all layers performed best (47.97). When only a subset of layers was used, middle layers contributed the most, balancing semantic abstraction and task‑specific computation.
Efficiency
δ‑mem adds only 4.87M parameters (0.12%) to a 3.6B‑parameter model and 9.65M (0.10%) to an 8.2B model. GPU memory usage matches that of the vanilla model and Context2LoRA even with 32K prompts. Decoding throughput is slightly lower because each generation step reads and writes the state, but the overhead remains within acceptable engineering limits.
Limitations
Evaluations were limited to interactions of up to tens of thousands of tokens; the capacity ceiling of an 8×8 matrix for longer or more complex multi‑turn scenarios remains an open question.
Paper: https://arxiv.org/abs/2605.12357
Code: https://github.com/MindLab-Research/delta-Mem
