LightMem (ICLR 2026): Cutting the Cost of Long‑Term Memory for Large Language Models
LightMem proposes a three‑stage, human‑inspired memory pipeline that dramatically lowers token usage, API calls, and latency while preserving accuracy, achieving up to 7.7% higher scores and 30‑plus‑fold cost reductions on long‑context benchmarks.
Large language models excel in many tasks, but in real‑world multi‑turn, multi‑task interactions they quickly hit two classic problems: limited context windows and the "lost in the middle" issue, making external memory systems essential yet prohibitively expensive.
Why Existing Memory Systems Are Too Costly
Typical LLM memory pipelines split dialogues into turns or sessions, summarize or extract each segment, store results in a vector store or knowledge graph, and perform online updates (add/delete/merge/ignore) before retrieval. This approach suffers from three major drawbacks:
Redundant information (greetings, confirmations, repeated explanations) floods the pipeline, inflating token consumption and potentially harming in‑context learning.
Rigid segmentation granularity: fine‑grained turns cause an explosion of summarization calls, while coarse sessions mix topics and degrade summary quality.
Heavy online updates bind costly operations to test time, increasing latency and risking accidental deletion of useful information.
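The cost structure of this conventional design can be sketched in a few lines. This is an illustrative toy, not the paper's code: `summarize` stands in for an LLM call, and all names are hypothetical.

```python
def summarize(turns):
    # Placeholder for an LLM summarization call.
    return " | ".join(t[:20] for t in turns)

def conventional_pipeline(dialogue, turns_per_segment=2):
    """Fixed-granularity segmentation -> per-segment summarization -> storage."""
    store = []      # stands in for a vector store or knowledge graph
    api_calls = 0
    for i in range(0, len(dialogue), turns_per_segment):
        segment = dialogue[i:i + turns_per_segment]
        store.append(summarize(segment))  # one LLM call per segment
        api_calls += 1
        # Online add/delete/merge would also happen here, at test time.
    return store, api_calls

dialogue = ["hi", "hello!", "what's the capital of France?", "Paris.",
            "thanks", "you're welcome"]
memories, calls = conventional_pipeline(dialogue)
```

With 6 turns and 2 turns per segment this makes 3 summarization calls, and low-value turns like "hi" and "thanks" are summarized alongside real content: exactly the redundancy and call-explosion problems listed above.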
LightMem’s Core Idea: A Three‑Stage “Human‑Like” Memory Pipeline
LightMem decomposes memory into three lightweight modules—Light1 (Sensory Memory), Light2 (Short‑Term Memory, STM), and Light3 (Long‑Term Memory + Sleep‑Time Update)—mirroring the human memory hierarchy of sensory filtering, short‑term organization, and offline consolidation.
Light1: Sensory Memory – Filter and Topic‑Split
A lightweight compression model (LLMLingua‑2 by default) pre‑filters raw input, discarding redundant tokens while preserving high‑information content. Experiments show that at compression rates of 50–80% the LLM still understands the compressed context with negligible accuracy loss. Topic boundaries are then identified by combining attention peaks with semantic similarity checks, producing a refined set of segment points that avoid naïve window cuts.
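The two operations can be illustrated with simple stand-ins. Here, a length-based token score substitutes for LLMLingua‑2's learned importance model, and bag-of-words cosine similarity substitutes for the paper's attention-plus-embedding boundary detector; every name and threshold is an assumption for illustration only.

```python
import math
from collections import Counter

def cosine(a, b):
    # Bag-of-words cosine similarity between two turns (toy stand-in for
    # the semantic similarity check used in Light1).
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def compress(turn, keep_score, rate=0.5):
    # Stand-in for LLMLingua-2: keep only the highest-scoring fraction
    # of tokens, preserving their original order.
    tokens = turn.split()
    k = max(1, int(len(tokens) * rate))
    keep = set(sorted(tokens, key=keep_score, reverse=True)[:k])
    return " ".join(t for t in tokens if t in keep)

def topic_boundaries(turns, sim_threshold=0.2):
    # Simplified boundary rule: start a new topic segment wherever
    # consecutive turns are semantically dissimilar.
    return [i for i in range(1, len(turns))
            if cosine(turns[i - 1], turns[i]) < sim_threshold]

compressed = compress("please kindly summarize the quarterly revenue numbers",
                      keep_score=len)
turns = ["plan the trip itinerary for Rome trip",
         "book Rome hotel near the trip route",
         "debug the python script error",
         "python script error traceback fix"]
segments = topic_boundaries(turns)
```

On the toy dialogue above, the boundary detector splits between the travel turns and the debugging turns, which is the behavior naïve fixed-window cuts cannot guarantee.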
Light2: Short‑Term Memory – Topic‑Aware Buffering
Segments are stored as {topic, turns} in an STM buffer. Summarization is triggered only when the buffer reaches a token threshold, yielding structured topic summaries that are written to LTM. This reduces the number of summarization calls and improves accuracy because each summary is constrained by a coherent topic, preventing cross‑topic contamination. Ablation studies confirm that removing topic segmentation drops accuracy for both GPT and Qwen backbones.
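A minimal sketch of this topic-aware buffer, assuming a token-count trigger and a pluggable summarizer (names and the threshold are illustrative, not the paper's implementation):

```python
class ShortTermMemory:
    def __init__(self, summarizer, token_threshold=256):
        self.buffer = {}                  # topic -> list of raw turns
        self.summarizer = summarizer      # stands in for an LLM call
        self.token_threshold = token_threshold
        self.long_term = []               # structured summaries flushed to LTM

    def add(self, topic, turn):
        self.buffer.setdefault(topic, []).append(turn)
        # Summarize only once this topic's buffer is large enough, so each
        # LLM call covers one coherent topic and calls stay infrequent.
        if sum(len(t.split()) for t in self.buffer[topic]) >= self.token_threshold:
            self.flush(topic)

    def flush(self, topic):
        turns = self.buffer.pop(topic, [])
        if turns:
            self.long_term.append({"topic": topic,
                                   "summary": self.summarizer(turns)})

stm = ShortTermMemory(summarizer=lambda turns: " ".join(turns),
                      token_threshold=5)
stm.add("travel", "book a flight")   # 3 tokens buffered, no LLM call yet
stm.add("travel", "and a hotel")     # 6 tokens >= threshold, flush to LTM
```

Keying the buffer by topic is what prevents cross-topic contamination: a chatty off-topic turn lands in its own bucket instead of polluting an in-progress summary.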
Light3: Long‑Term Memory + Sleep‑Time Update – Offline Consolidation
Online updates are limited to soft inserts without conflict resolution. During offline “sleep” phases, each memory entry builds an update queue (new updates may only overwrite older ones, respecting timestamps). Queues are processed in parallel, eliminating the sequential read‑write bottleneck of traditional online updates and dramatically lowering latency and error risk.
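The consolidation step can be sketched as per-entry update queues drained in parallel. This is a schematic reading of the mechanism, with assumed data shapes (timestamped `(ts, value)` updates and a `{"ts": ..., "value": ...}` entry record), not the paper's actual code.

```python
from concurrent.futures import ThreadPoolExecutor

def consolidate(entry, queue):
    # Apply queued updates in timestamp order; a newer update may
    # overwrite an older value, never the reverse.
    for ts, value in sorted(queue, key=lambda u: u[0]):
        if ts >= entry["ts"]:
            entry = {"ts": ts, "value": value}
    return entry

def sleep_time_update(memory, queues):
    # Each entry's queue is independent of the others, so consolidation
    # runs in parallel instead of serializing reads and writes at test time.
    keys = list(memory)
    with ThreadPoolExecutor() as pool:
        results = pool.map(
            lambda k: consolidate(memory[k], queues.get(k, [])), keys)
        return dict(zip(keys, results))

memory = {"user_city": {"ts": 1, "value": "Paris"}}
queues = {"user_city": [(3, "Berlin"), (2, "Rome")]}
updated = sleep_time_update(memory, queues)
```

Because conflict resolution is deferred to this offline phase, the online path only performs cheap soft inserts, which is where the latency savings come from.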
Experimental Results
LightMem was evaluated on two long‑context benchmarks, LongMemEval (and its short variant) and LoCoMo, using three backbones: GPT‑4o‑mini, Qwen‑3‑30B‑A3B, and GLM‑4.6. The paper reports:
Accuracy improvements of up to 7.7% / 29.3% over strong baselines, depending on setting and backbone.
Token consumption reductions of up to 38× / 20.9× and API‑call reductions of up to 30× / 55.5×.
When measuring only test‑time cost, token savings reach 106×–117× and API‑call savings reach 159×–310×.
These gains demonstrate that LightMem delivers both higher performance and substantially lower operational cost.
Conclusion
LightMem offers a pragmatic, lightweight memory system for long‑dialogue agents, emphasizing human‑like memory division: aggressive front‑end filtering, topic‑aware short‑term buffering, and offline parallel consolidation. It makes long‑term memory feasible for production agents without sacrificing accuracy.
Paper: LightMem: Lightweight and Efficient Memory‑Augmented Generation (arXiv:2510.18866). Code: https://github.com/zjunlp/LightMem.
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.