LightMem (ICLR 2026): Cutting the Cost of Long‑Term Memory for Large Language Models

LightMem proposes a three‑stage, human‑inspired memory pipeline that dramatically lowers token usage, API calls, and latency while improving accuracy, achieving up to 7.7% higher scores and more than 30‑fold cost reductions on long‑context benchmarks.


Large language models excel at many tasks, but in real‑world multi‑turn, multi‑task interactions they quickly run into two classic problems: limited context windows and the "lost in the middle" effect. External memory systems are therefore essential, yet existing ones are prohibitively expensive.

Why Existing Memory Systems Are Too Costly

Typical LLM memory pipelines split dialogues into turns or sessions, summarize or extract each segment, store results in a vector store or knowledge graph, and perform online updates (add/delete/merge/ignore) before retrieval. This approach suffers from three major drawbacks:

Redundant information (greetings, confirmations, repeated explanations) floods the pipeline, inflating token consumption and potentially harming in‑context learning.

Rigid segmentation granularity: fine‑grained turns cause an explosion of summarization calls, while coarse sessions mix topics and degrade summary quality.

Heavy online updates bind costly operations to test time, increasing latency and risking accidental deletion of useful information.
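
To make the cost structure concrete, here is a minimal Python sketch of the conventional pipeline just described; `llm_summarize` and the plain‑list store are hypothetical stand‑ins for a real LLM API and a vector store or knowledge graph.

```python
def llm_summarize(segment: list[str]) -> str:
    """Hypothetical stand-in for a paid LLM summarization request."""
    return " | ".join(segment)[:200]

def baseline_pipeline(dialogue: list[str],
                      turns_per_segment: int = 1) -> list[tuple[int, str]]:
    """Conventional flow: segment -> summarize -> store -> (online updates)."""
    store: list[tuple[int, str]] = []  # stand-in for a vector store / KG
    for i in range(0, len(dialogue), turns_per_segment):
        segment = dialogue[i:i + turns_per_segment]
        # One costly online LLM call per segment: with fine-grained turns
        # (turns_per_segment=1) this is the call explosion criticized above.
        store.append((i, llm_summarize(segment)))
    return store
```

Every segment costs one LLM request at test time, and redundant turns (greetings, confirmations) are summarized like everything else; LightMem's three stages each attack one of these costs.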

LightMem’s Core Idea: A Three‑Stage “Human‑Like” Memory Pipeline

LightMem decomposes memory into three lightweight modules—Light1 (Sensory Memory), Light2 (Short‑Term Memory, STM), and Light3 (Long‑Term Memory + Sleep‑Time Update)—mirroring the human memory hierarchy of sensory filtering, short‑term organization, and offline consolidation.

Light1: Sensory Memory – Filter and Topic‑Split

A lightweight compression model (LLMLingua‑2 by default) pre‑filters raw input, discarding redundant tokens while preserving high‑information content. Experiments show that with compression rates of 50‑80% the LLM still understands the compressed context with negligible accuracy loss. Topic boundaries are then identified by combining attention peaks with semantic similarity checks, producing a refined set of segment points that avoid naïve window cuts.
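
As a rough illustration of this stage, the sketch below compresses each turn with LLMLingua‑2 via the open‑source llmlingua package (constructor and compress_prompt arguments follow its README, but treat them as assumptions), then splits topics using only the semantic‑similarity half of the paper's criterion; the attention‑peak signal would require access to the backbone's attention maps and is omitted, and `embed` is a hypothetical placeholder for a real sentence encoder.

```python
import numpy as np
from llmlingua import PromptCompressor  # pip install llmlingua

# LLMLingua-2 checkpoint and arguments per the llmlingua README (assumed).
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

def compress_turn(turn: str, rate: float = 0.5) -> str:
    # Keep roughly `rate` of the tokens; the paper reports that 50-80%
    # compression preserves understanding with negligible accuracy loss.
    return compressor.compress_prompt(turn, rate=rate)["compressed_prompt"]

def embed(text: str) -> np.ndarray:
    """Hypothetical sentence encoder; swap in any real embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(128)

def topic_segments(turns: list[str], threshold: float = 0.35) -> list[list[str]]:
    """Split compressed turns where adjacent semantic similarity drops."""
    if not turns:
        return []
    compressed = [compress_turn(t) for t in turns]
    vecs = [embed(t) for t in compressed]
    segments, current = [], [compressed[0]]
    for prev, cur, turn in zip(vecs, vecs[1:], compressed[1:]):
        sim = float(prev @ cur) / (np.linalg.norm(prev) * np.linalg.norm(cur))
        if sim < threshold:  # similarity drop marks a candidate topic boundary
            segments.append(current)
            current = []
        current.append(turn)
    segments.append(current)
    return segments
```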

Light2: Short‑Term Memory – Topic‑Aware Buffering

Segments are stored as {topic, turns} in an STM buffer. Summarization is triggered only when the buffer reaches a token threshold, yielding structured topic summaries that are written to LTM. This reduces the number of summarization calls and improves accuracy because each summary is constrained by a coherent topic, preventing cross‑topic contamination. Ablation studies confirm that removing topic segmentation drops accuracy for both GPT and Qwen backbones.
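
A minimal sketch of this buffering logic, with a whitespace token count standing in for a real tokenizer and `summarize_topic` as a hypothetical LLM call:

```python
from dataclasses import dataclass, field

def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in for the backbone's real tokenizer

def summarize_topic(topic: str, turns: list[str]) -> str:
    """Hypothetical LLM call producing one structured summary per topic."""
    return f"[{topic}] " + " ".join(turns)[:200]

@dataclass
class ShortTermMemory:
    token_threshold: int = 512
    buffer: dict[str, list[str]] = field(default_factory=dict)  # {topic: turns}
    ltm: list[str] = field(default_factory=list)  # summaries written to LTM

    def add(self, topic: str, turns: list[str]) -> None:
        self.buffer.setdefault(topic, []).extend(turns)
        total = sum(count_tokens(t) for ts in self.buffer.values() for t in ts)
        if total >= self.token_threshold:  # summarize only at the threshold
            self.flush()

    def flush(self) -> None:
        # One summarization call per coherent topic, not per turn, so each
        # summary stays on-topic and the number of LLM calls stays small.
        for topic, turns in self.buffer.items():
            self.ltm.append(summarize_topic(topic, turns))
        self.buffer.clear()
```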

Light3: Long‑Term Memory + Sleep‑Time Update – Offline Consolidation

Online updates are limited to soft inserts without conflict resolution. During offline "sleep" phases, each memory entry maintains an update queue in which newer updates may overwrite older ones, respecting timestamps. Queues are processed in parallel, eliminating the sequential read‑write bottleneck of traditional online updates and dramatically lowering latency and error risk.
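
The queue‑and‑consolidate pattern might look like the following sketch: online writes are append‑only soft inserts, and an offline pass resolves each entry's timestamped queue in parallel, with newer updates overwriting older ones. The class and method names are illustrative, not the paper's API.

```python
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

class LongTermMemory:
    def __init__(self) -> None:
        self.entries: dict[str, str] = {}  # consolidated memory
        # Per-entry queues of (timestamp, value) pending soft inserts.
        self.queues: dict[str, list[tuple[float, str]]] = defaultdict(list)

    def soft_insert(self, key: str, value: str) -> None:
        # Online path: append only, no conflict resolution at test time.
        self.queues[key].append((time.time(), value))

    def _consolidate(self, key: str) -> None:
        # Replay the queue in timestamp order; newer updates overwrite older.
        for _, value in sorted(self.queues[key]):
            self.entries[key] = value
        self.queues[key].clear()

    def sleep_time_update(self, max_workers: int = 8) -> None:
        # Offline "sleep" path: queues are independent per entry, so they
        # consolidate in parallel instead of serializing reads and writes.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            pool.map(self._consolidate, list(self.queues))
```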

Experimental Results

LightMem was evaluated on two long‑context benchmarks, LongMemEval (and its short variant) and LoCoMo, using three backbones: GPT‑4o‑mini, Qwen‑3‑30B‑A3B, and GLM‑4.6. The paper reports:

Accuracy improvements of up to 7.7% / 29.3% over strong baselines, depending on setting and backbone.

Token consumption reductions of up to 38× / 20.9× and API‑call reductions of up to 30× / 55.5×.

When measuring only test‑time cost, token savings reach 106×‑117× and API‑call savings reach 159×‑310×.

These gains demonstrate that LightMem delivers both higher performance and substantially lower operational cost.

Conclusion

LightMem offers a pragmatic, lightweight memory system for long‑dialogue agents, emphasizing human‑like memory division: aggressive front‑end filtering, topic‑aware short‑term buffering, and offline parallel consolidation. It makes long‑term memory feasible for production agents without sacrificing accuracy.

Paper: LightMem: Lightweight and Efficient Memory‑Augmented Generation (arXiv:2510.18866). Code: https://github.com/zjunlp/LightMem.

Tags: LLM · ICLR 2026 · LightMem · Memory‑Augmented Generation