Designing Scalable Long-Term Memory for AI Agents: Capture, Compress, Retrieve
This article explains how to build a controllable, editable, and cost‑effective long‑term memory system for AI agents by categorizing memory types, structuring a three‑stage pipeline of capture, AI‑driven compression, and smart retrieval, and choosing appropriate storage back‑ends such as files, knowledge bases, or databases.
1. Types of Long‑Term Memory
Long‑term memory for AI agents is not a simple log of past dialogues; it must keep user preferences, project context, and key decisions across days and weeks, control token and throughput costs, and allow errors to be corrected, edited, or forgotten.
User Long‑Term Memory : Stable facts (name, goals, preferences) that are injected on every turn. It must be auditable (who wrote it, when, from which turn) and reversible (explicit delete commands or GDPR‑style purge).
Task Memory : Temporary state that expires (e.g., multi‑day troubleshooting progress, PR discussion outcomes). It requires a TTL to avoid polluting retrieval.
Event/Operation Memory : High‑frequency logs from tool calls, file writes, or commands. Because this stream is noisy, it is stored in layered caches (hot for recent events, cold for archived history).
2. Three‑Stage Memory Pipeline
2.1 Memory Capture
Capture defines who writes, what is written, and where it is stored. It is split by event source:
Dialogue Events : User input, model output, session metadata.
Tool Events : Tool name, parameters, return values, and side effects (file changes, config updates).
User Explicit Commands : Direct write/delete/modify instructions that bypass summarisation.
Example hooks from Claude‑Mem ("Five Lifecycle Hooks"):
context‑hook – triggered at session start to inject recent memory.
new‑hook – triggered when the user asks a question; creates a new session and saves the prompt.
save‑hook – runs after a tool execution to capture file‑read/write actions.
summary‑hook – runs at session end to generate an AI summary and persist it.
cleanup‑hook – runs on stop commands to clean temporary data.
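To make the lifecycle concrete, here is a minimal dispatch sketch in Python. The hook names follow the Claude‑Mem list above, but the `HookRegistry` class and its API are hypothetical illustrations, not Claude‑Mem's actual interface.

```python
from collections import defaultdict
from typing import Callable

class HookRegistry:
    """Hypothetical dispatcher mapping lifecycle events to handlers."""

    def __init__(self):
        self._hooks: dict[str, list[Callable]] = defaultdict(list)

    def on(self, event: str):
        def register(fn: Callable):
            self._hooks[event].append(fn)
            return fn
        return register

    def fire(self, event: str, **payload):
        for fn in self._hooks[event]:
            fn(**payload)

hooks = HookRegistry()

@hooks.on("session_start")          # context-hook: inject recent memory
def inject_recent_memory(session_id: str):
    print(f"[context-hook] loading recent memory for {session_id}")

@hooks.on("tool_executed")          # save-hook: capture file read/write actions
def capture_tool_event(tool: str, args: dict):
    print(f"[save-hook] recording {tool} call with {args}")

@hooks.on("session_end")            # summary-hook: persist an AI summary
def summarise_session(session_id: str):
    print(f"[summary-hook] generating AI summary for {session_id}")

# Fire events at the appropriate lifecycle points:
hooks.fire("session_start", session_id="s-42")
hooks.fire("tool_executed", tool="write_file", args={"path": "config.yml"})
hooks.fire("session_end", session_id="s-42")
```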
2.2 AI Compression
Compression reduces the amount of data injected into the prompt while keeping high‑quality retrieval. The goals are to lower token cost and improve retrieval controllability.
A practical approach is to summarise every 10 turns into a structured 200‑word chunk with fixed fields:
Goal/Constraints
Key Decision + Reason
Open Issues
Next Steps
Evidence Index (link to original event/log ID)
This structure lets the agent quickly decide whether a summary is useful and trace back to raw evidence when needed.
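A sketch of that schema as a dataclass, plus a trigger that fires every 10 turns. `summarise_with_llm` stands in for whatever model call you use and is an assumption, not a prescribed API.

```python
from dataclasses import dataclass, field

@dataclass
class SummaryChunk:
    """One structured, ~200-word summary covering a 10-turn window."""
    goal_constraints: str
    key_decision: str                 # the decision plus the reason it was taken
    open_issues: list[str]
    next_steps: list[str]
    evidence_ids: list[str] = field(default_factory=list)   # raw event/log IDs

def summarise_with_llm(window: list[dict]) -> SummaryChunk:
    # Placeholder for a real model call prompted to fill exactly these fields;
    # here it just stitches text together for illustration.
    return SummaryChunk(
        goal_constraints=" ".join(t["content"] for t in window)[:200],
        key_decision="(extracted by the model)",
        open_issues=[],
        next_steps=[],
        evidence_ids=[t["id"] for t in window],
    )

def maybe_compress(turns: list[dict], chunk_size: int = 10) -> SummaryChunk | None:
    """Compress the latest window once chunk_size new turns have accumulated."""
    if len(turns) == 0 or len(turns) % chunk_size != 0:
        return None
    return summarise_with_llm(turns[-chunk_size:])
```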
2.3 Smart Retrieval
Retrieval must produce usable context, not just any matching snippet. The engineering workflow splits retrieval into three phases:
Candidate Recall : vector similarity, keyword match, and structured filters (user, project, time window, tags).
Rerank : apply time decay, source trustworthiness, and memory‑type priority.
Injection Strategy : decide how many and which chunks to inject. A popular method is progressive disclosure:
Level 1: Recent 3 conversation summaries (~500 tokens)
Level 2: Relevant observation records (user‑queried)
Level 3: Full history search (mem‑search skill)
Level 1 covers most continuous dialogues; Level 2 is used when the user explicitly asks for details; Level 3 is a fallback for low‑confidence situations, keeping token usage under control.
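As a sketch, the three levels form an escalation ladder that only moves up when the current level is insufficient. `needs_details` and `estimate_confidence` below are crude illustrative stand-ins for real intent and confidence signals.

```python
def needs_details(query: str) -> bool:
    # Crude stand-in; a real system would use intent classification.
    return any(w in query.lower() for w in ("exactly", "detail", "which file"))

def estimate_confidence(context: list[str]) -> float:
    # Stand-in heuristic: confidence grows with retrieved coverage.
    return min(1.0, len(context) / 5)

def progressive_context(query: str,
                        recent_summaries: list[str],
                        search_observations,      # callable: str -> list[str]
                        search_full_history):     # callable: str -> list[str]
    """Escalate through disclosure levels only as needed."""
    # Level 1: cheap baseline -- the last three summaries (~500 tokens).
    context = recent_summaries[-3:]
    # Level 2: observation records, only when the user asks for detail.
    if needs_details(query):
        context = context + search_observations(query)
    # Level 3: full-history search as a low-confidence fallback.
    if estimate_confidence(context) < 0.5:
        context = context + search_full_history(query)
    return context
```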
3. Choosing a Storage Medium
Files : Simple and auditable via Git, but they handle concurrency poorly and require custom indexing for retrieval.
Knowledge Bases : Ideal for stable SOPs, product manuals, or FAQ content; not suited for high‑frequency writes because ingestion pipelines become a bottleneck.
Databases :
Structured DB (relational/document) for user memory, task state, and permission‑controlled data.
Vector DB for episodic memory and semantic search, though it inherits the three engineering challenges of noise, cost, and scalability.
A hybrid design stores editable user and task facts in a structured DB, keeps logs/evidence in files or a document store, and uses a vector DB only for semantic recall of high‑value events.
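One way to make the hybrid split explicit is a small router that assigns each record to a back end by type. The rules and store names below are illustrative placeholders for whatever relational, document, and vector systems you actually run.

```python
def route_record(record: dict) -> str:
    """Pick a back end per the hybrid design above (illustrative rules)."""
    kind = record["kind"]
    if kind in ("user_fact", "task_state"):
        return "structured_db"      # editable, permission-controlled
    if kind in ("tool_log", "file_change"):
        return "document_store"     # raw evidence, kept for audit
    if kind == "event_summary" and record.get("importance", 0) >= 0.7:
        return "vector_db"          # semantic recall of high-value events only
    return "document_store"

assert route_record({"kind": "user_fact"}) == "structured_db"
assert route_record({"kind": "event_summary", "importance": 0.9}) == "vector_db"
```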
4. Vector Database Usage Patterns
4.1 What to Remember?
Avoid indiscriminate logging. Apply three filters:
Hard Rules : Discard clearly irrelevant items (temporary files, one‑time caches, sensitive data).
Importance Scoring : Score candidates by task relevance, explicit user tags, frequency, and impact of the tool action.
Layering Strategy : Scores decide placement in hot, warm, or cold layers (e.g., MEMORY.md for curated long‑term facts).
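The three filters chain naturally into one placement function. The weights, signals, and thresholds below are assumptions to tune rather than recommendations.

```python
HARD_DROP = ("temp_file", "one_time_cache", "sensitive")

def importance(candidate: dict) -> float:
    """Weighted score from the signals above (weights are illustrative)."""
    return (0.4 * candidate.get("task_relevance", 0.0)
            + 0.3 * (1.0 if candidate.get("user_tagged") else 0.0)
            + 0.2 * min(candidate.get("frequency", 0) / 10, 1.0)
            + 0.1 * candidate.get("impact", 0.0))

def place(candidate: dict) -> str | None:
    if candidate["type"] in HARD_DROP:          # 1. hard rules: discard outright
        return None
    score = importance(candidate)               # 2. importance scoring
    if score >= 0.8:                            # 3. layering by score
        return "hot"        # or promote into curated MEMORY.md
    return "warm" if score >= 0.4 else "cold"
```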
4.2 How to Layer Storage?
Three layers are recommended:
Hot Layer : Recent N days + latest summaries; low latency, fast writes, lightweight indexing.
Warm Layer : Key summaries and decisions from the current project cycle; read‑heavy, more precise indexing.
Cold Layer : Archived long‑term history; accessed only on explicit user request or low‑confidence fallback.
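Making the layer policies explicit in code keeps retention and indexing cost visible. The numbers below are illustrative defaults, not prescriptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerPolicy:
    retention_days: int | None   # None = keep until explicitly purged
    index: str                   # indexing strategy / cost tier
    read_path: str               # when this layer is consulted

LAYERS = {
    "hot":  LayerPolicy(retention_days=7, index="lightweight",
                        read_path="every turn"),
    "warm": LayerPolicy(retention_days=90, index="precise",
                        read_path="reranked recall"),
    "cold": LayerPolicy(retention_days=None, index="batch/offline",
                        read_path="explicit request or low-confidence fallback"),
}
```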
4.3 Write Frequency vs. Indexing Cost
Separate immediate‑retrieval writes from high‑precision batch indexing:
New entries first go to a lightweight delta store (in‑memory cache or a real‑time collection with write‑optimized parameters).
Periodically run an asynchronous compaction to merge deltas into the main index.
During retrieval, query both delta and main stores and de‑duplicate results.
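A minimal sketch of that split, assuming a main index client with upsert and iteration methods; the exact API will depend on your vector store.

```python
class DualStore:
    """Write-optimized delta buffer in front of a batch-indexed main store."""

    def __init__(self, main_index):
        self.main = main_index            # e.g. a vector DB collection (assumed API)
        self.delta: dict[str, dict] = {}  # cheap, immediately searchable buffer

    def write(self, doc_id: str, doc: dict):
        self.delta[doc_id] = doc          # visible to retrieval right away

    def compact(self):
        """Periodic async job: fold deltas into the precisely indexed main store."""
        for doc_id, doc in self.delta.items():
            self.main.upsert(doc_id, doc)         # assumed upsert API
        self.delta.clear()

    def query(self, score_fn, top_k: int = 10):
        """Query both stores, de-duplicate by ID, keep the best-scoring docs."""
        hits = dict(self.delta)                   # delta wins on ID collisions
        for doc_id, doc in self.main.search_all():  # assumed iteration API
            hits.setdefault(doc_id, doc)
        ranked = sorted(hits.items(), key=lambda kv: score_fn(kv[1]), reverse=True)
        return ranked[:top_k]
```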
5. Non‑Vector Long‑Term Memory Strategies
5.1 User Memory Schema
Each user memory record should contain:
key (e.g., coding_lang)
value (e.g., Python)
source (explicit command, implicit extraction, admin UI)
updated_at timestamp
version for rollback
confidence / policy tag (auto‑injectable, sensitive, etc.)
Injection policy: only whitelisted keys are injected each turn; never dump the entire user profile.
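The schema maps directly onto a record type plus a whitelist-driven injection step. The whitelist contents below are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class UserMemory:
    key: str                 # e.g. "coding_lang"
    value: str               # e.g. "Python"
    source: str              # "explicit" | "implicit" | "admin_ui"
    updated_at: datetime     # audit trail: when it was written
    version: int             # enables rollback
    policy: str              # "auto_injectable" | "sensitive" | ...

INJECTION_WHITELIST = {"coding_lang", "timezone", "preferred_style"}  # illustrative

def injectable(records: list[UserMemory]) -> list[UserMemory]:
    """Only whitelisted, auto-injectable keys go into the prompt each turn."""
    return [r for r in records
            if r.key in INJECTION_WHITELIST and r.policy == "auto_injectable"]
```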
5.2 Event / Log Storage
Store tool‑call logs and file‑change records as documents (or in a log system). Optionally add a vector index for fast similarity search, but retain the original document for auditability.
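A sketch of one such tool-call document. The field layout is an assumption, and the optional vector_id merely points at a separate vector entry so the raw record stays intact for audit.

```python
import json
from datetime import datetime, timezone

def tool_event_doc(tool: str, params: dict, result: str,
                   side_effects: list[str], vector_id: str | None = None) -> str:
    """Serialize a tool event as an append-only, auditable document."""
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "params": params,
        "result": result,
        "side_effects": side_effects,   # file changes, config updates
        "vector_id": vector_id,         # optional pointer into the vector index
    })
```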
5.3 Hybrid Retrieval Pipeline
Effective non‑vector retrieval follows:
Structured filtering (user, project, time window, tags, source trust).
Similarity or full‑text recall.
Rerank with time decay, type weighting, and deduplication.
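A sketch of the rerank step, combining similarity with exponential time decay, type weighting, and deduplication. The half-life and weights are tunable assumptions, and each candidate's updated_at is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone

TYPE_WEIGHT = {"user_fact": 1.0, "task_state": 0.8, "event_log": 0.5}  # illustrative

def _score(c: dict, now: datetime, half_life_days: float) -> float:
    age_days = (now - c["updated_at"]).total_seconds() / 86400
    decay = 0.5 ** (age_days / half_life_days)        # exponential time decay
    return c["similarity"] * decay * TYPE_WEIGHT.get(c["type"], 0.3)

def rerank(candidates: list[dict], half_life_days: float = 14.0) -> list[dict]:
    """Order recalled candidates by similarity x recency x type priority."""
    now = datetime.now(timezone.utc)
    seen, ranked = set(), []
    for c in sorted(candidates,
                    key=lambda c: _score(c, now, half_life_days),
                    reverse=True):
        if c["id"] in seen:        # deduplicate across recall channels
            continue
        seen.add(c["id"])
        ranked.append(c)
    return ranked
```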
6. Usability Core of Long‑Term Memory
6.1 Fixed Injection Budget
Allocate a hard token budget per turn, for example:
User long‑term memory: 100–300 tokens.
Recent summaries: 300–800 tokens.
Retrieved chunks: 500–1500 tokens (adjusted by task importance).
Without a fixed budget, token overflow reduces the space for model reasoning and degrades response quality.
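One way to make the budget binding rather than advisory, using a rough four-characters-per-token heuristic you would swap for your real tokenizer:

```python
# Hard caps per section, mirroring the ranges above (upper bounds).
BUDGET = {"user_memory": 300, "recent_summaries": 800, "retrieved": 1500}

def rough_tokens(text: str) -> int:
    return max(1, len(text) // 4)   # crude heuristic; use a real tokenizer

def fill_budget(sections: dict[str, list[str]]) -> list[str]:
    """Greedily pack each section until its hard cap, then stop."""
    packed = []
    for name, cap in BUDGET.items():
        used = 0
        for chunk in sections.get(name, []):
            cost = rough_tokens(chunk)
            if used + cost > cap:
                break               # overflow would eat into reasoning space
            packed.append(chunk)
            used += cost
    return packed
```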
6.2 Conflict Handling
Never inject contradictory information. Resolve conflicts by:
Keeping only the latest version of a key.
Prioritising sources: explicit command > admin UI > implicit extraction.
Suppressing low‑confidence memories unless the user explicitly asks for them.
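The three rules compose into a single resolver. The source ranks and the 0.5 confidence floor below are illustrative.

```python
SOURCE_RANK = {"explicit": 3, "admin_ui": 2, "implicit": 1}  # command > UI > extraction

def resolve(records: list[dict], include_low_confidence: bool = False) -> dict[str, dict]:
    """Keep, per key, the newest record from the most trusted source."""
    winners: dict[str, dict] = {}
    for r in records:
        if r.get("confidence", 1.0) < 0.5 and not include_low_confidence:
            continue          # suppress low-confidence memories unless asked for
        cur = winners.get(r["key"])
        if cur is None:
            winners[r["key"]] = r
            continue
        r_rank = SOURCE_RANK.get(r["source"], 0)
        c_rank = SOURCE_RANK.get(cur["source"], 0)
        if r_rank > c_rank or (r_rank == c_rank and
                               r["updated_at"] > cur["updated_at"]):
            winners[r["key"]] = r
    return winners
```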
7. Summary
Long‑term memory for AI agents is a controllable, maintainable, and error‑correctable data pipeline rather than a raw dump of conversation history. By categorising memory (user facts, task state, event evidence), layering storage, and applying a three‑stage "Capture → AI Compression → Smart Retrieval/Injection" workflow, engineers can achieve consistent, cost‑effective, and reliable agent behaviour. Fixed injection budgets, progressive disclosure, and robust conflict resolution form the lower bound of system performance, while continuous monitoring of cost, quality, and safety metrics drives iterative improvement.
