How to Build a Robust Agent Memory System: Architecture, Management, and Evaluation
This article provides a comprehensive guide to designing, implementing, and evaluating an Agent Memory module for large‑language‑model assistants, covering memory types, short‑ and long‑term storage, conflict resolution, hybrid retrieval, compliance, and practical interview answers.
Why Agents Need Memory
In a corporate‑client banking assistant, users have multi‑turn interactions that require the system to retain preferences, qualifications, and historical questions. Without memory, users must repeat information, increasing dialogue rounds and reducing satisfaction. LLMs are stateless functions with limited context windows and no cross‑session memory, so a memory system upgrades an LLM from a stateless function to a stateful agent.
Cognitive‑Science View: Three Memory Types
Semantic Memory
General world knowledge not tied to a specific time or person (e.g., product descriptions, regulatory policies, common Q&A). Stored in a shared vector database and retrieved by semantic similarity.
"Ping An Bank corporate wealth product minimum investment is 1 million CNY, minimum holding period 30 days."Episodic Memory
Specific events bound to a user and time (e.g., individual user inquiries about insurance coverage, company capital, or product preferences). Must be stored per‑user and filtered by user_id before similarity search.
Procedural Memory
Operational rules and workflows (the "how to"). Implemented as system prompts injected at the start of each conversation; updates are made by modifying the prompt, not by retrieval.
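As a sketch of what "injected at the start of each conversation" means in practice, the snippet below prepends procedural rules as a system message before the dialogue history. The rule text and the build_messages helper are illustrative, not from a specific codebase:

```python
# Procedural memory as a system prompt, rebuilt at the start of each session.
# PROCEDURAL_RULES and build_messages are illustrative names.
PROCEDURAL_RULES = """You are a corporate-banking assistant.
- Always confirm the user's risk tolerance before recommending products.
- Never quote rates without citing the effective date."""

def build_messages(history: list, user_input: str) -> list:
    """Prepend the procedural rules, then the rolling dialogue window."""
    return ([{"role": "system", "content": PROCEDURAL_RULES}]
            + history
            + [{"role": "user", "content": user_input}])
```

Updating procedural memory then means editing PROCEDURAL_RULES, with no retrieval step involved.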
Short‑Term vs Long‑Term Memory
Short‑Term Memory (Conversation Window)
Keeps the most recent N dialogue turns in the LLM context window.
```python
class ShortTermMemory:
    def __init__(self, window_size: int = 10):
        self.window_size = window_size
        self.messages = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # One turn = a user message plus an assistant message, hence * 2.
        if len(self.messages) > self.window_size * 2:
            self.messages = self.messages[-self.window_size * 2:]

    def get_context(self) -> list:
        return self.messages
```

Short‑term memory disappears after the session ends, which is why long‑term storage is required.
Long‑Term Memory (Vector Database)
Implemented with Milvus. Records are partitioned by memory type and user_id. Example schema:
```python
schema = {
    "collection_name": "agent_memory",
    "fields": [
        {"name": "memory_id", "type": "VARCHAR", "max_length": 64},
        {"name": "user_id", "type": "VARCHAR", "max_length": 64},
        {"name": "memory_type", "type": "VARCHAR", "max_length": 32},
        {"name": "content", "type": "VARCHAR", "max_length": 2048},
        {"name": "embedding", "type": "FLOAT_VECTOR", "dim": 1536},
        {"name": "created_at", "type": "INT64"},
        {"name": "importance_score", "type": "FLOAT"},
        {"name": "ttl", "type": "INT64"},
        {"name": "is_deleted", "type": "BOOL"}
    ]
}
```

Semantic memories use a global user_id (e.g., "global") shared across users; episodic memories bind the actual user ID for isolation.
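A minimal sketch of how both memory types can be written into this schema. The `milvus_client` and `embed` parameters are placeholders for a pymilvus-style client and an embedding helper; neither name comes from the article:

```python
import time
import uuid

GLOBAL_USER = "global"  # shared partition key for semantic memories

def insert_memory(milvus_client, embed, content, memory_type, user_id=None):
    # Semantic facts are stored under the shared user_id; episodic facts
    # keep the real user_id so similarity search can be filtered per user.
    record = {
        "memory_id": str(uuid.uuid4()),
        "user_id": GLOBAL_USER if memory_type == "semantic" else user_id,
        "memory_type": memory_type,
        "content": content,
        "embedding": embed(content),
        "created_at": int(time.time()),
        "importance_score": 5.0,   # default; overwritten by the extractor
        "ttl": -1,                 # -1 = never expires
        "is_deleted": False,
    }
    milvus_client.insert("agent_memory", [record])
    return record
```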
Memory Extraction at Conversation End
After each session, a prompt extracts valuable episodic facts into JSON, assigning an importance score (1‑10) that later influences retrieval weighting.
MEMORY_EXTRACTION_PROMPT = """
You are a memory extraction assistant. Extract from the conversation only:
1. Explicit user preferences
2. Basic user info (company size, industry, location)
3. Important decisions or requirement changes
4. Key events useful for future queries
Do NOT extract trivial Q&A or system replies.
Conversation:
{conversation}
Return JSON with fields: content, memory_type, importance (1‑10).
"""Key Research Papers
Generative Agents (2023)
Introduces Memory Stream, Reflection, and Planning. Two engineering‑relevant components:
Importance scoring: LLM rates each new memory 1‑10; retrieval combines semantic similarity, recency decay, and importance.
```python
def compute_retrieval_score(memory, query_embedding, current_time, decay_rate=0.995):
    semantic_score = cosine_similarity(query_embedding, memory["embedding"])
    hours_passed = (current_time - memory["created_at"]) / 3600
    recency_score = decay_rate ** hours_passed
    importance_score = memory["importance_score"] / 10.0
    alpha, beta, gamma = 0.4, 0.3, 0.3
    return alpha * semantic_score + beta * recency_score + gamma * importance_score
```

Reflection triggers when accumulated importance exceeds a threshold, prompting the LLM to summarize multiple episodic memories into higher‑level abstractions.
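The reflection trigger can be sketched as follows; the threshold value and the `summarize_llm` callable are illustrative assumptions, not taken from the paper's code:

```python
REFLECTION_THRESHOLD = 30.0  # illustrative; tune per application

def maybe_reflect(recent_memories, summarize_llm):
    """When recent memories accumulate enough importance, distill them
    into one higher-level 'reflection' memory; otherwise do nothing."""
    total = sum(m["importance_score"] for m in recent_memories)
    if total < REFLECTION_THRESHOLD:
        return None
    facts = "\n".join(m["content"] for m in recent_memories)
    summary = summarize_llm(
        f"Summarize these observations into one high-level insight:\n{facts}"
    )
    return {"content": summary, "memory_type": "reflection",
            "importance_score": 8.0}
```

The returned reflection is stored like any other memory, so later retrieval can surface the abstraction instead of many raw episodes.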
MemGPT (2023)
Applies virtual‑memory concepts to LLMs, providing self‑managed read/write calls such as store_memory(content) and recall_memory(query).
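A toy sketch of this self-managed loop, assuming the LLM emits JSON-shaped tool calls that a dispatcher executes. Only the store_memory/recall_memory names come from the article; the dispatcher and in-memory archive are illustrative:

```python
class MemoryTools:
    """Stand-in for MemGPT's external (out-of-context) storage tier."""
    def __init__(self):
        self.archive = []

    def store_memory(self, content: str) -> str:
        self.archive.append(content)
        return "stored"

    def recall_memory(self, query: str) -> list:
        # Real MemGPT searches by embedding; substring match keeps this runnable.
        return [c for c in self.archive if query.lower() in c.lower()]

def dispatch(tools: MemoryTools, call: dict):
    """Execute one tool call emitted by the LLM,
    e.g. {"name": "store_memory", "args": {"content": "..."}}."""
    return getattr(tools, call["name"])(**call["args"])
```

The key idea is that the model, not the host application, decides when to page information in and out of its limited context window.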
Mem0 (2025)
Shows production‑grade gains: 26% higher accuracy, 91% lower latency, and >90% token cost reduction versus a baseline OpenAI approach, thanks to intelligent deduplication and compression.
Mem0 Framework: Four Memory Operations
```python
from mem0 import Memory

memory = Memory()

# ADD
memory.add("User prefers low-risk products, rejects stocks", user_id="user_001")

# UPDATE
memory.update(memory_id="mem_xxx", data="User upgraded to VIP, credit limit 200k")

# DELETE
memory.delete(memory_id="mem_xxx")  # e.g., after account closure

# NOOP – handled automatically when content is unchanged
```

The framework decides the operation by comparing new info with existing memories using semantic similarity (>0.85) and LLM confirmation.
MEMORY_DECISION_PROMPT = """
You are a memory manager. Determine the action for new info:
- ADD: brand‑new information
- UPDATE: same entity, changed content
- DELETE: information is now invalid
- NOOP: identical to existing memory
Provide JSON: {"action": "...", "target_memory_id": "...", "reason": "..."}
"""Handling Memory Conflicts
When a user changes a fact (e.g., number of children), the system runs a semantic‑similarity check; if similarity > 0.85, it treats the update as UPDATE rather than adding a contradictory record.
```python
async def check_memory_conflict(new_memory, existing_memories, similarity_threshold=0.85):
    if not existing_memories:
        return {"action": "ADD", "conflict_memory_id": None}
    new_emb = await embedder.aembed_query(new_memory)
    for mem in existing_memories:
        if cosine_similarity(new_emb, mem["embedding"]) > similarity_threshold:
            if await llm_confirm_update(new_memory, mem["content"]):
                return {"action": "UPDATE", "conflict_memory_id": mem["memory_id"]}
    return {"action": "ADD", "conflict_memory_id": None}
```

TTL (Time‑to‑Live) Management
Time‑sensitive facts receive a TTL; a daily cleanup job marks expired records as soft‑deleted.
```python
import time
import uuid

def add_memory_with_ttl(content, user_id, ttl_days=-1):
    ttl_timestamp = -1  # -1 = never expires
    if ttl_days > 0:
        ttl_timestamp = int(time.time()) + ttl_days * 86400
    record = {
        "memory_id": str(uuid.uuid4()),
        "user_id": user_id,
        "content": content,
        "created_at": int(time.time()),
        "ttl": ttl_timestamp,
        "is_deleted": False,
    }
    milvus_client.insert("agent_memory", record)

async def cleanup_expired_memories():
    now = int(time.time())
    expired = milvus_client.query(
        collection_name="agent_memory",
        filter=f"ttl > 0 && ttl < {now} && is_deleted == false"
    )
    for mem in expired:
        milvus_client.update(
            collection_name="agent_memory",
            filter=f"memory_id == '{mem['memory_id']}'",
            data={"is_deleted": True}
        )
```

Privacy & "Right to be Forgotten"
For compliance, a soft‑delete flag plus an immutable audit log satisfy GDPR‑like requirements.
```python
import time

async def forget_user(user_id, operator, reason):
    # Soft-delete every memory belonging to the user...
    milvus_client.update(
        collection_name="agent_memory",
        filter=f"user_id == '{user_id}' && is_deleted == false",
        data={"is_deleted": True}
    )
    # ...and record who requested the erasure, and why.
    audit_log = {
        "operation": "USER_FORGET",
        "user_id": user_id,
        "operator": operator,
        "reason": reason,
        "timestamp": int(time.time()),
    }
    audit_db.insert(audit_log)
    return {"status": "success", "message": f"Deleted all memories of {user_id}"}
```

Hybrid Retrieval Formula
Combines semantic similarity, recency decay, and importance weighting.
```python
import time

def hybrid_memory_retrieval(query, user_id, top_k=5, alpha=0.4, beta=0.3, gamma=0.3):
    # Over-fetch candidates by semantic similarity, then re-rank hybridly.
    results = milvus_client.search(
        collection_name="agent_memory",
        data=[get_embedding(query)],
        filter=f"user_id == '{user_id}' && is_deleted == false",
        limit=top_k * 3,
        output_fields=["memory_id", "content", "created_at", "importance_score"]
    )
    now = time.time()
    scored = []
    for hit in results[0]:  # one result list per query vector
        mem = hit["entity"]
        semantic_score = hit["distance"]  # similarity under a COSINE/IP metric
        recency = 0.995 ** ((now - mem["created_at"]) / 3600)
        importance = mem["importance_score"] / 10.0
        final = alpha * semantic_score + beta * recency + gamma * importance
        scored.append((mem, final))
    scored.sort(key=lambda x: x[1], reverse=True)
    return [m for m, _ in scored[:top_k]]
```

This reduces retrieval of outdated but semantically similar memories.
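A worked example of why the recency term matters; the numbers are illustrative, but they show a month-old memory losing to a fresher, slightly less similar one under the same weights:

```python
def hybrid_score(semantic, hours_old, importance, alpha=0.4, beta=0.3, gamma=0.3):
    # Same formula as above: weighted similarity + recency decay + importance.
    recency = 0.995 ** hours_old
    return alpha * semantic + beta * recency + gamma * (importance / 10.0)

# A month-old memory with higher similarity loses to a two-hour-old one:
stale = hybrid_score(semantic=0.95, hours_old=24 * 30, importance=5)
fresh = hybrid_score(semantic=0.85, hours_old=2, importance=5)
assert fresh > stale
```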
Memory Strength Update (Ebbinghaus Forgetting)
Each successful retrieval slightly boosts the importance score; unused memories decay daily and are eventually soft‑deleted.
```python
import time

def update_memory_strength(memory_id, was_retrieved):
    mem = milvus_client.get(memory_id)
    if was_retrieved:
        # Reinforcement: each successful recall strengthens the memory.
        milvus_client.update(memory_id, {
            "last_accessed": time.time(),
            "importance_score": min(10.0, mem["importance_score"] + 0.5)
        })
    else:
        # Decay: unused memories weaken along an Ebbinghaus-style curve.
        days = (time.time() - mem["last_accessed"]) / 86400
        new_imp = mem["importance_score"] * (0.95 ** days)
        if new_imp < 1.0:
            milvus_client.update(memory_id, {"is_deleted": True})
        else:
            milvus_client.update(memory_id, {"importance_score": new_imp})
```

Evaluation with LOCOMO Benchmark
Four dimensions are measured:
Retrieval accuracy (Top‑5 recall)
Information timeliness (conflict detection rate)
Privacy isolation (zero cross‑user leakage)
Storage efficiency (ratio of useful memories)
In the banking project, hybrid retrieval raised Top‑5 recall from 71% to 87%, and conflict detection reached 92%.
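Top‑5 recall can be measured with a small harness like the one below. The test-case format and the `retrieve` callable are assumptions for illustration, since LOCOMO's own tooling is not shown here:

```python
def top_k_recall(test_cases, retrieve, k=5):
    """test_cases: [{'query': str, 'relevant_ids': set}, ...]
    retrieve(query, k) -> ranked list of memory_id strings."""
    hits, total = 0, 0
    for case in test_cases:
        retrieved = set(retrieve(case["query"], k))
        hits += len(retrieved & case["relevant_ids"])
        total += len(case["relevant_ids"])
    return hits / total if total else 0.0
```

Running the same harness against pure semantic search and against the hybrid ranker makes improvements like 71% → 87% directly comparable.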
Interview‑Ready Answer Framework
Classification: Explain semantic, episodic, and procedural memory.
Architecture: Short‑term sliding window + long‑term Milvus partitions with importance scores.
Management: ADD/UPDATE/DELETE/NOOP logic, TTL cleanup, forgetting curve, soft‑delete + audit for compliance.
Retrieval: Hybrid scoring (semantic + recency + importance) before injecting into the system prompt.
Compliance: Right‑to‑be‑forgotten implementation.
Evaluation: Cite LOCOMO metrics and observed improvements.
Wu Shixiong's Large Model Academy
We continuously share large‑model know‑how, helping you master core skills—LLM, RAG, fine‑tuning, deployment—from zero to job offer, tailored for career‑switchers, autumn recruiters, and those seeking stable large‑model positions.