How to Design an Effective Agent Memory System for Enterprise AI Assistants
This article explains why AI agents need a structured memory module, outlines three memory types from cognitive science, details short‑term and long‑term storage architectures using vector databases, and provides concrete code and management strategies—including conflict resolution, TTL expiration, and privacy compliance—to build a robust Agent Memory system.
Why Agents Need Memory
In enterprise banking assistants like "拓业智询", users have complex, long‑term needs; without memory they must repeat information across sessions, which degrades the experience. In our experiments, adding a memory module reduced the average number of dialogue turns by 2.1 and raised satisfaction by 23%.
Cognitive‑Science Perspective: Three Memory Types
Semantic Memory
General knowledge about the world, shared across users. Examples include product specifications, regulatory policies, and common FAQs. Stored in a shared vector database and retrieved by semantic similarity.
"The minimum purchase amount for Ping An Bank's corporate wealth‑management products is 1 million RMB, with a minimum holding period of 30 days."
Episodic Memory
User‑specific historical facts tied to a person or entity, such as a user’s previous insurance queries or company details. Requires isolation by user_id and high‑frequency updates.
User Zhang San asked about critical‑illness insurance coverage in March 2024
User Li Si's company has registered capital of 50 million RMB, a mid‑sized corporate client
User Wang Wu explicitly stated he does not accept high‑risk equity products
User Zhao Liu mentioned in his last inquiry that his company plans to expand into cross‑border business
Procedural Memory
Operational rules and workflows (the "how to"), such as claim processing steps or product recommendation priorities. Implemented as static system prompts injected at the start of each conversation.
For claims inquiries, check the policy status first, then the relevant clauses
For high‑net‑worth clients, prioritize the private‑banking product line
On a client's first inquiry, confirm the company type and place of registration
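Because procedural memory is static, it can simply be concatenated into the base system prompt when a session starts. A minimal sketch (the rule strings and the `build_system_prompt` helper are illustrative, not part of any framework):

```python
# Procedural memory: static workflow rules injected into every session's system prompt.
PROCEDURAL_RULES = [
    "For claims inquiries, check the policy status first, then the relevant clauses.",
    "For high-net-worth clients, prioritize the private-banking product line.",
    "On a client's first inquiry, confirm the company type and place of registration.",
]

def build_system_prompt(base_prompt: str) -> str:
    # Rules are constant per deployment, so this runs once at conversation start.
    rules = "\n".join(f"- {r}" for r in PROCEDURAL_RULES)
    return f"{base_prompt}\n\nWorkflow rules:\n{rules}"

prompt = build_system_prompt("You are a corporate banking assistant.")
```

Unlike episodic memory, nothing here is retrieved or updated at runtime; changing a workflow means redeploying the prompt.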
Short‑Term vs Long‑Term Memory Architecture
Short‑Term Memory (Conversation Window)
Keeps the most recent N dialogue turns in the LLM context window using a sliding window.
class ShortTermMemory:
    def __init__(self, window_size: int = 10):
        self.window_size = window_size
        self.messages = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # Each turn is a user/assistant pair, hence window_size * 2 messages.
        if len(self.messages) > self.window_size * 2:
            self.messages = self.messages[-self.window_size * 2:]

    def get_context(self) -> list:
        return self.messages
Long‑Term Memory (Vector Database)
Stored in Milvus with a schema that partitions by memory_type and user_id to ensure isolation and efficient retrieval.
# Milvus collection schema
schema = {
    "collection_name": "agent_memory",
    "fields": [
        {"name": "memory_id", "type": "VARCHAR", "max_length": 64},
        {"name": "user_id", "type": "VARCHAR", "max_length": 64},
        {"name": "memory_type", "type": "VARCHAR", "max_length": 32},
        {"name": "content", "type": "VARCHAR", "max_length": 2048},
        {"name": "embedding", "type": "FLOAT_VECTOR", "dim": 1536},
        {"name": "created_at", "type": "INT64"},
        {"name": "ttl", "type": "INT64"},
        {"name": "is_deleted", "type": "BOOL"}
    ]
}
Semantic memory uses a shared global user_id value, while episodic memory stores the actual user ID so retrieval can filter per user.
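One way to honor that convention is a small filter builder. The `__global__` sentinel ID and the helper below are assumptions for illustration, not a Milvus API:

```python
# Build a Milvus filter expression for memory retrieval.
# Semantic memories live under a shared sentinel ID ("__global__" is an
# assumed convention); episodic memories are filtered by the real user_id.
GLOBAL_USER_ID = "__global__"

def build_memory_filter(user_id: str, memory_type: str) -> str:
    owner = GLOBAL_USER_ID if memory_type == "semantic" else user_id
    return (
        f"user_id == '{owner}' && memory_type == '{memory_type}' "
        f"&& is_deleted == false"
    )

expr = build_memory_filter("user_001", "episodic")
```

Centralizing the expression in one helper keeps the per-user isolation rule from being forgotten in any single query path.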
Memory Extraction Prompt
MEMORY_EXTRACTION_PROMPT = """You are a memory-extraction assistant. From the conversation below, extract information worth remembering long term.
Extract only the following kinds of information:
1. Preferences the user has explicitly expressed
2. The user's basic information
3. Important user decisions or changes in requirements
4. Key events that will be useful for future inquiries
Do not extract ordinary Q&A, system replies, or small talk.
Conversation: {conversation}
Output JSON; each memory contains: content, memory_type (episodic/semantic)
"""

import json

async def extract_memories(conversation: list, user_id: str) -> list:
    prompt = MEMORY_EXTRACTION_PROMPT.format(
        conversation="\n".join(f"{m['role']}: {m['content']}" for m in conversation)
    )
    response = await llm.ainvoke(prompt)
    memories = json.loads(response.content)
    return memories
Mem0 Framework: Four Memory Operations
# Mem0 memory management interface
from mem0 import Memory

memory = Memory()

# ADD: store new information
memory.add("User prefers low-risk products and does not accept equity investments", user_id="user_001")

# UPDATE: revise changed information
memory.update(memory_id="mem_xxx", data="User upgraded to VIP; credit line raised from 500,000 to 2 million RMB")

# DELETE: remove obsolete information
memory.delete(memory_id="mem_xxx")

# NOOP: information already on record, no action needed
The framework decides which operation to apply by comparing the new information with existing memories via LLM‑driven semantic similarity.
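The resulting decision can then be applied with a small dispatcher. `apply_memory_action` and the dict-backed store below are hypothetical stand-ins for the real Mem0 backend:

```python
import uuid

def apply_memory_action(store: dict, decision: dict, new_info: str) -> str:
    """Apply an LLM decision ({"action", "target_memory_id", "reason"}) to a
    toy memory store (memory_id -> content). Returns the affected memory_id."""
    action = decision["action"]
    if action == "ADD":
        mem_id = str(uuid.uuid4())
        store[mem_id] = new_info
        return mem_id
    if action == "UPDATE":
        store[decision["target_memory_id"]] = new_info
        return decision["target_memory_id"]
    if action == "DELETE":
        store.pop(decision["target_memory_id"], None)
        return decision["target_memory_id"]
    return ""  # NOOP: nothing to write

store = {"mem_1": "User prefers low-risk products"}
apply_memory_action(
    store,
    {"action": "UPDATE", "target_memory_id": "mem_1", "reason": "risk appetite changed"},
    "User now accepts medium-risk products",
)
```

Keeping the write path behind one function also gives a single place to attach audit logging later.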
MEMORY_DECISION_PROMPT = """You are a memory-management assistant. Judge how the new information relates to the existing memories and decide which operation to apply.
New information: {new_info}
Existing memories (top 5 by relevance): {existing_memories}
Decide among:
- ADD: entirely new information
- UPDATE: existing information has changed
- DELETE: information is no longer valid
- NOOP: information is identical
Output JSON: {{"action": "ADD/UPDATE/DELETE/NOOP", "target_memory_id": "...", "reason": "..."}}
"""
Conflict Handling Strategies
Semantic deduplication: compare embeddings; a similarity above 0.85 triggers an UPDATE check instead of a blind ADD.
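The `cosine_similarity` helper used by the conflict check is not spelled out in the snippet; a dependency-free version might look like:

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    # Dot product of the two vectors over the product of their L2 norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical directions score 1.0; orthogonal directions score 0.0.
same = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orth = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

In production the embeddings come from the same model used at write time; mixing embedding models invalidates the 0.85 threshold.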
async def check_memory_conflict(new_memory: str, existing_memories: list,
                                similarity_threshold: float = 0.85) -> dict:
    if not existing_memories:
        return {"action": "ADD", "conflict_memory_id": None}
    new_embedding = await embedder.aembed_query(new_memory)
    for mem in existing_memories:
        similarity = cosine_similarity(new_embedding, mem["embedding"])
        if similarity > similarity_threshold:
            # High similarity alone is ambiguous; let the LLM confirm it is an update.
            is_update = await llm_confirm_update(new_memory, mem["content"])
            if is_update:
                return {"action": "UPDATE", "conflict_memory_id": mem["memory_id"]}
    return {"action": "ADD", "conflict_memory_id": None}
TTL and Expiration
import time
import uuid

# Add a memory with TTL (in days; -1 = permanent)
async def add_memory_with_ttl(content: str, user_id: str, ttl_days: int = -1):
    ttl_timestamp = -1
    if ttl_days > 0:
        ttl_timestamp = int(time.time()) + ttl_days * 86400
    memory_record = {
        "memory_id": str(uuid.uuid4()),
        "user_id": user_id,
        "content": content,
        "created_at": int(time.time()),
        "ttl": ttl_timestamp,
        "is_deleted": False
    }
    milvus_client.insert("agent_memory", memory_record)

# Daily cleanup task: soft-delete memories whose TTL has passed
async def cleanup_expired_memories():
    current_time = int(time.time())
    expired = milvus_client.query(
        collection_name="agent_memory",
        filter=f"ttl > 0 && ttl < {current_time} && is_deleted == false"
    )
    for mem in expired:
        mem["is_deleted"] = True
        milvus_client.upsert(collection_name="agent_memory", data=[mem])
Privacy & "Right to be Forgotten"
async def forget_user(user_id: str, operator: str, reason: str):
    # Fetch first so the audit log can record how many memories were removed.
    records = milvus_client.query(
        collection_name="agent_memory",
        filter=f"user_id == '{user_id}' && is_deleted == false"
    )
    for mem in records:
        mem["is_deleted"] = True
    if records:
        milvus_client.upsert(collection_name="agent_memory", data=records)
    audit_log = {
        "operation": "USER_FORGET",
        "user_id": user_id,
        "operator": operator,
        "reason": reason,
        "timestamp": int(time.time()),
        "memory_count": len(records)
    }
    audit_db.insert(audit_log)
    return {"status": "success", "message": f"Deleted all memory data for user {user_id}"}
Audit logs remain for compliance while the actual memory content is soft‑deleted.
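Soft deletion only protects users if every read path excludes flagged rows. A small guard helper (an assumed convention, not a Milvus API) makes that hard to forget:

```python
# Append the soft-delete guard to any filter expression so no query path
# can accidentally surface forgotten memories.
def with_active_guard(filter_expr: str) -> str:
    guard = "is_deleted == false"
    return f"({filter_expr}) && {guard}" if filter_expr else guard

expr = with_active_guard("user_id == 'user_001'")
```

Routing every `query`/`search` call through this wrapper is cheaper than auditing each call site individually.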
Memory Retrieval & Prompt Injection
async def retrieve_memory(query: str, user_id: str) -> str:
    relevant_memories = memory.search(query=query, user_id=user_id, limit=5)
    if not relevant_memories:
        return ""
    memory_context = "\n".join(
        f"- {m['memory']} (recorded at {m['created_at']})" for m in relevant_memories
    )
    return ("[User history] The following are this user's past preferences and "
            f"key facts; consult them when answering:\n{memory_context}")

async def chat(user_message: str, user_id: str, session_id: str):
    memory_context = await retrieve_memory(user_message, user_id)
    system_prompt = BASE_SYSTEM_PROMPT
    if memory_context:
        system_prompt += "\n" + memory_context
    response = await llm.ainvoke(messages=[
        {"role": "system", "content": system_prompt},
        *short_term_memory.get_context(),
        {"role": "user", "content": user_message}
    ])
    short_term_memory.add_message("user", user_message)
    short_term_memory.add_message("assistant", response.content)
    return response.content
Key points: always filter by user_id before the vector search, cap the number of injected memories to fit the prompt budget, and attach timestamps so the LLM can judge recency.
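The prompt-budget point can be sketched as a greedy cutoff over the ranked results; `select_memories_within_budget` is a hypothetical helper using character count as a rough proxy for tokens:

```python
def select_memories_within_budget(memories: list, max_chars: int) -> list:
    # Memories arrive ranked by relevance; keep the top ones until the
    # character budget (a rough stand-in for the token budget) is exhausted.
    selected, used = [], 0
    for m in memories:
        cost = len(m["memory"])
        if used + cost > max_chars:
            break
        selected.append(m)
        used += cost
    return selected

ranked = [{"memory": "a" * 40}, {"memory": "b" * 40}, {"memory": "c" * 40}]
kept = select_memories_within_budget(ranked, max_chars=100)
```

A real deployment would count tokens with the model's tokenizer, but the greedy shape is the same.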
Four‑Step Memory Management Process
Define : Identify valuable information (preferences, basic info, key decisions, actionable context).
Write : Extract with LLM and store in the appropriate partition (episodic → vector DB, semantic → shared DB, procedural → system prompt).
Manage : Before writing, run conflict detection to choose ADD/UPDATE/DELETE/NOOP; apply TTL for time‑sensitive data; handle soft deletion for compliance.
Read : At conversation start, retrieve top‑k relevant long‑term memories, format them, inject into the system prompt, then combine with short‑term context and call the LLM.
Interview Answer Blueprint
When asked about Agent Memory design, structure your response around the five layers presented: classification of memory types, dual‑layer architecture (short‑term + long‑term), management operations (Mem0 actions), retrieval injection, and compliance considerations.
Common Pitfall: Mixing Memory with RAG
RAG retrieves generic knowledge from a knowledge base, while Memory retrieves user‑specific historical facts. Both are needed but serve distinct purposes and should not be conflated.
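One way to keep the two sources distinct in practice is to render them as separately labeled prompt sections; `compose_context` is an illustrative helper, not an API from the article:

```python
def compose_context(knowledge_chunks: list, user_memories: list) -> str:
    # Label the two sources so the model can tell shared product knowledge
    # (RAG) apart from this specific user's history (Memory).
    parts = []
    if knowledge_chunks:
        parts.append("[Knowledge base]\n" + "\n".join(f"- {c}" for c in knowledge_chunks))
    if user_memories:
        parts.append("[User history]\n" + "\n".join(f"- {m}" for m in user_memories))
    return "\n\n".join(parts)

ctx = compose_context(
    ["Minimum purchase for corporate wealth products: 1M RMB"],
    ["User prefers low-risk products"],
)
```

The two retrievers also stay separate at the storage level: the knowledge base is shared and versioned, while memories are per-user and mutable.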
Summary
The design can be captured in three dimensions:
Classification : Semantic, Episodic, Procedural – each with its own storage and retrieval strategy.
Architecture : Short‑term sliding window + Long‑term vector DB partitioned by user_id.
Management : Conflict detection (ADD/UPDATE/DELETE/NOOP), TTL‑based expiration, and soft‑delete with audit logs for the right‑to‑be‑forgotten.
Answering the interview questions about where memory lives, how it is retrieved, when it is updated, and when it is deleted becomes straightforward with this framework.
Wu Shixiong's Large Model Academy
We continuously share large‑model know‑how — LLMs, RAG, fine‑tuning, deployment — helping career‑switchers, students in autumn campus recruitment, and anyone seeking a stable large‑model position go from zero to job offer.