How to Design an Effective Agent Memory System for Enterprise AI Assistants
This article explains why AI agents need a structured memory module, outlines three memory types from cognitive science, details short‑term and long‑term storage architectures using vector databases, and provides concrete code and management strategies—including conflict resolution, TTL expiration, and privacy compliance—to build a robust Agent Memory system.
Why Agents Need Memory
In enterprise banking assistants like "拓业智询", users have complex, long‑term needs; without memory they must repeat information across sessions, which degrades the experience. In our experiments, adding a memory module reduced the average number of dialogue turns by 2.1 and raised satisfaction by 23%.
Cognitive‑Science Perspective: Three Memory Types
Semantic Memory
General knowledge about the world, shared across users. Examples include product specifications, regulatory policies, and common FAQs. Stored in a shared vector database and retrieved by semantic similarity.
"The minimum purchase amount for Ping An Bank's corporate wealth‑management products is 1 million RMB, with a minimum holding period of 30 days."
Episodic Memory
User‑specific historical facts tied to a person or entity, such as a user’s previous insurance queries or company details. Requires isolation by user_id and high‑frequency updates.
User Zhang San asked about critical‑illness insurance coverage in March 2024
User Li Si's company has registered capital of 50 million RMB, a mid‑sized corporate client
User Wang Wu explicitly stated he does not accept high‑risk equity products
User Zhao Liu mentioned in his last inquiry that his company plans to expand into cross‑border business
Procedural Memory
Operational rules and workflows (the "how to"), such as claim processing steps or product recommendation priorities. Implemented as static system prompts injected at the start of each conversation.
For claims inquiries, check the policy status first, then the relevant clauses
For high‑net‑worth clients, prioritize the private‑banking product line
On a client's first inquiry, confirm the company type and place of registration
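Because procedural memory is static, it can simply be concatenated into the base system prompt when a session starts. A minimal sketch (the rule strings and the `build_system_prompt` helper are illustrative, not part of any framework):

```python
# Procedural memory: static workflow rules injected into every session's system prompt.
PROCEDURAL_RULES = [
    "For claims inquiries, check the policy status first, then the relevant clauses.",
    "For high-net-worth clients, prioritize the private-banking product line.",
    "On a client's first inquiry, confirm the company type and place of registration.",
]

def build_system_prompt(base_prompt: str) -> str:
    # Rules are constant per deployment, so this runs once at conversation start.
    rules = "\n".join(f"- {r}" for r in PROCEDURAL_RULES)
    return f"{base_prompt}\n\nWorkflow rules:\n{rules}"

prompt = build_system_prompt("You are a corporate banking assistant.")
```

Unlike episodic memory, nothing here is retrieved or updated at runtime; changing a workflow means redeploying the prompt.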
Short‑Term vs Long‑Term Memory Architecture
Short‑Term Memory (Conversation Window)
Keeps the most recent N dialogue turns in the LLM context window using a sliding window.
class ShortTermMemory:
    def __init__(self, window_size: int = 10):
        self.window_size = window_size
        self.messages = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # Each turn is a user/assistant pair, hence window_size * 2 messages.
        if len(self.messages) > self.window_size * 2:
            self.messages = self.messages[-self.window_size * 2:]

    def get_context(self) -> list:
        return self.messages
Long‑Term Memory (Vector Database)
Stored in Milvus with a schema that partitions by memory_type and user_id to ensure isolation and efficient retrieval.
# Milvus collection schema
schema = {
    "collection_name": "agent_memory",
    "fields": [
        {"name": "memory_id", "type": "VARCHAR", "max_length": 64},
        {"name": "user_id", "type": "VARCHAR", "max_length": 64},
        {"name": "memory_type", "type": "VARCHAR", "max_length": 32},
        {"name": "content", "type": "VARCHAR", "max_length": 2048},
        {"name": "embedding", "type": "FLOAT_VECTOR", "dim": 1536},
        {"name": "created_at", "type": "INT64"},
        {"name": "ttl", "type": "INT64"},
        {"name": "is_deleted", "type": "BOOL"}
    ]
}
Semantic memory uses a shared global user_id value, while episodic memory stores the actual user ID so retrieval can filter per user.
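One way to honor that convention is a small filter builder. The `__global__` sentinel ID and the helper below are assumptions for illustration, not a Milvus API:

```python
# Build a Milvus filter expression for memory retrieval.
# Semantic memories live under a shared sentinel ID ("__global__" is an
# assumed convention); episodic memories are filtered by the real user_id.
GLOBAL_USER_ID = "__global__"

def build_memory_filter(user_id: str, memory_type: str) -> str:
    owner = GLOBAL_USER_ID if memory_type == "semantic" else user_id
    return (
        f"user_id == '{owner}' && memory_type == '{memory_type}' "
        f"&& is_deleted == false"
    )

expr = build_memory_filter("user_001", "episodic")
```

Centralizing the expression in one helper keeps the per-user isolation rule from being forgotten in any single query path.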
Memory Extraction Prompt
MEMORY_EXTRACTION_PROMPT = """You are a memory-extraction assistant. From the conversation below, extract information worth remembering long term.
Extract only the following kinds of information:
1. Preferences the user has explicitly expressed
2. The user's basic information
3. Important user decisions or changes in requirements
4. Key events that will be useful for future inquiries
Do not extract ordinary Q&A, system replies, or small talk.
Conversation: {conversation}
Output JSON; each memory contains: content, memory_type (episodic/semantic)
"""

import json

async def extract_memories(conversation: list, user_id: str) -> list:
    prompt = MEMORY_EXTRACTION_PROMPT.format(
        conversation="\n".join(f"{m['role']}: {m['content']}" for m in conversation)
    )
    response = await llm.ainvoke(prompt)
    memories = json.loads(response.content)
    return memories
Mem0 Framework: Four Memory Operations
# Mem0 memory management interface
from mem0 import Memory

memory = Memory()

# ADD: store new information
memory.add("User prefers low-risk products and does not accept equity investments", user_id="user_001")

# UPDATE: revise changed information
memory.update(memory_id="mem_xxx", data="User upgraded to VIP; credit line raised from 500,000 to 2 million RMB")

# DELETE: remove obsolete information
memory.delete(memory_id="mem_xxx")

# NOOP: information already on record, no action needed
The framework decides which operation to apply by comparing the new information with existing memories via LLM‑driven semantic similarity.
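The resulting decision can then be applied with a small dispatcher. `apply_memory_action` and the dict-backed store below are hypothetical stand-ins for the real Mem0 backend:

```python
import uuid

def apply_memory_action(store: dict, decision: dict, new_info: str) -> str:
    """Apply an LLM decision ({"action", "target_memory_id", "reason"}) to a
    toy memory store (memory_id -> content). Returns the affected memory_id."""
    action = decision["action"]
    if action == "ADD":
        mem_id = str(uuid.uuid4())
        store[mem_id] = new_info
        return mem_id
    if action == "UPDATE":
        store[decision["target_memory_id"]] = new_info
        return decision["target_memory_id"]
    if action == "DELETE":
        store.pop(decision["target_memory_id"], None)
        return decision["target_memory_id"]
    return ""  # NOOP: nothing to write

store = {"mem_1": "User prefers low-risk products"}
apply_memory_action(
    store,
    {"action": "UPDATE", "target_memory_id": "mem_1", "reason": "risk appetite changed"},
    "User now accepts medium-risk products",
)
```

Keeping the write path behind one function also gives a single place to attach audit logging later.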
MEMORY_DECISION_PROMPT = """You are a memory-management assistant. Judge how the new information relates to the existing memories and decide which operation to apply.
New information: {new_info}
Existing memories (top 5 by relevance): {existing_memories}
Decide among:
- ADD: entirely new information
- UPDATE: existing information has changed
- DELETE: information is no longer valid
- NOOP: information is identical
Output JSON: {{"action": "ADD/UPDATE/DELETE/NOOP", "target_memory_id": "...", "reason": "..."}}
"""
Conflict Handling Strategies
Semantic deduplication: compare embeddings; a similarity above 0.85 triggers an UPDATE check instead of a blind ADD.
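The `cosine_similarity` helper used by the conflict check is not spelled out in the snippet; a dependency-free version might look like:

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    # Dot product of the two vectors over the product of their L2 norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical directions score 1.0; orthogonal directions score 0.0.
same = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orth = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

In production the embeddings come from the same model used at write time; mixing embedding models invalidates the 0.85 threshold.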
async def check_memory_conflict(new_memory: str, existing_memories: list,
                                similarity_threshold: float = 0.85) -> dict:
    if not existing_memories:
        return {"action": "ADD", "conflict_memory_id": None}
    new_embedding = await embedder.aembed_query(new_memory)
    for mem in existing_memories:
        similarity = cosine_similarity(new_embedding, mem["embedding"])
        if similarity > similarity_threshold:
            # High similarity alone is ambiguous; let the LLM confirm it is an update.
            is_update = await llm_confirm_update(new_memory, mem["content"])
            if is_update:
                return {"action": "UPDATE", "conflict_memory_id": mem["memory_id"]}
    return {"action": "ADD", "conflict_memory_id": None}
TTL and Expiration
import time
import uuid

# Add a memory with TTL (in days; -1 = permanent)
async def add_memory_with_ttl(content: str, user_id: str, ttl_days: int = -1):
    ttl_timestamp = -1
    if ttl_days > 0:
        ttl_timestamp = int(time.time()) + ttl_days * 86400
    memory_record = {
        "memory_id": str(uuid.uuid4()),
        "user_id": user_id,
        "content": content,
        "created_at": int(time.time()),
        "ttl": ttl_timestamp,
        "is_deleted": False
    }
    milvus_client.insert("agent_memory", memory_record)

# Daily cleanup task: soft-delete memories whose TTL has passed
async def cleanup_expired_memories():
    current_time = int(time.time())
    expired = milvus_client.query(
        collection_name="agent_memory",
        filter=f"ttl > 0 && ttl < {current_time} && is_deleted == false"
    )
    for mem in expired:
        mem["is_deleted"] = True
        milvus_client.upsert(collection_name="agent_memory", data=[mem])
Privacy & "Right to be Forgotten"
async def forget_user(user_id: str, operator: str, reason: str):
    # Fetch first so the audit log can record how many memories were removed.
    records = milvus_client.query(
        collection_name="agent_memory",
        filter=f"user_id == '{user_id}' && is_deleted == false"
    )
    for mem in records:
        mem["is_deleted"] = True
    if records:
        milvus_client.upsert(collection_name="agent_memory", data=records)
    audit_log = {
        "operation": "USER_FORGET",
        "user_id": user_id,
        "operator": operator,
        "reason": reason,
        "timestamp": int(time.time()),
        "memory_count": len(records)
    }
    audit_db.insert(audit_log)
    return {"status": "success", "message": f"Deleted all memory data for user {user_id}"}
Audit logs remain for compliance while the actual memory content is soft‑deleted.
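Soft deletion only protects users if every read path excludes flagged rows. A small guard helper (an assumed convention, not a Milvus API) makes that hard to forget:

```python
# Append the soft-delete guard to any filter expression so no query path
# can accidentally surface forgotten memories.
def with_active_guard(filter_expr: str) -> str:
    guard = "is_deleted == false"
    return f"({filter_expr}) && {guard}" if filter_expr else guard

expr = with_active_guard("user_id == 'user_001'")
```

Routing every `query`/`search` call through this wrapper is cheaper than auditing each call site individually.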
Memory Retrieval & Prompt Injection
async def retrieve_memory(query: str, user_id: str) -> str:
    relevant_memories = memory.search(query=query, user_id=user_id, limit=5)
    if not relevant_memories:
        return ""
    memory_context = "\n".join(
        f"- {m['memory']} (recorded at {m['created_at']})" for m in relevant_memories
    )
    return ("[User history] The following are this user's past preferences and "
            f"key facts; consult them when answering:\n{memory_context}")

async def chat(user_message: str, user_id: str, session_id: str):
    memory_context = await retrieve_memory(user_message, user_id)
    system_prompt = BASE_SYSTEM_PROMPT
    if memory_context:
        system_prompt += "\n" + memory_context
    response = await llm.ainvoke(messages=[
        {"role": "system", "content": system_prompt},
        *short_term_memory.get_context(),
        {"role": "user", "content": user_message}
    ])
    short_term_memory.add_message("user", user_message)
    short_term_memory.add_message("assistant", response.content)
    return response.content
Key points: always filter by user_id before the vector search, cap the number of injected memories to fit the prompt budget, and attach timestamps so the LLM can judge recency.
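The prompt-budget point can be sketched as a greedy cutoff over the ranked results; `select_memories_within_budget` is a hypothetical helper using character count as a rough proxy for tokens:

```python
def select_memories_within_budget(memories: list, max_chars: int) -> list:
    # Memories arrive ranked by relevance; keep the top ones until the
    # character budget (a rough stand-in for the token budget) is exhausted.
    selected, used = [], 0
    for m in memories:
        cost = len(m["memory"])
        if used + cost > max_chars:
            break
        selected.append(m)
        used += cost
    return selected

ranked = [{"memory": "a" * 40}, {"memory": "b" * 40}, {"memory": "c" * 40}]
kept = select_memories_within_budget(ranked, max_chars=100)
```

A real deployment would count tokens with the model's tokenizer, but the greedy shape is the same.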
Four‑Step Memory Management Process
Define : Identify valuable information (preferences, basic info, key decisions, actionable context).
Write : Extract with LLM and store in the appropriate partition (episodic → vector DB, semantic → shared DB, procedural → system prompt).
Manage : Before writing, run conflict detection to choose ADD/UPDATE/DELETE/NOOP; apply TTL for time‑sensitive data; handle soft deletion for compliance.
Read : At conversation start, retrieve top‑k relevant long‑term memories, format them, inject into the system prompt, then combine with short‑term context and call the LLM.
Interview Answer Blueprint
When asked about Agent Memory design, structure your response around the five layers presented: classification of memory types, dual‑layer architecture (short‑term + long‑term), management operations (Mem0 actions), retrieval injection, and compliance considerations.
Common Pitfall: Mixing Memory with RAG
RAG retrieves generic knowledge from a knowledge base, while Memory retrieves user‑specific historical facts. Both are needed but serve distinct purposes and should not be conflated.
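One way to keep the two sources distinct in practice is to render them as separately labeled prompt sections; `compose_context` is an illustrative helper, not an API from the article:

```python
def compose_context(knowledge_chunks: list, user_memories: list) -> str:
    # Label the two sources so the model can tell shared product knowledge
    # (RAG) apart from this specific user's history (Memory).
    parts = []
    if knowledge_chunks:
        parts.append("[Knowledge base]\n" + "\n".join(f"- {c}" for c in knowledge_chunks))
    if user_memories:
        parts.append("[User history]\n" + "\n".join(f"- {m}" for m in user_memories))
    return "\n\n".join(parts)

ctx = compose_context(
    ["Minimum purchase for corporate wealth products: 1M RMB"],
    ["User prefers low-risk products"],
)
```

The two retrievers also stay separate at the storage level: the knowledge base is shared and versioned, while memories are per-user and mutable.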
Summary
The design can be captured in three dimensions:
Classification : Semantic, Episodic, Procedural – each with its own storage and retrieval strategy.
Architecture : Short‑term sliding window + Long‑term vector DB partitioned by user_id.
Management : Conflict detection (ADD/UPDATE/DELETE/NOOP), TTL‑based expiration, and soft‑delete with audit logs for the right‑to‑be‑forgotten.
Answering the interview questions about where memory lives, how it is retrieved, when it is updated, and when it is deleted becomes straightforward with this framework.
Wu Shixiong's Large Model Academy
We continuously share large‑model know‑how — LLMs, RAG, fine‑tuning, deployment — helping career‑switchers, students in autumn campus recruitment, and anyone seeking a stable large‑model position go from zero to job offer.