How to Build a Robust Agent Memory System: Architecture, Management, and Evaluation
This article provides a comprehensive guide to designing, implementing, and evaluating an Agent Memory module for large‑language‑model assistants, covering memory types, short‑ and long‑term storage, conflict resolution, hybrid retrieval, compliance, and practical interview answers.
Why Agents Need Memory
In a corporate‑client banking assistant, users have multi‑turn interactions that require the system to retain preferences, qualifications, and historical questions. Without memory, users must repeat information, increasing dialogue rounds and reducing satisfaction. LLMs are stateless functions with limited context windows and no cross‑session memory, so a memory system upgrades an LLM from a stateless function to a stateful agent.
Cognitive‑Science View: Three Memory Types
Semantic Memory
General world knowledge not tied to a specific time or person (e.g., product descriptions, regulatory policies, common Q&A). Stored in a shared vector database and retrieved by semantic similarity.
"Ping An Bank corporate wealth product minimum investment is 1 million CNY, minimum holding period 30 days."Episodic Memory
Specific events bound to a user and time (e.g., individual user inquiries about insurance coverage, company capital, or product preferences). Must be stored per‑user and filtered by user_id before similarity search.
Procedural Memory
Operational rules and workflows (the "how to"). Implemented as system prompts injected at the start of each conversation; updates are made by modifying the prompt, not by retrieval.
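As a sketch of what "injected at the start of each conversation" means in practice, the snippet below prepends procedural rules as a system message before the dialogue history. The rule text and the build_messages helper are illustrative, not from a specific codebase:

```python
# Procedural memory as a system prompt, rebuilt at the start of each session.
# PROCEDURAL_RULES and build_messages are illustrative names.
PROCEDURAL_RULES = """You are a corporate-banking assistant.
- Always confirm the user's risk tolerance before recommending products.
- Never quote rates without citing the effective date."""

def build_messages(history: list, user_input: str) -> list:
    """Prepend the procedural rules, then the rolling dialogue window."""
    return ([{"role": "system", "content": PROCEDURAL_RULES}]
            + history
            + [{"role": "user", "content": user_input}])
```

Updating procedural memory then means editing PROCEDURAL_RULES, with no retrieval step involved.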
Short‑Term vs Long‑Term Memory
Short‑Term Memory (Conversation Window)
Keeps the most recent N dialogue turns in the LLM context window.
```python
class ShortTermMemory:
    def __init__(self, window_size: int = 10):
        self.window_size = window_size
        self.messages = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # One turn = a user message plus an assistant message, hence * 2.
        if len(self.messages) > self.window_size * 2:
            self.messages = self.messages[-self.window_size * 2:]

    def get_context(self) -> list:
        return self.messages
```

Short‑term memory disappears after the session ends, which is why long‑term storage is required.
Long‑Term Memory (Vector Database)
Implemented with Milvus. Records are partitioned by memory type and user_id. Example schema:
```python
schema = {
    "collection_name": "agent_memory",
    "fields": [
        {"name": "memory_id", "type": "VARCHAR", "max_length": 64},
        {"name": "user_id", "type": "VARCHAR", "max_length": 64},
        {"name": "memory_type", "type": "VARCHAR", "max_length": 32},
        {"name": "content", "type": "VARCHAR", "max_length": 2048},
        {"name": "embedding", "type": "FLOAT_VECTOR", "dim": 1536},
        {"name": "created_at", "type": "INT64"},
        {"name": "importance_score", "type": "FLOAT"},
        {"name": "ttl", "type": "INT64"},
        {"name": "is_deleted", "type": "BOOL"}
    ]
}
```

Semantic memories use a global user_id (e.g., "global") shared across users; episodic memories bind the actual user ID for isolation.
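A minimal sketch of how both memory types can be written into this schema. The `milvus_client` and `embed` parameters are placeholders for a pymilvus-style client and an embedding helper; neither name comes from the article:

```python
import time
import uuid

GLOBAL_USER = "global"  # shared partition key for semantic memories

def insert_memory(milvus_client, embed, content, memory_type, user_id=None):
    # Semantic facts are stored under the shared user_id; episodic facts
    # keep the real user_id so similarity search can be filtered per user.
    record = {
        "memory_id": str(uuid.uuid4()),
        "user_id": GLOBAL_USER if memory_type == "semantic" else user_id,
        "memory_type": memory_type,
        "content": content,
        "embedding": embed(content),
        "created_at": int(time.time()),
        "importance_score": 5.0,   # default; overwritten by the extractor
        "ttl": -1,                 # -1 = never expires
        "is_deleted": False,
    }
    milvus_client.insert("agent_memory", [record])
    return record
```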
Memory Extraction at Conversation End
After each session, a prompt extracts valuable episodic facts into JSON, assigning an importance score (1‑10) that later influences retrieval weighting.
MEMORY_EXTRACTION_PROMPT = """
You are a memory extraction assistant. Extract from the conversation only:
1. Explicit user preferences
2. Basic user info (company size, industry, location)
3. Important decisions or requirement changes
4. Key events useful for future queries
Do NOT extract trivial Q&A or system replies.
Conversation:
{conversation}
Return JSON with fields: content, memory_type, importance (1‑10).
"""Key Research Papers
Generative Agents (2023)
Introduces Memory Stream, Reflection, and Planning. Two engineering‑relevant components:
Importance scoring: LLM rates each new memory 1‑10; retrieval combines semantic similarity, recency decay, and importance.
```python
def compute_retrieval_score(memory, query_embedding, current_time, decay_rate=0.995):
    semantic_score = cosine_similarity(query_embedding, memory["embedding"])
    hours_passed = (current_time - memory["created_at"]) / 3600
    recency_score = decay_rate ** hours_passed
    importance_score = memory["importance_score"] / 10.0
    alpha, beta, gamma = 0.4, 0.3, 0.3
    return alpha * semantic_score + beta * recency_score + gamma * importance_score
```

Reflection triggers when accumulated importance exceeds a threshold, prompting the LLM to summarize multiple episodic memories into higher‑level abstractions.
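The reflection trigger can be sketched as follows; the threshold value and the `summarize_llm` callable are illustrative assumptions, not taken from the paper's code:

```python
REFLECTION_THRESHOLD = 30.0  # illustrative; tune per application

def maybe_reflect(recent_memories, summarize_llm):
    """When recent memories accumulate enough importance, distill them
    into one higher-level 'reflection' memory; otherwise do nothing."""
    total = sum(m["importance_score"] for m in recent_memories)
    if total < REFLECTION_THRESHOLD:
        return None
    facts = "\n".join(m["content"] for m in recent_memories)
    summary = summarize_llm(
        f"Summarize these observations into one high-level insight:\n{facts}"
    )
    return {"content": summary, "memory_type": "reflection",
            "importance_score": 8.0}
```

The returned reflection is stored like any other memory, so later retrieval can surface the abstraction instead of many raw episodes.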
MemGPT (2023)
Applies virtual‑memory concepts to LLMs, providing self‑managed read/write calls such as store_memory(content) and recall_memory(query).
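A toy sketch of this self-managed loop, assuming the LLM emits JSON-shaped tool calls that a dispatcher executes. Only the store_memory/recall_memory names come from the article; the dispatcher and in-memory archive are illustrative:

```python
class MemoryTools:
    """Stand-in for MemGPT's external (out-of-context) storage tier."""
    def __init__(self):
        self.archive = []

    def store_memory(self, content: str) -> str:
        self.archive.append(content)
        return "stored"

    def recall_memory(self, query: str) -> list:
        # Real MemGPT searches by embedding; substring match keeps this runnable.
        return [c for c in self.archive if query.lower() in c.lower()]

def dispatch(tools: MemoryTools, call: dict):
    """Execute one tool call emitted by the LLM,
    e.g. {"name": "store_memory", "args": {"content": "..."}}."""
    return getattr(tools, call["name"])(**call["args"])
```

The key idea is that the model, not the host application, decides when to page information in and out of its limited context window.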
Mem0 (2025)
Shows production‑grade gains: 26% higher accuracy, 91% lower latency, and >90% token cost reduction versus a baseline OpenAI approach, thanks to intelligent deduplication and compression.
Mem0 Framework: Four Memory Operations
```python
from mem0 import Memory

memory = Memory()

# ADD
memory.add("User prefers low-risk products, rejects stocks", user_id="user_001")

# UPDATE
memory.update(memory_id="mem_xxx", data="User upgraded to VIP, credit limit 200k")

# DELETE
memory.delete(memory_id="mem_xxx")  # e.g., after account closure

# NOOP – handled automatically when content is unchanged
```

The framework decides the operation by comparing new info with existing memories using semantic similarity (>0.85) and LLM confirmation.
MEMORY_DECISION_PROMPT = """
You are a memory manager. Determine the action for new info:
- ADD: brand‑new information
- UPDATE: same entity, changed content
- DELETE: information is now invalid
- NOOP: identical to existing memory
Provide JSON: {"action": "...", "target_memory_id": "...", "reason": "..."}
"""Handling Memory Conflicts
When a user changes a fact (e.g., number of children), the system runs a semantic‑similarity check; if similarity > 0.85, it treats the update as UPDATE rather than adding a contradictory record.
```python
async def check_memory_conflict(new_memory, existing_memories, similarity_threshold=0.85):
    if not existing_memories:
        return {"action": "ADD", "conflict_memory_id": None}
    new_emb = await embedder.aembed_query(new_memory)
    for mem in existing_memories:
        if cosine_similarity(new_emb, mem["embedding"]) > similarity_threshold:
            if await llm_confirm_update(new_memory, mem["content"]):
                return {"action": "UPDATE", "conflict_memory_id": mem["memory_id"]}
    return {"action": "ADD", "conflict_memory_id": None}
```

TTL (Time‑to‑Live) Management
Time‑sensitive facts receive a TTL; a daily cleanup job marks expired records as soft‑deleted.
```python
import time
import uuid

def add_memory_with_ttl(content, user_id, ttl_days=-1):
    ttl_timestamp = -1  # -1 = never expires
    if ttl_days > 0:
        ttl_timestamp = int(time.time()) + ttl_days * 86400
    record = {
        "memory_id": str(uuid.uuid4()),
        "user_id": user_id,
        "content": content,
        "created_at": int(time.time()),
        "ttl": ttl_timestamp,
        "is_deleted": False,
    }
    milvus_client.insert("agent_memory", record)

async def cleanup_expired_memories():
    now = int(time.time())
    expired = milvus_client.query(
        collection_name="agent_memory",
        filter=f"ttl > 0 && ttl < {now} && is_deleted == false"
    )
    for mem in expired:
        milvus_client.update(
            collection_name="agent_memory",
            filter=f"memory_id == '{mem['memory_id']}'",
            data={"is_deleted": True}
        )
```

Privacy & "Right to be Forgotten"
For compliance, a soft‑delete flag plus an immutable audit log satisfy GDPR‑like requirements.
```python
import time

async def forget_user(user_id, operator, reason):
    # Soft-delete every memory belonging to the user...
    milvus_client.update(
        collection_name="agent_memory",
        filter=f"user_id == '{user_id}' && is_deleted == false",
        data={"is_deleted": True}
    )
    # ...and record who requested the erasure, and why.
    audit_log = {
        "operation": "USER_FORGET",
        "user_id": user_id,
        "operator": operator,
        "reason": reason,
        "timestamp": int(time.time()),
    }
    audit_db.insert(audit_log)
    return {"status": "success", "message": f"Deleted all memories of {user_id}"}
```

Hybrid Retrieval Formula
Combines semantic similarity, recency decay, and importance weighting.
```python
import time

def hybrid_memory_retrieval(query, user_id, top_k=5, alpha=0.4, beta=0.3, gamma=0.3):
    # Over-fetch candidates by semantic similarity, then re-rank hybridly.
    results = milvus_client.search(
        collection_name="agent_memory",
        data=[get_embedding(query)],
        filter=f"user_id == '{user_id}' && is_deleted == false",
        limit=top_k * 3,
        output_fields=["memory_id", "content", "created_at", "importance_score"]
    )
    now = time.time()
    scored = []
    for hit in results[0]:  # one result list per query vector
        mem = hit["entity"]
        semantic_score = hit["distance"]  # similarity under a COSINE/IP metric
        recency = 0.995 ** ((now - mem["created_at"]) / 3600)
        importance = mem["importance_score"] / 10.0
        final = alpha * semantic_score + beta * recency + gamma * importance
        scored.append((mem, final))
    scored.sort(key=lambda x: x[1], reverse=True)
    return [m for m, _ in scored[:top_k]]
```

This reduces retrieval of outdated but semantically similar memories.
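A worked example of why the recency term matters; the numbers are illustrative, but they show a month-old memory losing to a fresher, slightly less similar one under the same weights:

```python
def hybrid_score(semantic, hours_old, importance, alpha=0.4, beta=0.3, gamma=0.3):
    # Same formula as above: weighted similarity + recency decay + importance.
    recency = 0.995 ** hours_old
    return alpha * semantic + beta * recency + gamma * (importance / 10.0)

# A month-old memory with higher similarity loses to a two-hour-old one:
stale = hybrid_score(semantic=0.95, hours_old=24 * 30, importance=5)
fresh = hybrid_score(semantic=0.85, hours_old=2, importance=5)
assert fresh > stale
```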
Memory Strength Update (Ebbinghaus Forgetting)
Each successful retrieval slightly boosts the importance score; unused memories decay daily and are eventually soft‑deleted.
```python
import time

def update_memory_strength(memory_id, was_retrieved):
    mem = milvus_client.get(memory_id)
    if was_retrieved:
        # Reinforcement: each successful recall strengthens the memory.
        milvus_client.update(memory_id, {
            "last_accessed": time.time(),
            "importance_score": min(10.0, mem["importance_score"] + 0.5)
        })
    else:
        # Decay: unused memories weaken along an Ebbinghaus-style curve.
        days = (time.time() - mem["last_accessed"]) / 86400
        new_imp = mem["importance_score"] * (0.95 ** days)
        if new_imp < 1.0:
            milvus_client.update(memory_id, {"is_deleted": True})
        else:
            milvus_client.update(memory_id, {"importance_score": new_imp})
```

Evaluation with LOCOMO Benchmark
Four dimensions are measured:
Retrieval accuracy (Top‑5 recall)
Information timeliness (conflict detection rate)
Privacy isolation (zero cross‑user leakage)
Storage efficiency (ratio of useful memories)
In the banking project, hybrid retrieval raised Top‑5 recall from 71% to 87%, and conflict detection reached 92%.
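Top‑5 recall can be measured with a small harness like the one below. The test-case format and the `retrieve` callable are assumptions for illustration, since LOCOMO's own tooling is not shown here:

```python
def top_k_recall(test_cases, retrieve, k=5):
    """test_cases: [{'query': str, 'relevant_ids': set}, ...]
    retrieve(query, k) -> ranked list of memory_id strings."""
    hits, total = 0, 0
    for case in test_cases:
        retrieved = set(retrieve(case["query"], k))
        hits += len(retrieved & case["relevant_ids"])
        total += len(case["relevant_ids"])
    return hits / total if total else 0.0
```

Running the same harness against pure semantic search and against the hybrid ranker makes improvements like 71% → 87% directly comparable.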
Interview‑Ready Answer Framework
Classification: Explain semantic, episodic, and procedural memory.
Architecture: Short‑term sliding window + long‑term Milvus partitions with importance scores.
Management: ADD/UPDATE/DELETE/NOOP logic, TTL cleanup, forgetting curve, soft‑delete + audit for compliance.
Retrieval: Hybrid scoring (semantic + recency + importance) before injecting into the system prompt.
Compliance: Right‑to‑be‑forgotten implementation.
Evaluation: Cite LOCOMO metrics and observed improvements.
Wu Shixiong's Large Model Academy
We continuously share large‑model know‑how, helping you master core skills—LLM, RAG, fine‑tuning, deployment—from zero to job offer, tailored for career‑switchers, autumn recruiters, and those seeking stable large‑model positions.