Designing Agent Memory Systems: Four Types, Three Strategies, and Full Python Implementation
This article breaks down agentic memory into four distinct types—In‑context, External, Episodic, and Semantic/Parametric—explains three forgetting strategies (time decay, importance scoring, periodic consolidation), shows how memory flows through an agent loop, and provides complete Python code using OpenAI embeddings and ChromaDB for a working memory layer.
1. What is Agentic Memory?
Agentic memory is not a single component but a backstage system that combines different storage back‑ends, retrieval methods, and intelligent management strategies so an AI agent can retain continuity, context, and learning across interactions.
Continuity concerns identity: the agent knows who you are and what preferences you have. Context concerns the current task: recent actions, tools used, and results needed for the next step. Learning concerns improvement: understanding what works and avoiding repeated mistakes.
2. Four Memory Types
2.1 In‑context Memory
The context window is the agent’s workbench; everything inside can be accessed instantly during a single forward pass, without a separate retrieval step. However, the window has a fixed token budget, and it is cleared when the session ends.
System prompt: agent persona, rules, abilities, current date/user info
Conversation history: the back‑and‑forth of the current session
Tool call results: outputs from recently invoked tools
Retrieved memories: snippets pulled from external storage
Scratchpad: intermediate reasoning steps
Sliding‑window problem: long conversations overflow the token limit. Simple truncation loses important early context. Better strategies include summarization, selective retention of key facts, and offloading important items to external memory.
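A minimal sketch of the summarization strategy: keep the newest turns verbatim and fold everything older into a single summary message. The `summarize` callback is a stand‑in for an LLM call, injected here so the example runs without an API key.

```python
def trim_history(messages: list[dict], summarize, keep_last: int = 6) -> list[dict]:
    """Keep the newest `keep_last` turns verbatim; fold older turns into one summary message."""
    if len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    # In production this would be an LLM call; any str -> str callable works here
    summary = summarize(transcript)
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}] + recent

# Usage with a trivial stand-in summarizer
history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
trimmed = trim_history(history, summarize=lambda t: f"{len(t.splitlines())} earlier turns omitted")
# trimmed: one summary message followed by the 6 most recent turns
```

In practice the summary itself can be re-summarized as it grows, and any fact worth keeping long-term should also be written to external memory before it is compacted away.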
2.2 External Memory
External memory lives outside the model—databases, vector stores, key‑value stores, or files—and persists across sessions. Properly designed, it lets an agent remember events from months ago.
Structured storage (exact queries): PostgreSQL, Redis, SQLite. Fast, predictable, ideal for user profiles and structured data.
Vector store (semantic search): Pinecone, Chroma, pgvector. Retrieves items by similarity, crucial for unstructured notes and episodic recall.
Retrieval is the bottleneck: if the correct memory cannot be found, the agent behaves as if it never existed, so retrieval quality largely sets the ceiling on the whole system's performance.
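The difference between the two back‑ends can be shown with toy data. The 3‑dimensional vectors below are illustrative stand‑ins for real embeddings, and `semantic_lookup` is a brute‑force sketch of what a vector store does internally:

```python
import math

# Structured storage: exact key lookup -- fast and predictable, but the key must match exactly
profile = {"user:42:theme": "dark", "user:42:language": "en"}
theme = profile["user:42:theme"]

# Vector storage: nearest-neighbour lookup -- tolerant of paraphrase
notes = {
    "user prefers dark mode": [0.9, 0.1, 0.0],
    "meeting moved to Friday": [0.0, 0.2, 0.9],
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_lookup(query_vec: list[float]) -> str:
    """Return the stored note whose vector is most similar to the query."""
    return max(notes, key=lambda text: cosine(notes[text], query_vec))

# "what theme does the user like?" -- embedded close to the first note
best = semantic_lookup([0.8, 0.2, 0.1])
```

A dictionary lookup fails on any rewording of the key; the vector lookup still lands on the right note. Real deployments combine both: structured storage for facts with stable keys, vector search for everything phrased in natural language.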
2.3 Episodic Memory
Episodic memory stores concrete events—what the agent actually did and the outcome. A simple implementation is a structured log where each completed task is recorded as a JSON document.
{
"episode_id": "ep_20240315_003",
"timestamp": "2024-03-15T14:23:11Z",
"task": "Summarize 50-page PDF into 3 bullet points",
"approach": "Sequential chunking, 2000 tokens per chunk",
"outcome": "success",
"duration_ms": 4820,
"token_cost": 12400,
"quality_score": 0.91,
"notes": "Worked well. Hierarchical chunking would be faster.",
"embedding": [0.023, -0.441, 0.182, /* ... 1536 dims */]
}
When a new task arrives, the agent retrieves the most semantically similar episodes and uses them as few‑shot examples, rather than relying on a static dataset.
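A sketch of that injection step: format retrieved episode documents of the shape shown above into a prompt section. `episodes_to_fewshot` is an illustrative helper, not part of any library:

```python
def episodes_to_fewshot(episodes: list[dict], max_examples: int = 3) -> str:
    """Render past episodes as few-shot examples for the system prompt."""
    lines = ["Past episodes relevant to the current task:"]
    for ep in episodes[:max_examples]:
        lines.append(
            f"- Task: {ep['task']}\n"
            f"  Approach: {ep['approach']} -> {ep['outcome']} "
            f"(quality {ep['quality_score']:.2f})\n"
            f"  Lesson: {ep['notes']}"
        )
    return "\n".join(lines)

# Usage with the episode from the JSON example above
example = episodes_to_fewshot([{
    "task": "Summarize 50-page PDF into 3 bullet points",
    "approach": "Sequential chunking, 2000 tokens per chunk",
    "outcome": "success",
    "quality_score": 0.91,
    "notes": "Worked well. Hierarchical chunking would be faster.",
}])
```

Capping the count matters: each episode costs prompt tokens, so only the top few most similar episodes earn a place in the context window.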
2.4 Semantic/Parametric Memory
This is the knowledge baked into the model weights during pre‑training—world facts, language patterns, reasoning strategies, cultural knowledge. It is always available but has hard limits: the model cannot learn new facts after the training cutoff, cannot be updated without fine‑tuning, is opaque, and may hallucinate.
For time‑sensitive, domain‑specific, or private information, rely on external, episodic, or in‑context memory; treat parametric memory as a fallback for general world knowledge.
Correct mental model: parametric memory is the agent’s general education, while external, episodic, and in‑context memories are its on‑the‑job experience. The best agents combine both.
3. Memory Flow in the Agent Loop
Each request follows these steps:
Retrieve relevant memories (semantic search) and similar past episodes.
Inject the retrieved context into the system prompt.
Call the LLM to generate a response.
Store the interaction and episode for future use.
Memory operations wrap the LLM call: first retrieve, then write back. The model itself remains stateless; the memory layer gives the illusion of state.
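The four steps reduce to a thin wrapper around a stateless model call. In this sketch `retrieve`, `llm`, and `store` are injected callables (stubbed below so the skeleton runs without API keys); the full implementation with real back‑ends follows in section 4.

```python
def memory_wrapped_call(user_message: str, retrieve, llm, store) -> str:
    """Retrieve -> inject -> generate -> write back; the model itself holds no state."""
    context = retrieve(user_message)                      # step 1: pull relevant memories
    system = "You are a helpful agent."
    if context:
        system += f"\n\n## Relevant memories\n{context}"  # step 2: inject into system prompt
    answer = llm(system=system, user=user_message)        # step 3: stateless model call
    store(f"User asked: {user_message}")                  # step 4: write back for next time
    return answer

# Stub dependencies to show the flow end to end
log: list[str] = []
answer = memory_wrapped_call(
    "What theme do I prefer?",
    retrieve=lambda q: "- user prefers dark mode",
    llm=lambda system, user: f"(answer given {len(system)} chars of system prompt)",
    store=log.append,
)
```

Because all state lives behind `retrieve` and `store`, the LLM can be swapped or restarted freely; the memory layer alone provides the illusion of continuity.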
4. Building the Memory Layer (Python)
4.1 MemoryStore class
import chromadb
from openai import OpenAI
from datetime import datetime
import json, uuid
class MemoryStore:
"""Persistent vector memory for an AI agent."""
def __init__(self, agent_id: str, persist_dir: str = "./memory_db"):
self.agent_id = agent_id
self.openai = OpenAI()
# ChromaDB stores vectors on disk, persists across restarts
self.client = chromadb.PersistentClient(path=persist_dir)
self.collection = self.client.get_or_create_collection(
name=f"agent_{agent_id}_memories",
metadata={"hnsw:space": "cosine"}
)
def _embed(self, text: str) -> list[float]:
"""Convert text to embedding vector using OpenAI."""
response = self.openai.embeddings.create(model="text-embedding-3-small", input=text)
return response.data[0].embedding
def remember(self, content: str, memory_type: str = "general", metadata: dict | None = None) -> str:
"""Store a memory. Returns the memory ID."""
memory_id = str(uuid.uuid4())
embedding = self._embed(content)
meta = {
"type": memory_type,
"timestamp": datetime.utcnow().isoformat(),
"agent_id": self.agent_id,
**(metadata or {})
}
self.collection.add(ids=[memory_id], embeddings=[embedding], documents=[content], metadatas=[meta])
return memory_id
def recall(self, query: str, k: int = 5, memory_type: str | None = None, min_relevance: float = 0.6) -> list[dict]:
"""Retrieve the k most relevant memories for a query."""
query_embedding = self._embed(query)
where = {"type": memory_type} if memory_type else None
results = self.collection.query(
query_embeddings=[query_embedding],
n_results=k,
where=where,
include=["documents", "metadatas", "distances"]
)
memories = []
for doc, meta, dist in zip(
results["documents"][0],
results["metadatas"][0],
results["distances"][0]
):
relevance = 1 - dist # cosine distance → similarity
if relevance >= min_relevance:
memories.append({"content": doc, "metadata": meta, "relevance": round(relevance, 3)})
return sorted(memories, key=lambda x: x["relevance"], reverse=True)
def forget(self, memory_id: str):
"""Delete a specific memory (GDPR compliance, stale data, etc.)"""
self.collection.delete(ids=[memory_id])
4.2 EpisodicLogger class
from .store import MemoryStore
from dataclasses import dataclass, asdict
from typing import Optional
import time
@dataclass
class Episode:
task: str
approach: str
outcome: str # "success" | "partial" | "failure"
duration_ms: int
token_cost: int
quality_score: float # 0.0 – 1.0
notes: str = ""
error: Optional[str] = None
class EpisodicLogger:
def __init__(self, memory_store: MemoryStore):
self.store = memory_store
def log(self, episode: Episode):
"""Save an episode to memory as a searchable document."""
doc = (
    f"Task: {episode.task}\n"
    f"Approach: {episode.approach}\n"
    f"Outcome: {episode.outcome}\n"
    f"Notes: {episode.notes}"
)
self.store.remember(
content=doc,
memory_type="episode",
metadata={
"outcome": episode.outcome,
"quality_score": episode.quality_score,
"duration_ms": episode.duration_ms,
"token_cost": episode.token_cost,
},
)
def recall_similar(self, task: str, k: int = 3) -> list[dict]:
"""Find past episodes similar to the current task."""
return self.store.recall(query=task, k=k, memory_type="episode", min_relevance=0.65)
4.3 Memory‑augmented Agent
import anthropic
from memory.store import MemoryStore
from memory.episodic import EpisodicLogger, Episode
import time
class MemoryAugmentedAgent:
def __init__(self, agent_id: str):
self.client = anthropic.Anthropic()
self.memory = MemoryStore(agent_id)
self.episodes = EpisodicLogger(self.memory)
def _build_memory_context(self, user_message: str) -> str:
"""Retrieve relevant memories and format them for injection."""
memories = self.memory.recall(user_message, k=4)
episodes = self.episodes.recall_similar(user_message, k=2)
parts = []
if memories:
parts.append("## Relevant memories\n" + "\n".join(
    f"- [{m['metadata']['type']}] {m['content']} (relevance: {m['relevance']})"
    for m in memories
))
if episodes:
parts.append("## Past similar tasks\n" + "\n".join(
    f"- {e['content'][:200]}..." for e in episodes
))
return "\n".join(parts) if parts else ""
def run(self, user_message: str) -> str:
start = time.time()
memory_context = self._build_memory_context(user_message)
system = """You are a helpful agent with memory.
You have access to relevant context from past interactions.
Use this context to give better, more personalized responses.
"""
if memory_context:
system += f"\n{memory_context}"
response = self.client.messages.create(
model="claude-opus-4-6",
max_tokens=1024,
system=system,
messages=[{"role": "user", "content": user_message}],
)
answer = response.content[0].text
duration = int((time.time() - start) * 1000)
# Store the interaction
self.memory.remember(content=f"User asked: {user_message[:200]}", memory_type="interaction")
# Log the episode
self.episodes.log(Episode(
task=user_message[:200],
approach="single-turn with memory retrieval",
outcome="success",
duration_ms=duration,
token_cost=response.usage.input_tokens + response.usage.output_tokens,
quality_score=1.0,
))
return answer
5. Vector Database
5.1 Similarity Search Principle
Each memory is turned into a 1,536‑dimensional float vector using OpenAI’s embedding model. Similar texts produce similar vectors. At query time the system embeds the query and finds the nearest vectors by cosine similarity.
import numpy as np
def cosine_similarity(a: list, b: list) -> float:
"""1.0 = identical meaning, 0.0 = unrelated, -1.0 = opposite meaning"""
a, b = np.array(a), np.array(b)
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Example
embedding_a = embed("The user prefers dark mode")
embedding_b = embed("They like their interface theme to be dark")
score = cosine_similarity(embedding_a, embedding_b)  # → ~0.91
Local development uses ChromaDB. For production you may switch to pgvector (if using Postgres), or to managed services like Pinecone or Qdrant for larger scale.
6. Memory Management
6.1 Time‑based Decay
Older memories are usually less relevant. The following scoring function, inspired by the Generative Agents paper (Park et al., 2023), combines relevance, importance, and recency.
import math
from datetime import datetime
def memory_score(
relevance: float, # cosine similarity 0–1
importance: float, # stored at write time 0–1
created_at: datetime, # when memory was formed
recency_weight: float = 0.3,
decay_factor: float = 0.995,
) -> float:
"""Balance relevance, importance, and recency."""
hours_old = (datetime.utcnow() - created_at).total_seconds() / 3600
recency = math.pow(decay_factor, hours_old)
return (
relevance * 0.4 +
importance * 0.3 +
recency * recency_weight
)
6.2 Importance Scoring at Write‑time
When storing a memory, the agent asks the LLM to rate its importance on a 0.0–1.0 scale and only keeps high‑scoring items.
import re
async def score_importance(client, content: str) -> float:
"""Ask the LLM if the information is worth saving (0.0‑1.0)."""
prompt = f"""Rate the importance of saving this for future interactions.
0.0 = trivial (greeting)
0.5 = moderately useful
1.0 = critical (preferences, errors, decisions)
Information: {content}
Reply with ONLY the number."""
try:
response = await client.messages.create(
model="claude-3-haiku-20240307",
max_tokens=10,
messages=[{"role": "user", "content": prompt}],
)
text = response.content[0].text.strip()
match = re.search(r"[-+]?\d*\.\d+|\d+", text)
if match:
score = float(match.group())
return max(0.0, min(1.0, score))
except Exception:
pass
return 0.5  # fallback
6.3 Periodic Consolidation
Every night a task merges near‑duplicate memories into a single concise summary, similar to human sleep‑time memory consolidation.
async def consolidate_memories(store: MemoryStore, similarity_threshold: float = 0.92):
"""Efficiently merge near‑duplicate memories using vector search."""
all_mems = store.collection.get(include=["documents", "embeddings"])  # ids are always returned
if not all_mems["ids"]:
return
visited = set()
consolidated = []
for mem_id, doc, emb in zip(all_mems["ids"], all_mems["documents"], all_mems["embeddings"]):
if mem_id in visited:
continue
results = store.collection.query(
query_embeddings=[emb],
n_results=10,
include=["documents", "distances"],
)
group = [doc]
visited.add(mem_id)
for res_id, res_doc, dist in zip(
results["ids"][0], results["documents"][0], results["distances"][0]
):
sim = 1.0 - dist
if res_id != mem_id and res_id not in visited and sim >= similarity_threshold:
group.append(res_doc)
visited.add(res_id)
if len(group) > 1:
summary = await summarize_group(group) # assumed external summarizer
consolidated.append(summary)
else:
consolidated.append(doc)
store.collection.delete(ids=all_mems["ids"])  # clear the old records before re-inserting
for doc in consolidated:
store.remember(doc)  # remember() is synchronous
7. Conclusion
Without a memory layer an agent starts each interaction from a blank slate. A well‑designed memory system—deciding what to remember, what to forget, and how to retrieve—enables the agent to retain identity, maintain context, and continuously learn, dramatically narrowing the gap between a stateless chatbot and a truly intelligent, evolving assistant.
AI Tech Publishing
In a fast-evolving AI era, we thoroughly explain the stable technical foundations.