Designing Agent Memory Systems: Four Types, Three Strategies, and Full Python Implementation
This article breaks down agentic memory into four distinct types—In‑context, External, Episodic, and Semantic/Parametric—explains three forgetting strategies (time decay, importance scoring, periodic consolidation), shows how memory flows through an agent loop, and provides complete Python code using OpenAI embeddings and ChromaDB for a working memory layer.
1. What is Agentic Memory?
Agentic memory is not a single component but a backstage system that combines different storage back‑ends, retrieval methods, and intelligent management strategies so an AI agent can retain continuity, context, and learning across interactions.
Continuity concerns identity: the agent knows who you are and what preferences you have. Context concerns the current task: recent actions, tools used, and results needed for the next step. Learning concerns improvement: understanding what works and avoiding repeated mistakes.
2. Four Memory Types
2.1 In‑context Memory
The context window is the agent’s workbench; everything inside can be accessed instantly during a single forward pass, without a separate retrieval step. However, the window has a fixed token budget, and it is cleared when the session ends.
System prompt: agent persona, rules, abilities, current date/user info
Conversation history: the back‑and‑forth of the current session
Tool call results: outputs from recently invoked tools
Retrieved memories: snippets pulled from external storage
Scratchpad: intermediate reasoning steps
Sliding‑window problem: long conversations overflow the token limit. Simple truncation loses important early context. Better strategies include summarization, selective retention of key facts, and offloading important items to external memory.
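A minimal sketch of the summarization strategy: keep the newest turns verbatim and fold everything older into a single summary message. The `summarize` callback is a stand‑in for an LLM call, injected here so the example runs without an API key.

```python
def trim_history(messages: list[dict], summarize, keep_last: int = 6) -> list[dict]:
    """Keep the newest `keep_last` turns verbatim; fold older turns into one summary message."""
    if len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    # In production this would be an LLM call; any str -> str callable works here
    summary = summarize(transcript)
    return [{"role": "system", "content": f"Summary of earlier conversation: {summary}"}] + recent

# Usage with a trivial stand-in summarizer
history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
trimmed = trim_history(history, summarize=lambda t: f"{len(t.splitlines())} earlier turns omitted")
# trimmed: one summary message followed by the 6 most recent turns
```

In practice the summary itself can be re-summarized as it grows, and any fact worth keeping long-term should also be written to external memory before it is compacted away.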
2.2 External Memory
External memory lives outside the model—databases, vector stores, key‑value stores, or files—and persists across sessions. Properly designed, it lets an agent remember events from months ago.
Structured storage (exact queries): PostgreSQL, Redis, SQLite. Fast, predictable, ideal for user profiles and structured data.
Vector store (semantic search): Pinecone, Chroma, pgvector. Retrieves items by similarity, crucial for unstructured notes and episodic recall.
Retrieval is the bottleneck: if the correct memory cannot be found, the agent behaves as if it never existed, so retrieval quality largely sets the ceiling on the whole system's performance.
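The difference between the two back‑ends can be shown with toy data. The 3‑dimensional vectors below are illustrative stand‑ins for real embeddings, and `semantic_lookup` is a brute‑force sketch of what a vector store does internally:

```python
import math

# Structured storage: exact key lookup -- fast and predictable, but the key must match exactly
profile = {"user:42:theme": "dark", "user:42:language": "en"}
theme = profile["user:42:theme"]

# Vector storage: nearest-neighbour lookup -- tolerant of paraphrase
notes = {
    "user prefers dark mode": [0.9, 0.1, 0.0],
    "meeting moved to Friday": [0.0, 0.2, 0.9],
}

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def semantic_lookup(query_vec: list[float]) -> str:
    """Return the stored note whose vector is most similar to the query."""
    return max(notes, key=lambda text: cosine(notes[text], query_vec))

# "what theme does the user like?" -- embedded close to the first note
best = semantic_lookup([0.8, 0.2, 0.1])
```

A dictionary lookup fails on any rewording of the key; the vector lookup still lands on the right note. Real deployments combine both: structured storage for facts with stable keys, vector search for everything phrased in natural language.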
2.3 Episodic Memory
Episodic memory stores concrete events—what the agent actually did and the outcome. A simple implementation is a structured log where each completed task is recorded as a JSON document.
{
"episode_id": "ep_20240315_003",
"timestamp": "2024-03-15T14:23:11Z",
"task": "Summarize 50-page PDF into 3 bullet points",
"approach": "Sequential chunking, 2000 tokens per chunk",
"outcome": "success",
"duration_ms": 4820,
"token_cost": 12400,
"quality_score": 0.91,
"notes": "Worked well. Hierarchical chunking would be faster.",
"embedding": [0.023, -0.441, 0.182, /* ... 1536 dims */]
}
When a new task arrives, the agent retrieves the most semantically similar episodes and uses them as few‑shot examples, rather than relying on a static dataset.
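A sketch of that injection step: format retrieved episode documents of the shape shown above into a prompt section. `episodes_to_fewshot` is an illustrative helper, not part of any library:

```python
def episodes_to_fewshot(episodes: list[dict], max_examples: int = 3) -> str:
    """Render past episodes as few-shot examples for the system prompt."""
    lines = ["Past episodes relevant to the current task:"]
    for ep in episodes[:max_examples]:
        lines.append(
            f"- Task: {ep['task']}\n"
            f"  Approach: {ep['approach']} -> {ep['outcome']} "
            f"(quality {ep['quality_score']:.2f})\n"
            f"  Lesson: {ep['notes']}"
        )
    return "\n".join(lines)

# Usage with the episode from the JSON example above
example = episodes_to_fewshot([{
    "task": "Summarize 50-page PDF into 3 bullet points",
    "approach": "Sequential chunking, 2000 tokens per chunk",
    "outcome": "success",
    "quality_score": 0.91,
    "notes": "Worked well. Hierarchical chunking would be faster.",
}])
```

Capping the count matters: each episode costs prompt tokens, so only the top few most similar episodes earn a place in the context window.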
2.4 Semantic/Parametric Memory
This is the knowledge baked into the model weights during pre‑training—world facts, language patterns, reasoning strategies, cultural knowledge. It is always available but has hard limits: the model cannot learn new facts after the training cutoff, cannot be updated without fine‑tuning, is opaque, and may hallucinate.
For time‑sensitive, domain‑specific, or private information, rely on external, episodic, or in‑context memory; treat parametric memory as a fallback for general world knowledge.
Correct mental model: parametric memory is the agent’s general education, while external, episodic, and in‑context memories are its on‑the‑job experience. The best agents combine both.
3. Memory Flow in the Agent Loop
Each request follows these steps:
Retrieve relevant memories (semantic search) and similar past episodes.
Inject the retrieved context into the system prompt.
Call the LLM to generate a response.
Store the interaction and episode for future use.
Memory operations wrap the LLM call: first retrieve, then write back. The model itself remains stateless; the memory layer gives the illusion of state.
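The four steps reduce to a thin wrapper around a stateless model call. In this sketch `retrieve`, `llm`, and `store` are injected callables (stubbed below so the skeleton runs without API keys); the full implementation with real back‑ends follows in section 4.

```python
def memory_wrapped_call(user_message: str, retrieve, llm, store) -> str:
    """Retrieve -> inject -> generate -> write back; the model itself holds no state."""
    context = retrieve(user_message)                      # step 1: pull relevant memories
    system = "You are a helpful agent."
    if context:
        system += f"\n\n## Relevant memories\n{context}"  # step 2: inject into system prompt
    answer = llm(system=system, user=user_message)        # step 3: stateless model call
    store(f"User asked: {user_message}")                  # step 4: write back for next time
    return answer

# Stub dependencies to show the flow end to end
log: list[str] = []
answer = memory_wrapped_call(
    "What theme do I prefer?",
    retrieve=lambda q: "- user prefers dark mode",
    llm=lambda system, user: f"(answer given {len(system)} chars of system prompt)",
    store=log.append,
)
```

Because all state lives behind `retrieve` and `store`, the LLM can be swapped or restarted freely; the memory layer alone provides the illusion of continuity.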
4. Building the Memory Layer (Python)
4.1 MemoryStore class
import chromadb
from openai import OpenAI
from datetime import datetime
import json, uuid
class MemoryStore:
"""Persistent vector memory for an AI agent."""
def __init__(self, agent_id: str, persist_dir: str = "./memory_db"):
self.agent_id = agent_id
self.openai = OpenAI()
# ChromaDB stores vectors on disk, persists across restarts
self.client = chromadb.PersistentClient(path=persist_dir)
self.collection = self.client.get_or_create_collection(
name=f"agent_{agent_id}_memories",
metadata={"hnsw:space": "cosine"}
)
def _embed(self, text: str) -> list[float]:
"""Convert text to embedding vector using OpenAI."""
response = self.openai.embeddings.create(model="text-embedding-3-small", input=text)
return response.data[0].embedding
def remember(self, content: str, memory_type: str = "general", metadata: dict | None = None) -> str:
"""Store a memory. Returns the memory ID."""
memory_id = str(uuid.uuid4())
embedding = self._embed(content)
meta = {
"type": memory_type,
"timestamp": datetime.utcnow().isoformat(),
"agent_id": self.agent_id,
**(metadata or {})
}
self.collection.add(ids=[memory_id], embeddings=[embedding], documents=[content], metadatas=[meta])
return memory_id
def recall(self, query: str, k: int = 5, memory_type: str | None = None, min_relevance: float = 0.6) -> list[dict]:
"""Retrieve the k most relevant memories for a query."""
query_embedding = self._embed(query)
where = {"type": memory_type} if memory_type else None
results = self.collection.query(
query_embeddings=[query_embedding],
n_results=k,
where=where,
include=["documents", "metadatas", "distances"]
)
memories = []
for doc, meta, dist in zip(
results["documents"][0],
results["metadatas"][0],
results["distances"][0]
):
relevance = 1 - dist # cosine distance → similarity
if relevance >= min_relevance:
memories.append({"content": doc, "metadata": meta, "relevance": round(relevance, 3)})
return sorted(memories, key=lambda x: x["relevance"], reverse=True)
def forget(self, memory_id: str):
"""Delete a specific memory (GDPR compliance, stale data, etc.)"""
self.collection.delete(ids=[memory_id])
4.2 EpisodicLogger class
from .store import MemoryStore
from dataclasses import dataclass, asdict
from typing import Optional
import time
@dataclass
class Episode:
task: str
approach: str
outcome: str # "success" | "partial" | "failure"
duration_ms: int
token_cost: int
quality_score: float # 0.0 – 1.0
notes: str = ""
error: Optional[str] = None
class EpisodicLogger:
def __init__(self, memory_store: MemoryStore):
self.store = memory_store
def log(self, episode: Episode):
"""Save an episode to memory as a searchable document."""
doc = (
    f"Task: {episode.task}\n"
    f"Approach: {episode.approach}\n"
    f"Outcome: {episode.outcome}\n"
    f"Notes: {episode.notes}"
)
self.store.remember(
content=doc,
memory_type="episode",
metadata={
"outcome": episode.outcome,
"quality_score": episode.quality_score,
"duration_ms": episode.duration_ms,
"token_cost": episode.token_cost,
},
)
def recall_similar(self, task: str, k: int = 3) -> list[dict]:
"""Find past episodes similar to the current task."""
return self.store.recall(query=task, k=k, memory_type="episode", min_relevance=0.65)
4.3 Memory‑augmented Agent
import anthropic
from memory.store import MemoryStore
from memory.episodic import EpisodicLogger, Episode
import time
class MemoryAugmentedAgent:
def __init__(self, agent_id: str):
self.client = anthropic.Anthropic()
self.memory = MemoryStore(agent_id)
self.episodes = EpisodicLogger(self.memory)
def _build_memory_context(self, user_message: str) -> str:
"""Retrieve relevant memories and format them for injection."""
memories = self.memory.recall(user_message, k=4)
episodes = self.episodes.recall_similar(user_message, k=2)
parts = []
if memories:
parts.append("## Relevant memories\n" + "\n".join(
    f"- [{m['metadata']['type']}] {m['content']} (relevance: {m['relevance']})"
    for m in memories
))
if episodes:
parts.append("## Past similar tasks\n" + "\n".join(
    f"- {e['content'][:200]}..." for e in episodes
))
return "\n".join(parts) if parts else ""
def run(self, user_message: str) -> str:
start = time.time()
memory_context = self._build_memory_context(user_message)
system = """You are a helpful agent with memory.
You have access to relevant context from past interactions.
Use this context to give better, more personalized responses.
"""
if memory_context:
system += f"\n{memory_context}"
response = self.client.messages.create(
model="claude-opus-4-6",
max_tokens=1024,
system=system,
messages=[{"role": "user", "content": user_message}],
)
answer = response.content[0].text
duration = int((time.time() - start) * 1000)
# Store the interaction
self.memory.remember(content=f"User asked: {user_message[:200]}", memory_type="interaction")
# Log the episode
self.episodes.log(Episode(
task=user_message[:200],
approach="single-turn with memory retrieval",
outcome="success",
duration_ms=duration,
token_cost=response.usage.input_tokens + response.usage.output_tokens,
quality_score=1.0,
))
return answer
5. Vector Database
5.1 Similarity Search Principle
Each memory is turned into a 1,536‑dimensional float vector using OpenAI’s embedding model. Similar texts produce similar vectors. At query time the system embeds the query and finds the nearest vectors by cosine similarity.
import numpy as np
def cosine_similarity(a: list, b: list) -> float:
"""1.0 = identical meaning, 0.0 = unrelated, -1.0 = opposite meaning"""
a, b = np.array(a), np.array(b)
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Example
embedding_a = embed("The user prefers dark mode")
embedding_b = embed("They like their interface theme to be dark")
score = cosine_similarity(embedding_a, embedding_b)  # → ~0.91
Local development uses ChromaDB. For production you may switch to pgvector (if using Postgres), or to managed services like Pinecone or Qdrant for larger scale.
6. Memory Management
6.1 Time‑based Decay
Older memories are usually less relevant. The following scoring function, inspired by the Generative Agents paper (Park et al., 2023), combines relevance, importance, and recency.
import math
from datetime import datetime
def memory_score(
relevance: float, # cosine similarity 0–1
importance: float, # stored at write time 0–1
created_at: datetime, # when memory was formed
recency_weight: float = 0.3,
decay_factor: float = 0.995,
) -> float:
"""Balance relevance, importance, and recency."""
hours_old = (datetime.utcnow() - created_at).total_seconds() / 3600
recency = math.pow(decay_factor, hours_old)
return (
relevance * 0.4 +
importance * 0.3 +
recency * recency_weight
)
6.2 Importance Scoring at Write‑time
When storing a memory, the agent asks the LLM to rate its importance on a 0.0–1.0 scale and only keeps high‑scoring items.
import re
async def score_importance(client, content: str) -> float:
"""Ask the LLM if the information is worth saving (0.0‑1.0)."""
prompt = f"""Rate the importance of saving this for future interactions.
0.0 = trivial (greeting)
0.5 = moderately useful
1.0 = critical (preferences, errors, decisions)
Information: {content}
Reply with ONLY the number."""
try:
response = await client.messages.create(
model="claude-3-haiku-20240307",
max_tokens=10,
messages=[{"role": "user", "content": prompt}],
)
text = response.content[0].text.strip()
match = re.search(r"[-+]?\d*\.\d+|\d+", text)
if match:
score = float(match.group())
return max(0.0, min(1.0, score))
except Exception:
pass
return 0.5  # fallback
6.3 Periodic Consolidation
Every night a task merges near‑duplicate memories into a single concise summary, similar to human sleep‑time memory consolidation.
async def consolidate_memories(store: MemoryStore, similarity_threshold: float = 0.92):
"""Efficiently merge near‑duplicate memories using vector search."""
all_mems = store.collection.get(include=["documents", "embeddings"])  # ids are always returned
if not all_mems["ids"]:
return
visited = set()
consolidated = []
for mem_id, doc, emb in zip(all_mems["ids"], all_mems["documents"], all_mems["embeddings"]):
if mem_id in visited:
continue
results = store.collection.query(
query_embeddings=[emb],
n_results=10,
include=["documents", "distances"],
)
group = [doc]
visited.add(mem_id)
for res_id, res_doc, dist in zip(
results["ids"][0], results["documents"][0], results["distances"][0]
):
sim = 1.0 - dist
if res_id != mem_id and res_id not in visited and sim >= similarity_threshold:
group.append(res_doc)
visited.add(res_id)
if len(group) > 1:
summary = await summarize_group(group) # assumed external summarizer
consolidated.append(summary)
else:
consolidated.append(doc)
store.collection.delete(ids=all_mems["ids"])  # clear the old records before re-inserting
for doc in consolidated:
store.remember(doc)  # remember() is synchronous
7. Conclusion
Without a memory layer an agent starts each interaction from a blank slate. A well‑designed memory system—deciding what to remember, what to forget, and how to retrieve—enables the agent to retain identity, maintain context, and continuously learn, dramatically narrowing the gap between a stateless chatbot and a truly intelligent, evolving assistant.
AI Tech Publishing
In a fast-evolving AI era, we thoroughly explain the stable technical foundations.