Designing Agent Memory Systems: Short‑Term, Long‑Term, and Knowledge Graph Layers
The article breaks down how to build a three‑layer memory architecture for AI agents—short‑term context windows with sliding‑window summarization, long‑term semantic retrieval via vector databases with selective storage and time decay, and a knowledge‑graph layer for relational reasoning—plus implementation tips and common pitfalls.
An agent memory system layers information into what just happened, what has been learned, and what has accumulated over the long term, analogous to human working, episodic, and semantic memory.
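Before drilling into each layer, a minimal sketch of the three layers as TypeScript interfaces may help frame the architecture. The interface names and method signatures here are illustrative assumptions, not an API from the article (the Message and MemoryNode types are defined in the code sections below).

// Illustrative framing only – names and signatures are assumptions
interface ShortTermMemory {
  getWindow(): Message[];                  // working memory: the managed context window
  append(message: Message): void;
}
interface LongTermMemory {
  save(content: string, metadata: Record<string, unknown>): Promise<void>;
  recall(query: string, userId: string): Promise<string[]>; // episodic recall
}
interface KnowledgeGraphMemory {
  upsert(node: MemoryNode): void;          // semantic memory: entities and relations
  getRelated(nodeId: string, depth?: number): MemoryNode[];
}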
Short‑Term Memory – Proper Use of the Context Window
Appending every message to the context window quickly exhausts the token limit and causes early content to be forgotten after about 20 rounds.
// ❌ Naïve implementation – the context grows without bound until it
// overflows the model's token limit
type Message = { role: "user" | "assistant"; content: string };

const messages: Message[] = [];

async function chat(userInput: string) {
  messages.push({ role: "user", content: userInput });
  const response = await llm.invoke(messages); // context keeps growing
  messages.push({ role: "assistant", content: response.content as string });
  return response.content;
}

The correct approach combines a sliding window with model-based summarization: keep the most recent 10 messages (5 rounds) and compress older messages into a short summary that is injected as a system message.
import { ChatOpenAI } from "@langchain/openai";
import { SystemMessage, HumanMessage } from "@langchain/core/messages";

const llm = new ChatOpenAI({ model: "gpt-4o-mini" });

async function manageHistory(
  history: Message[],
  maxTokens = 4000
): Promise<Array<Message | SystemMessage>> {
  const currentTokens = estimateTokens(history);
  if (currentTokens <= maxTokens) return history;

  const recent = history.slice(-10); // keep recent context
  const older = history.slice(0, -10);

  const summary = await llm.invoke([
    new SystemMessage("Summarize the following conversation history into ≤200 characters, preserving key decisions, user preferences, and important conclusions."),
    new HumanMessage(older.map(m => `${m.role}: ${m.content}`).join("\n"))
  ]);

  return [
    new SystemMessage(`Conversation summary: ${summary.content}`),
    ...recent,
  ];
}

// Rough heuristic: ~4 characters per token for English text
function estimateTokens(messages: Message[]): number {
  return messages.reduce((sum, m) => sum + m.content.length / 4, 0);
}

Key takeaway: Short‑term memory is not "the longer the better"; a sliding window plus summarization balances cost and quality.
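As a usage sketch, here is the naive chat() from earlier rewritten to route through manageHistory. The wiring below is an assumption for illustration, reusing the global messages array and llm from the snippets above.

// Sketch: chat() routed through manageHistory (assumed wiring)
async function chatWithWindow(userInput: string) {
  messages.push({ role: "user", content: userInput });
  // Compress older turns before every model call; the latest user
  // message is already inside the managed window
  const managed = await manageHistory(messages);
  const response = await llm.invoke(managed);
  messages.push({ role: "assistant", content: response.content as string });
  return response.content;
}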
Long‑Term Memory – Vector Database for Semantic Retrieval
Long‑term memory answers the question "Can you recall what I told you a month ago?". It stores selected dialogue snippets as embeddings and retrieves them by semantic similarity.
import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { Document } from "@langchain/core/documents";

const embeddings = new OpenAIEmbeddings();
const vectorStore = new MemoryVectorStore(embeddings); // replace with Pinecone / Weaviate / Chroma in production

// ✅ Selective storage – only keep valuable information
async function saveToLongTermMemory(content: string, metadata: {
  type: "preference" | "fact" | "decision" | "task";
  importance: "high" | "medium" | "low";
  userId: string;
}) {
  if (metadata.importance === "low") return; // discard noise
  await vectorStore.addDocuments([
    new Document({
      pageContent: content,
      metadata: { ...metadata, timestamp: Date.now() }
    })
  ]);
}

// Retrieval with time decay – recent memories weigh more.
// Note: MemoryVectorStore takes a predicate filter; hosted stores such as
// Pinecone or Chroma accept metadata-filter objects instead.
async function recallMemory(query: string, userId: string) {
  const results = await vectorStore.similaritySearchWithScore(
    query, 5, (doc) => doc.metadata.userId === userId
  );
  const now = Date.now();
  const decayed = results.map(([doc, score]) => {
    const ageInDays = (now - doc.metadata.timestamp) / (1000 * 60 * 60 * 24);
    const decayFactor = ageInDays > 30 ? 0.7 : 1.0; // step decay after 30 days
    return { doc, score: score * decayFactor };
  });
  return decayed.filter(r => r.score > 0.7).map(r => r.doc.pageContent);
}

// Inject retrieved memories before answering
async function agentWithMemory(userInput: string, userId: string) {
  const memories = await recallMemory(userInput, userId);
  const systemPrompt = memories.length > 0
    ? `You remember the following about this user:\n${memories.join("\n")}\nAnswer based on these memories.`
    : "You are an assistant.";
  const response = await llm.invoke([
    new SystemMessage(systemPrompt),
    new HumanMessage(userInput)
  ]);
  return response.content;
}

Key takeaway: Long‑term memory is not a dump of every utterance; selective storage plus time‑based decay keeps the signal‑to‑noise ratio high.
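The step decay above (factor 0.7 after 30 days) is easy to reason about; a smooth exponential decay is a common alternative. The 30-day half-life below is an illustrative assumption, not a value from the article.

// Alternative: exponential decay with an assumed 30-day half-life
function decayScore(score: number, timestamp: number, halfLifeDays = 30): number {
  const ageInDays = (Date.now() - timestamp) / (1000 * 60 * 60 * 24);
  return score * Math.pow(0.5, ageInDays / halfLifeDays);
}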
Knowledge Graph – Structured Knowledge for Relational Reasoning
The knowledge‑graph layer enables the agent to infer new facts from existing connections (e.g., if a user likes React and works on an AI project, the agent can recommend Vercel AI SDK).
interface MemoryNode {
  id: string;
  type: "person" | "project" | "preference" | "event";
  attributes: Record<string, unknown>;
  relations: Array<{ type: string; targetId: string }>;
  embedding?: number[]; // optional vector for fuzzy search
}

class StructuredMemoryStore {
  private nodes = new Map<string, MemoryNode>();

  // Upsert merges attributes instead of overwriting
  upsert(node: MemoryNode) {
    const existing = this.nodes.get(node.id);
    if (existing) {
      // Dedupe relations by key – a Set of objects compares references,
      // not values, so it would not actually remove duplicates
      const merged = new Map(
        [...existing.relations, ...node.relations]
          .map(r => [`${r.type}:${r.targetId}`, r] as const)
      );
      this.nodes.set(node.id, {
        ...existing,
        attributes: { ...existing.attributes, ...node.attributes },
        relations: [...merged.values()],
      });
    } else {
      this.nodes.set(node.id, node);
    }
  }

  // Retrieve nodes within `depth` hops
  getRelated(nodeId: string, depth = 2): MemoryNode[] {
    const visited = new Set<string>();
    const result: MemoryNode[] = [];
    const traverse = (id: string, currentDepth: number) => {
      if (currentDepth === 0 || visited.has(id)) return;
      visited.add(id);
      const node = this.nodes.get(id);
      if (!node) return;
      result.push(node);
      node.relations.forEach(({ targetId }) => traverse(targetId, currentDepth - 1));
    };
    traverse(nodeId, depth);
    return result;
  }

  // Serialize for LLM consumption – include relations so the model
  // can reason over the edges, not just node attributes
  toContextString(nodeId: string): string {
    const related = this.getRelated(nodeId);
    return related.map(n => {
      const rels = n.relations.map(r => `${r.type} → ${r.targetId}`).join(", ");
      return `[${n.type}] ${JSON.stringify(n.attributes)}${rels ? ` (${rels})` : ""}`;
    }).join("\n");
  }
}

// Example usage
const memStore = new StructuredMemoryStore();
memStore.upsert({
  id: "user_james",
  type: "person",
  attributes: { name: "James", role: "frontend_developer" },
  relations: [
    { type: "works_on", targetId: "project_ai_assistant" },
    { type: "prefers", targetId: "tech_react" }
  ]
});
memStore.upsert({
  id: "tech_react",
  type: "preference",
  attributes: { name: "React", category: "frontend_framework" },
  relations: [{ type: "has_ecosystem", targetId: "tech_vercel_ai_sdk" }]
});
const context = memStore.toContextString("user_james"); // yields structured lines

Key takeaway: Knowledge graphs enable relational inference; a hybrid of structured JSON and optional vectors offers the best cost‑performance for most projects.
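To close the loop on the React → Vercel AI SDK example, here is a minimal sketch of handing the graph context to the model for a recommendation. The function name and prompt wording are assumptions for illustration.

// Hypothetical: the 2-hop context (James → React → Vercel AI SDK)
// gives the model grounds for a concrete recommendation
async function recommendForUser(userId: string) {
  const graphCtx = memStore.toContextString(userId);
  const response = await llm.invoke([
    new SystemMessage(`You know the following about the user and their stack:\n${graphCtx}`),
    new HumanMessage("Suggest one tool that fits this user's current project.")
  ]);
  return response.content;
}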
Three‑Layer Collaboration – A Fully‑Aware Agent
import { ChatOpenAI } from "@langchain/openai";
import { SystemMessage, HumanMessage } from "@langchain/core/messages";

class MemoryAwareAgent {
  private llm = new ChatOpenAI({ model: "gpt-4o" });
  private shortTermHistory: Message[] = [];
  // Assumed wrapper around the save/recall functions from the
  // long-term memory section (the class itself is not defined here)
  private longTermStore: LongTermMemoryStore;
  private knowledgeGraph: StructuredMemoryStore;

  constructor(private userId: string) {
    this.longTermStore = new LongTermMemoryStore();
    this.knowledgeGraph = new StructuredMemoryStore();
  }

  async chat(userInput: string): Promise<string> {
    // Step 1: parallel retrieval of long-term memory and graph context
    const [longTermMemories, graphContext] = await Promise.all([
      this.longTermStore.recall(userInput, this.userId),
      Promise.resolve(this.knowledgeGraph.toContextString(this.userId))
    ]);
    // Step 2: build system prompt
    const systemPrompt = this.buildSystemPrompt(longTermMemories, graphContext);
    // Step 3: manage short-term history (sliding window + summary)
    const managedHistory = await manageHistory(this.shortTermHistory);
    // Step 4: call the model
    const response = await this.llm.invoke([
      new SystemMessage(systemPrompt),
      ...managedHistory,
      new HumanMessage(userInput)
    ]);
    // Step 5: update short-term history
    this.shortTermHistory.push(
      { role: "user", content: userInput },
      { role: "assistant", content: response.content as string }
    );
    // Step 6: asynchronously decide whether to persist to long-term
    // memory – deliberately not awaited so it never blocks the reply
    void this.asyncSaveMemory(userInput, response.content as string);
    return response.content as string;
  }

  private buildSystemPrompt(memories: string[], graphCtx: string): string {
    const parts = ["You are a memory-enabled AI assistant."];
    if (graphCtx) parts.push(`\nAbout the user you know:\n${graphCtx}`);
    if (memories.length) parts.push(`\nRelevant past memories:\n${memories.join("\n")}`);
    return parts.join("\n");
  }

  private async asyncSaveMemory(input: string, output: string) {
    const shouldSave = await this.llm.invoke([
      new SystemMessage("Answer yes or no: Does the following conversation contain information worth storing for long-term memory?"),
      new HumanMessage(`User: ${input}\nAssistant: ${output}`)
    ]);
    if (shouldSave.content.toString().toLowerCase().includes("yes")) {
      await this.longTermStore.save(`User said: ${input}, Assistant replied: ${output}`, {
        type: "conversation",
        importance: "high",
        userId: this.userId
      });
    }
  }
}

Key takeaway: Each layer has a distinct role; parallel retrieval and asynchronous storage keep the agent responsive while providing persistent memory.
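A usage sketch of the assembled agent (assuming top-level await):

const agent = new MemoryAwareAgent("user_james");
const reply = await agent.chat("Which frontend framework do I prefer?");
// The reply can draw on the summarized window, recalled memories, and graph context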
Common Pitfalls
Pitfall 1: Storing Every Utterance in the Vector Store
// ❌ Storing every message creates noise
await vectorStore.addDocuments([new Document({ pageContent: msg.content })]);

Result: the store fills with filler phrases like "okay" or "got it", degrading retrieval quality.
// ✅ Let the model decide if the message is worth storing
const worthSaving = await llm.invoke([
  new SystemMessage("Does this sentence contain key information worth remembering? yes/no"),
  new HumanMessage(msg.content)
]);
if (worthSaving.content.toString().toLowerCase().includes("yes")) {
  await vectorStore.addDocuments([/* … */]);
}

Pitfall 2: Retrieval Without User Isolation
// ❌ Global search mixes memories across users
const results = await vectorStore.similaritySearch(query, 5);

Fix: filter by userId during retrieval.
// ✅ Scoped search (metadata-filter form shown; MemoryVectorStore uses a predicate instead)
const results = await vectorStore.similaritySearch(query, 5, { userId: currentUserId });

Pitfall 3: Hard Truncation of Short‑Term History
// ❌ Simple slice drops essential context
if (messages.length > 20) messages = messages.slice(-20);

Solution: summarize older messages before truncation.
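The ✅ fix below calls a summarizeOlderMessages helper that is not defined in the article; here is a minimal sketch, assuming it reuses the same summarization prompt as manageHistory:

// Assumed helper – mirrors the summarization prompt from manageHistory
async function summarizeOlderMessages(older: Message[]): Promise<string> {
  const response = await llm.invoke([
    new SystemMessage("Summarize this conversation history into ≤200 characters, preserving key decisions, user preferences, and important conclusions."),
    new HumanMessage(older.map(m => `${m.role}: ${m.content}`).join("\n"))
  ]);
  return response.content as string;
}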
// ✅ Summarize then keep recent part
const summary = await summarizeOlderMessages(messages.slice(0, -20));
messages = [new SystemMessage(`History summary: ${summary}`), ...messages.slice(-20)];

Pitfall 4: Injecting Too Many Memories Into the Prompt
// ❌ Adding 20 retrieved memories blows up the context
const systemPrompt = `You know:\n${memories.join("\n")}`;

Best practice: keep only highly relevant memories (score > 0.8) and limit to a few entries.
// ✅ Filter and limit
const results = await vectorStore.similaritySearchWithScore(query, 5);
const relevant = results.filter(([, score]) => score > 0.8).map(([doc]) => doc.pageContent);

Pitfall 5: Forgetting Expiration (TTL) for Memories
// ✅ Store with TTL and filter out-of-date entries
await vectorStore.addDocuments([new Document({
  pageContent: content,
  metadata: { timestamp: Date.now(), ttlDays: 90 }
})]);

// Filter syntax ($gt) is store-specific – Pinecone-style shown here;
// MemoryVectorStore would need a predicate instead
const cutoff = Date.now() - 90 * 24 * 60 * 60 * 1000;
const results = await vectorStore.similaritySearch(query, 5, { filter: { timestamp: { $gt: cutoff } } });

Selection Reference
Single‑user simple chatbot: MemoryVectorStore (LangChain in‑memory). Good for development, no persistence.
Multi‑user production: Chroma / Pinecone / Weaviate. Chroma is easy to self‑host; Pinecone is managed (a swap-in sketch follows this list).
Need relational reasoning: Neo4j + vector hybrid. Higher complexity, not required for most agents.
Enterprise internal knowledge base: pgvector (PostgreSQL extension). Leverages existing PostgreSQL infrastructure.
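As referenced above, swapping the in-memory store for a production one is mostly a constructor change. A minimal sketch using Chroma follows; the collection name and URL are illustrative assumptions.

import { Chroma } from "@langchain/community/vectorstores/chroma";

// Hypothetical self-hosted Chroma instance – adjust URL and collection name
const prodStore = new Chroma(new OpenAIEmbeddings(), {
  collectionName: "agent-long-term-memory",
  url: "http://localhost:8000",
});
// The addDocuments / similaritySearchWithScore calls remain unchanged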
Pre‑Release Checklist
Short‑term memory uses sliding window + summarization, not blind appending.
Vector store retrieval includes user‑level filtering.
Importance check before persisting to long‑term store.
Only inject memories with relevance > 0.7 and limit to ≤ 5 items.
Memories have TTL; they are not permanent.
Long‑term storage is asynchronous, not blocking the response.
Conclusion
Short‑term memory: Context window should be managed with a sliding window and model‑based summarization.
Long‑term memory: Vector databases provide semantic retrieval; selective storage and time decay keep relevance high.
Knowledge graph: Handles relational inference; a hybrid of structured JSON and vectors offers the best cost‑performance.
Three‑layer collaboration: Parallel retrieval and asynchronous storage give agents memory without slowing them down.
Common pitfalls: Over‑storing, missing user isolation, hard truncation, over‑injecting, and forgetting expiration.
James' Growth Diary
I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines core theories and practices of agents, and “Claude Code Design Philosophy,” which deeply analyzes the design thinking behind top AI tools. Helping you build a solid foundation in the AI era.
