Choosing the Right AI Memory: Truncation, Summarization, or Vector Retrieval
This article breaks down LangChain.js's three memory strategies—window truncation, summary compression, and vector‑store retrieval—covering how each works, how to set it up in code, and its trade‑offs in token cost and information retention, and it closes with a decision guide for choosing the right approach in multi‑turn LLM conversations.
Why Memory Management Matters
LLMs are stateless; each API call forgets previous turns. To simulate memory you must feed past messages back via the messages array. However, context windows are limited (e.g., GPT‑4o has 128K tokens) and token usage directly translates to cost, while early turns may contain crucial information that naive truncation discards.
Thus memory management is essentially an information‑compression‑and‑retrieval problem: keep the most valuable context within a token budget.
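To make this concrete, here is a minimal sketch of what every memory class ultimately automates, written against the official openai Node SDK (the model name and message contents are illustrative): the full transcript must travel with every request, and it grows without bound.
import OpenAI from "openai";
const client = new OpenAI();
// The model remembers nothing between calls, so we keep the transcript ourselves.
const history: OpenAI.Chat.ChatCompletionMessageParam[] = [
  { role: "system", content: "You are a helpful assistant." },
];
async function chat(userInput: string): Promise<string> {
  history.push({ role: "user", content: userInput });
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: history, // the entire history is re-sent on every turn
  });
  const reply = res.choices[0].message.content ?? "";
  history.push({ role: "assistant", content: reply });
  return reply; // history keeps growing, hence the three strategies below
}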
Three Memory Strategies in LangChain.js
LangChain.js offers three mainstream memory implementations:
Window Truncation – keep only the most recent N turns ("fish memory").
Summary Memory – compress older history into a concise summary ("meeting minutes").
Vector Store Retrieval – store each turn in a vector database and retrieve relevant snippets on demand ("notebook + search").
1. Window Truncation (BufferWindowMemory)
Retains the latest K turns and discards earlier ones. Suitable for casual chat where long‑range context is unnecessary.
import { ChatOpenAI } from "@langchain/openai";
import { ConversationChain } from "langchain/chains";
import { BufferWindowMemory } from "langchain/memory";
import { ChatPromptTemplate, MessagesPlaceholder } from "@langchain/core/prompts";
const model = new ChatOpenAI({ modelName: "gpt-4o-mini", temperature: 0.7 });
const memory = new BufferWindowMemory({
k: 5, // keep the last 5 turns
returnMessages: true,
memoryKey: "chat_history",
});
const prompt = ChatPromptTemplate.fromMessages([
["system", "You are a friendly AI assistant."],
new MessagesPlaceholder("chat_history"),
["human", "{input}"],
]);
const chain = new ConversationChain({ llm: model, memory, prompt });
const vars = await memory.loadMemoryVariables({});
console.log("Current window:", vars);
await memory.saveContext(
  { input: "What's the difference between LangChain and LangGraph?" },
  { output: "LangChain is about chained calls; LangGraph is a directed graph" }
);
2. Summary Memory (ConversationSummaryBufferMemory)
Keeps recent turns verbatim and compresses older ones into a running summary with a (typically cheaper) LLM once the buffer exceeds a token limit. Early information survives, at the cost of extra summarization calls.
import { ChatOpenAI } from "@langchain/openai";
import { ConversationSummaryBufferMemory } from "langchain/memory";
import { ConversationChain } from "langchain/chains";
import { ChatPromptTemplate, MessagesPlaceholder } from "@langchain/core/prompts";
const cheapModel = new ChatOpenAI({ modelName: "gpt-4o-mini", temperature: 0 });
const smartModel = new ChatOpenAI({ modelName: "gpt-4o" });
// Hybrid: keep recent raw turns, compress older ones
const memory = new ConversationSummaryBufferMemory({
llm: cheapModel,
maxTokenLimit: 1000, // trigger summarization after 1000 tokens
returnMessages: true,
memoryKey: "chat_history",
});
const prompt = ChatPromptTemplate.fromMessages([
["system", "You are a professional AI assistant. Below is a conversation summary to give you context."],
new MessagesPlaceholder("chat_history"),
["human", "{input}"],
]);
const chain = new ConversationChain({ llm: smartModel, memory, prompt });
const res = await chain.invoke({ input: "I'm now running into memory-management problems. Any suggestions?" });
const memVars = await memory.loadMemoryVariables({});
console.log("Current summary:", memVars.chat_history);3. Vector Store Retrieval (VectorStoreRetrieverMemory)
Each turn is embedded and stored in a vector DB. On each new input the most relevant K snippets are retrieved and injected into the prompt, making it ideal for very long conversations and knowledge‑intensive Q&A.
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { VectorStoreRetrieverMemory } from "langchain/memory";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { ConversationChain } from "langchain/chains";
import { ChatPromptTemplate } from "@langchain/core/prompts";
const vectorStore = new MemoryVectorStore(new OpenAIEmbeddings());
const memory = new VectorStoreRetrieverMemory({
vectorStoreRetriever: vectorStore.asRetriever(3), // retrieve top 3 matches
memoryKey: "relevant_history",
inputKey: "input",
});
const prompt = ChatPromptTemplate.fromMessages([
["system", `You are a professional AI assistant. Here are the retrieved relevant snippets:
{relevant_history}
Answer the question using this context.`],
["human", "{input}"],
]);
const chain = new ConversationChain({
llm: new ChatOpenAI({ modelName: "gpt-4o-mini" }),
memory,
prompt,
});
await memory.saveContext({ input: "My name is Xiao Ming, and the vector store I use is Milvus" }, { output: "Got it. Milvus is a good fit for production" });
await chain.invoke({ input: "Which vector store should I use?" }); // retrieves the Milvus info
In production you can replace the in‑memory store with a persistent vector DB such as Milvus:
import { Milvus } from "@langchain/community/vectorstores/milvus";
const milvusStore = await Milvus.fromExistingCollection(
new OpenAIEmbeddings(),
{ collectionName: "conversation_memory", url: "http://localhost:19530" }
);
const persistentMemory = new VectorStoreRetrieverMemory({
vectorStoreRetriever: milvusStore.asRetriever(5),
memoryKey: "relevant_history",
inputKey: "input",
});
Selection Matrix
The three approaches differ along several dimensions:
Implementation complexity: Window truncation ★☆☆ (very simple), Summary ★★☆ (moderate), Vector ★★★ (higher).
Token consumption: Truncation low; Summary medium (extra summarization calls); Vector low (only a few retrieved snippets enter the prompt).
Early‑turn information: Lost with truncation, preserved with summary and vector retrieval.
Best‑fit scenarios: Truncation for <10 turns, Summary for >10 turns where continuity matters, Vector for ultra‑long or knowledge‑dense QA.
Decision Tree
If conversation turns < 10 → use BufferWindowMemory.
If turns > 10 and context continuity is important → use ConversationSummaryBufferMemory (hybrid version).
If conversation is extremely long or requires precise back‑tracking → use VectorStoreRetrieverMemory.
If you want the best of both worlds → combine summary and vector retrieval (see the sketch below).
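LangChain.js supports this hybrid via CombinedMemory, which merges the variables produced by several memory classes into one set of prompt inputs. A sketch of the summary-plus-vector combination (model choices, key names, and the token limit are illustrative; each sub-memory needs a distinct memoryKey and an explicit inputKey):
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { CombinedMemory, ConversationSummaryBufferMemory, VectorStoreRetrieverMemory } from "langchain/memory";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { ConversationChain } from "langchain/chains";
import { ChatPromptTemplate, MessagesPlaceholder } from "@langchain/core/prompts";
// Summary memory preserves continuity; vector memory recalls specifics on demand.
const summaryMemory = new ConversationSummaryBufferMemory({
  llm: new ChatOpenAI({ modelName: "gpt-4o-mini", temperature: 0 }),
  maxTokenLimit: 1000,
  returnMessages: true,
  memoryKey: "chat_history",
  inputKey: "input",
});
const vectorStore = new MemoryVectorStore(new OpenAIEmbeddings());
const vectorMemory = new VectorStoreRetrieverMemory({
  vectorStoreRetriever: vectorStore.asRetriever(3),
  memoryKey: "relevant_history",
  inputKey: "input",
});
// CombinedMemory exposes both chat_history and relevant_history to the prompt.
const memory = new CombinedMemory({ memories: [summaryMemory, vectorMemory] });
const prompt = ChatPromptTemplate.fromMessages([
  ["system", "Relevant earlier snippets:\n{relevant_history}"],
  new MessagesPlaceholder("chat_history"),
  ["human", "{input}"],
]);
const chain = new ConversationChain({ llm: new ChatOpenAI({ modelName: "gpt-4o" }), memory, prompt });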
Common Pitfalls
Pitfall 1: Forgetting returnMessages: true
// ❌ Returns plain strings – ChatModel cannot understand
const memory = new BufferWindowMemory({ k: 5 });
// ✅ Returns Message objects – compatible with ChatModel
const memory = new BufferWindowMemory({ k: 5, returnMessages: true });
Pitfall 2: Mismatched memoryKey and Prompt placeholder
// ❌ memoryKey "history" but Prompt uses "chat_history"
const memory = new ConversationSummaryMemory({ memoryKey: "history" });
const prompt = ChatPromptTemplate.fromMessages([
new MessagesPlaceholder("chat_history"),
]);
// ✅ Keep them consistent
const memory = new ConversationSummaryMemory({ memoryKey: "chat_history" });Pitfall 3: Summary memory adds extra LLM cost
Summarization means additional LLM calls: plain ConversationSummaryMemory re‑summarizes after every turn, while ConversationSummaryBufferMemory only compresses once the buffer exceeds maxTokenLimit. Use a cheap model for the summarizer and set a sensible maxTokenLimit to avoid frequent compression.
Pitfall 4: Vector retrieval may return semantically similar but not exact matches
When a user asks for a specific earlier point, the retrieved snippet might be only loosely related. Combine retrieval with timestamps or explicit indices for precise references.
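One way to enable precise references is to write each turn into the vector store yourself with a turn index and timestamp in the metadata, instead of relying solely on saveContext. A sketch (the saveTurn helper and metadata fields are hypothetical):
import { Document } from "@langchain/core/documents";
import { OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
const vectorStore = new MemoryVectorStore(new OpenAIEmbeddings());
// Attach a turn number and timestamp so retrieved snippets can be cited exactly
// ("in turn 12 you said...") rather than only by semantic similarity.
async function saveTurn(turn: number, input: string, output: string) {
  await vectorStore.addDocuments([
    new Document({
      pageContent: `Human: ${input}\nAI: ${output}`,
      metadata: { turn, timestamp: Date.now() },
    }),
  ]);
}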
Pitfall 5: No isolation in multi‑user scenarios
// ❌ Shared Memory instance across users – history leaks
const sharedMemory = new BufferWindowMemory({ k: 5 });
// ✅ Isolate per user/session
const memoryStore = new Map<string, BufferWindowMemory>();
function getMemory(userId: string) {
if (!memoryStore.has(userId)) {
memoryStore.set(userId, new BufferWindowMemory({ k: 5, returnMessages: true }));
}
return memoryStore.get(userId)!;
}
Checklist
Conversation < 10 turns → BufferWindowMemory
Need to keep early info → ConversationSummaryBufferMemory (hybrid)
Ultra‑long or knowledge‑intensive QA → VectorStoreRetrieverMemory
Code Standards
When using a ChatModel, set returnMessages: true.
Ensure memoryKey matches the Prompt placeholder exactly.
Assign a cheap model to the summarizer to control cost.
Isolate Memory per userId / sessionId in multi‑user apps.
Performance & Cost Tips
Configure maxTokenLimit to limit summarization frequency.
For vector retrieval, a k of 3–5 is usually sufficient.
Use persistent vector stores (Milvus, Pinecone) in production.
Debugging Tricks
Inspect current memory content with memory.loadMemoryVariables({}).
Enable verbose: true on the chain to view the full LLM call chain, as in the sketch below.
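Both tricks in one short sketch (reusing the model, memory, and prompt from the earlier examples):
// Inspect exactly what the memory will inject into the next prompt.
const vars = await memory.loadMemoryVariables({});
console.log("Memory variables:", vars);
// Log every prompt and LLM response as the chain runs.
const chain = new ConversationChain({ llm: model, memory, prompt, verbose: true });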
Conclusion
This article dissected LangChain.js's three memory strategies from implementation to trade‑offs:
Window Truncation: simplest, keeps only the last K turns, best for lightweight chats.
Summary Memory: uses an LLM to compress history, retains early information, ideal for long dialogues.
Vector Store Retrieval: semantic recall of relevant snippets, suited for ultra‑long or knowledge‑dense interactions.
Selection Guideline: few turns → truncation; many turns with continuity → summary; extremely long or precise back‑tracking → vector retrieval.
Key Details: set returnMessages and align memoryKey, isolate memory per user, and watch the extra cost of summarization.
Next up we will explore best practices for designing Memory architectures in real projects, multi‑session management, and how to keep Memory fast and cheap.
James' Growth Diary
I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines core theories and practices of agents, and “Claude Code Design Philosophy,” which deeply analyzes the design thinking behind top AI tools. Helping you build a solid foundation in the AI era.