How to Build Short‑Term and Long‑Term Memory for LLM Agents Using Vector DBs and RAG

The article analyzes Agent memory design by comparing human short‑term and long‑term memory, explains context‑window management strategies, outlines persistent storage options such as vector databases, relational stores, knowledge graphs and fine‑tuning, and presents a three‑layer architecture with write, retrieval and forgetting mechanisms.

Linyb Geek Road
Linyb Geek Road
Linyb Geek Road
How to Build Short‑Term and Long‑Term Memory for LLM Agents Using Vector DBs and RAG

Problem Analysis

The interview question probes deep understanding of Agent architecture, expecting candidates to define short‑term and long‑term memory, describe concrete implementations, and discuss engineering challenges such as limited context windows and persistent storage.

Human Memory Analogy

Human memory is divided into short‑term (working) memory with limited capacity and long‑term memory with virtually unlimited capacity. Mapping this to Agents, short‑term memory corresponds to the current dialogue context, while long‑term memory corresponds to cross‑conversation persistent knowledge.

Short‑Term Memory: Managing the Context Window

Agents use the LLM's context window to hold short‑term memory. Although models now support up to 128K tokens, complex tasks can quickly fill the window, leading to the "Lost in the Middle" phenomenon where middle‑position information receives less attention.

Sliding Window : truncate the earliest messages when the window exceeds its limit; simple but can discard crucial early information.

Conversation Summary : compress older dialogue into a summary using an LLM. LangChain implements this with ConversationSummaryMemory and ConversationSummaryBufferMemory, preserving recent turns while summarizing earlier ones.

Token Buffer : set a hard token limit (e.g., 4000 tokens) and drop earliest messages token‑by‑token until under the limit, offering finer control than round‑based truncation.

Importance‑Based Retention : evaluate each message's importance (e.g., user requirements vs. chit‑chat) and keep important messages while discarding less relevant ones, typically requiring an additional LLM call for scoring.

1
1

Long‑Term Memory: Persistent Knowledge Across Sessions

Long‑term memory stores cross‑dialogue knowledge and can be split into explicit memory (facts and events) and implicit memory (skills baked into model parameters via fine‑tuning).

Common engineering solutions include:

Vector Database + RAG : embed summaries, user profiles, domain documents and store them in Milvus, Pinecone, Chroma, Weaviate, etc. Retrieval is performed by similarity search, allowing semantic matching even when wording differs.

Relational/Key‑Value Store : store highly structured data such as user profiles in MySQL or PostgreSQL for precise queries.

Knowledge Graph : capture entity relationships (e.g., "Alice is a product manager") in graph databases like Neo4j, enabling graph queries during reasoning.

Fine‑tuning (Implicit Memory) : inject domain knowledge into model parameters, eliminating extra retrieval steps but requiring costly re‑training for updates.

4
4

Write, Retrieval, and Update Mechanisms

Write : after each turn, a memory‑management module decides whether information should be persisted, typically after a summary extraction step that compresses the dialogue.

Retrieval : before handling a new task, the Agent retrieves relevant long‑term memories. Retrieval combines vector similarity with metadata filtering (e.g., time range, user ID) and reranking (e.g., cross‑encoder) to improve precision.

Update & Forgetting : to avoid unbounded growth, mechanisms such as time‑decay weighting, periodic deduplication, and explicit user‑driven corrections are applied.

5
5

Practical Layered Memory Architecture

A three‑layer design is commonly used:

Instant Context : the current LLM prompt containing system prompts, recent turns, and injected long‑term memories.

Session Cache : full conversation history stored in an in‑memory database like Redis; when the context window overflows, relevant parts are retrieved or summarized from the cache.

Persistent Store : long‑term memory residing in vector databases, relational databases, and knowledge graphs.

Information flows bidirectionally: important short‑term data is distilled into long‑term storage, and needed long‑term knowledge is fetched back into the instant context, achieving fast responses with virtually unlimited memory capacity.

6
6
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

LLMLangChainRAGvector databaseAgent Memorycontext window
Linyb Geek Road
Written by

Linyb Geek Road

Tech notes

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.