8 Memory Strategies for AI Agents: From Full Recall to Vector Stores
The article examines eight common AI memory techniques—from simple full‑history retention to sophisticated vector‑store and knowledge‑graph approaches—detailing their principles, Python‑style implementations, advantages, drawbacks, and ideal application scenarios for large‑language‑model agents in production environments.
1. Full Memory – No Forgetting
Stores every turn of the conversation in a list and sends the entire history to the LLM for each inference. Simple implementation:
history = []

def add_message(user_input, ai_response):
    turn = {"user": user_input, "assistant": ai_response}
    history.append(turn)

def get_context(query):
    # Send the entire history on every call; nothing is ever dropped.
    return "\n".join(f"User: {t['user']}\nAI: {t['assistant']}" for t in history)

Pros: Trivial to implement; guarantees no loss of information.
Cons: Context length grows linearly, quickly hitting the model’s token limit; higher latency and cost; early turns are truncated once the window is exceeded.
Suitable for: Very short dialogues or Q&A where every detail must be retained.
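One practical guardrail is to measure the stored history against the model's window before each call. A minimal sketch, assuming the tiktoken tokenizer and an 8k-token window (both illustrative choices, not part of the technique itself):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fits_context(history, limit=8192):
    # Rough check: total tokens across all stored turns vs. the model window.
    text = "\n".join(f"User: {t['user']}\nAI: {t['assistant']}" for t in history)
    return len(enc.encode(text)) <= limit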
2. Sliding Window
Keeps only the most recent WINDOW_SIZE turns, discarding older ones. This mimics human focus on recent context.
memory = []
WINDOW_SIZE = 3  # keep at most 3 turns

def add_message(user_input, ai_response):
    turn = {"user": user_input, "assistant": ai_response}
    memory.append(turn)
    if len(memory) > WINDOW_SIZE:
        memory.pop(0)  # drop the oldest turn

def get_context(query):
    return "\n".join(f"User: {t['user']}\nAI: {t['assistant']}" for t in memory)

Pros: Predictable memory footprint; low overhead.
Cons: Strong forgetting – once a turn slides out it cannot be recovered, harming long‑term coherence.
Suitable for: Short‑conversation tasks such as FAQ bots where long‑term dependencies are minimal.
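The same behavior comes for free from collections.deque, which evicts the oldest entry automatically once maxlen is reached; an equivalent sketch:

from collections import deque

memory = deque(maxlen=3)  # oldest turn is dropped automatically

def add_message(user_input, ai_response):
    memory.append({"user": user_input, "assistant": ai_response})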
3. Relevance Filtering
Assigns a relevance score to each turn; when capacity is exceeded the lowest‑scoring items are removed. Scores can be based on topical similarity, recency, or explicit importance.
memory = []
MAX_ITEMS = 25
next_order = 0  # monotonically increasing insertion index

def add_message(user_input, ai_response):
    global next_order
    item = {
        "user": user_input,
        "assistant": ai_response,
        "order": next_order,                        # preserves chronology after evictions
        "score": evaluate(user_input, ai_response)  # user-defined scoring function
    }
    next_order += 1
    memory.append(item)
    if len(memory) > MAX_ITEMS:
        to_remove = min(memory, key=lambda x: x["score"])  # evict the least relevant item
        memory.remove(to_remove)

def get_context(query):
    # Evictions disturb insertion order, so sort by the "order" field.
    sorted_mem = sorted(memory, key=lambda x: x["order"])
    return "\n".join(f"User: {t['user']}\nAI: {t['assistant']}" for t in sorted_mem)

Pros: Retains high-value information while discarding noise.
Cons: Requires a reliable scoring function; mis‑scoring can delete important content.
Suitable for: Information‑dense dialogues where only key facts need to be kept (e.g., research‑assistant bots).
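Everything hinges on the evaluate function, which the snippet leaves user-defined. A minimal sketch, assuming a hand-picked keyword list as the importance signal plus insertion time as a recency tie-breaker (both assumptions, not a prescribed heuristic):

import time

IMPORTANT_MARKERS = ["remember", "always", "never", "deadline", "allergic"]

def evaluate(user_input, ai_response):
    text = (user_input + " " + ai_response).lower()
    # Explicit importance: turns carrying durable facts score higher.
    importance = sum(1.0 for kw in IMPORTANT_MARKERS if kw in text)
    # Recency tie-breaker: later turns get a marginally higher score.
    recency = time.time() / 1e10
    return importance + recency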
4. Summarization / Compression
Periodically compresses older turns into a concise summary generated by an LLM. The summary replaces the raw text in memory, keeping the context size bounded.
memory = []
summary = None
MAX_LEN = 10  # keep at most 10 turns before summarizing

def add_message(user_input, ai_response):
    global summary
    turn = {"user": user_input, "assistant": ai_response}
    memory.append(turn)
    if len(memory) > MAX_LEN:
        old_turns, recent = memory[:-5], memory[-5:]  # snapshot before mutating the list
        summary_text = summarize(old_turns)           # LLM-generated summary
        summary = merge(summary, summary_text) if summary else summary_text
        memory.clear()
        memory.append({"summary": summary})
        memory.extend(recent)

def get_context(query):
    parts = []
    for item in memory:
        if "summary" in item:
            parts.append(f"Summary of earlier conversation: {item['summary']}")
        else:
            parts.append(f"User: {item['user']}\nAI: {item['assistant']}")
    return "\n".join(parts)

Pros: Drastically reduces context length while preserving long-term key points; improves LLM focus.
Cons: Quality depends on the summarizer; errors or bias in the summary propagate to later responses.
Suitable for: Long conversations where retaining the gist (e.g., AI therapist, personal assistant) is more important than verbatim history.
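The snippet leaves summarize and merge undefined. A minimal sketch using the OpenAI chat API (an assumption; any LLM endpoint works, and the model name is illustrative):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize(turns):
    transcript = "\n".join(f"User: {t['user']}\nAI: {t['assistant']}" for t in turns)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user",
                   "content": f"Summarize the key facts in this dialogue:\n{transcript}"}],
    )
    return resp.choices[0].message.content

def merge(old_summary, new_summary):
    # Naive merge by concatenation; a second LLM pass could compress further.
    return f"{old_summary}\n{new_summary}"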
5. Vector Database (Semantic Retrieval)
Embeds each turn and stores the embedding in an external vector store (e.g., Chroma, Pinecone). At inference time the most semantically similar past turns are retrieved and added to the prompt.
memory = VectorStore()  # any vector store, e.g., Chroma, Pinecone, FAISS

def add_message(user_input, ai_response):
    text = f"User: {user_input}\nAI: {ai_response}"
    embedding = embed(text)      # any embedding model
    memory.add(embedding, text)

def get_context(query):
    q_embedding = embed(query)
    results = memory.search(q_embedding, top_k=3)  # most semantically similar turns
    return "\n".join(results)

Pros: Near-infinite long-term memory; retrieval based on semantic similarity rather than keyword matching.
Cons: Depends on embedding quality; adds compute and infrastructure overhead.
Suitable for: Agents that need to recall facts across sessions, such as personalized assistants or legal‑advice bots.
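For a concrete version, here is a minimal sketch with Chroma (one of the stores named above); its default embedding function handles the embed step, and the collection name is illustrative:

import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient for durability
collection = client.create_collection("conversation_memory")
turn_id = 0

def add_message(user_input, ai_response):
    global turn_id
    collection.add(
        documents=[f"User: {user_input}\nAI: {ai_response}"],
        ids=[str(turn_id)],  # Chroma requires unique string ids
    )
    turn_id += 1

def get_context(query):
    results = collection.query(query_texts=[query], n_results=3)
    return "\n".join(results["documents"][0])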
6. Knowledge Graph (Structured Memory)
Extracts entities, attributes, and relationships from dialogue and stores them as triples in a graph. Retrieval follows relationship paths, enabling multi‑hop reasoning.
graph = KnowledgeGraph()

def add_message(user_input, ai_response):
    full_text = f"User: {user_input}\nAI: {ai_response}"
    triples = extract_triples(full_text)  # LLM-driven (subject, relation, object) extraction
    for s, r, o in triples:
        graph.add_edge(s.strip(), o.strip(), relation=r.strip())

def get_context(query):
    entities = extract_entities(query)
    context = []
    for e in entities:
        context += graph.query(e)  # facts reachable from each mentioned entity
    return "\n".join(context)

Pros: Structured retrieval and explainable reasoning; excels at knowledge-intensive tasks.
Cons: High engineering cost; extraction errors; scalability challenges for very large graphs.
Suitable for: Domains requiring precise factual chains, such as enterprise support or scientific assistants.
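The KnowledgeGraph class is abstract in the snippet. A minimal sketch backed by networkx (an assumption; any graph store works) with one-hop retrieval:

import networkx as nx

graph = nx.MultiDiGraph()  # directed, allows parallel edges with different relations

def add_triple(subject, relation, obj):
    graph.add_edge(subject, obj, relation=relation)

def query_entity(entity):
    # One-hop lookup: every stored fact where the entity is subject or object.
    facts = []
    if entity in graph:
        for _, obj, data in graph.out_edges(entity, data=True):
            facts.append(f"{entity} {data['relation']} {obj}")
        for subj, _, data in graph.in_edges(entity, data=True):
            facts.append(f"{subj} {data['relation']} {entity}")
    return facts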
7. Hierarchical Memory (Short‑term + Long‑term)
Combines a sliding‑window short‑term buffer with a vector‑store long‑term store. Important information is promoted from short‑term to long‑term based on keyword triggers.
short_term = SlidingWindow(max_turns=2)
long_term = VectorDatabase(k=2)
promotion_keywords = ["remember", "always", "never", "I'm allergic",
                      "my ID is", "I like", "I hate"]

def add_message(user_input, ai_response):
    short_term.add(user_input, ai_response)
    # Promote durable facts to long-term storage when a trigger keyword appears.
    if any(kw in user_input for kw in promotion_keywords):
        summary = summarize(user_input + ai_response)
        vector = embed(summary)
        long_term.add(vector, summary)

def get_context(query):
    recent = short_term.get_context()         # assumed to return a string
    vector_query = embed(query)
    related = long_term.search(vector_query)  # assumed to return a list of strings
    return ("[Long-term memory]\n" + "\n".join(related) +
            "\n[Current context]\n" + recent)

Pros: Balances immediacy of recent turns with durability of important historical facts.
Cons: More complex to tune promotion criteria and to merge contexts.
Suitable for: Agents that need both fast reaction to recent inputs and recall of long‑term user preferences.
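An illustrative run (the dialogue is made up) shows why promotion matters:

add_message("I'm allergic to peanuts, by the way.", "Noted, I'll keep that in mind.")
add_message("What's the weather today?", "Sunny and mild.")
add_message("Any good hiking trails nearby?", "Several, along the river.")

# The allergy turn has already left the 2-turn short-term window, but it was
# promoted to long-term memory, so it still surfaces for this query:
print(get_context("suggest a snack for my hike"))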
8. OS‑Style Memory Management (Swap‑like)
Models RAM + disk: a small active window lives in fast memory, while overflow turns are “paged out” to passive storage. When a query references paged‑out content, a page‑fault triggers retrieval and insertion back into active memory.
from collections import deque

active_memory = deque(maxlen=2)  # fast, limited "RAM"
passive_memory = {}              # persistent "disk" storage
turn_id = 0

def add_message(user_input, ai_response):
    global turn_id
    turn = f"User: {user_input}\nAI: {ai_response}"
    if len(active_memory) >= 2:
        old_id, old_turn = active_memory.popleft()
        passive_memory[old_id] = old_turn  # page out the coldest turn
    active_memory.append((turn_id, turn))
    turn_id += 1

def get_context(query):
    context = "\n".join(t[1] for t in active_memory)
    paged_in = ""
    # Crude page-fault detection: any long word from the query that appears
    # in a cold turn pulls that turn back into the prompt.
    for id_, turn in passive_memory.items():
        if any(word in turn.lower() for word in query.lower().split() if len(word) > 3):
            paged_in += f"\n(Paged in from Turn {id_}): {turn}"
    return f"### Active Memory (RAM):\n{context}\n### Paged-In from Disk:\n{paged_in}"

Pros: Clear separation of hot and cold data; scales to very long histories.
Cons: Requires reliable page‑fault detection; incorrect triggers can cause missed information or latency spikes.
Suitable for: Low‑latency assistants that still need occasional recall of distant conversation fragments.
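An illustrative run (the dialogue is made up): an early fact is paged out, then pulled back in by a matching query word:

add_message("My order number is 88231.", "Got it.")
add_message("Do you ship to Canada?", "Yes, within five days.")
add_message("Which payment methods do you accept?", "Cards and PayPal.")

# Turn 0 has been paged out to passive storage, but the word "order" in the
# query triggers a page-in, so the order number reappears in the context:
print(get_context("can you check on my order status"))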
Overall Comparison
The eight approaches trade off implementation simplicity, memory footprint, retrieval fidelity, and scalability. Full memory offers completeness at high cost; sliding windows are cheap but forgetful; relevance filtering and summarization keep important bits; vector stores and knowledge graphs provide semantic and structured recall; hierarchical and OS‑style schemes combine short‑term agility with long‑term depth.
In production LLM agents a hybrid solution—e.g., a sliding window plus a vector store or knowledge graph—usually delivers the best balance between performance and long‑term knowledge retention.
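As a sketch of such a hybrid (reusing the abstract embed and VectorStore pieces from section 5, so still pseudocode at the edges): keep recent turns verbatim in a window while indexing every turn for semantic recall:

from collections import deque

window = deque(maxlen=5)  # short-term: verbatim recent turns
store = VectorStore()     # long-term: semantic index over the full history

def add_message(user_input, ai_response):
    text = f"User: {user_input}\nAI: {ai_response}"
    window.append(text)
    store.add(embed(text), text)  # every turn stays retrievable

def get_context(query):
    related = store.search(embed(query), top_k=3)
    return ("[Retrieved memory]\n" + "\n".join(related) +
            "\n[Recent turns]\n" + "\n".join(window))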
Reference implementation (GitHub): https://github.com/FareedKhan-dev/optimize-ai-agent-memory