Unlocking AI Agent Memory: How LLMs Use Retrieval and Planning to Stay Smart
This article explains the core architecture of AI agents powered by large language models, detailing how planning, short‑term and long‑term memory, and tool integration work together through vector databases, retrieval‑augmented generation, and summarization to enable stateful, intelligent interactions across multiple sessions.
Agent Core Architecture Overview
An AI Agent uses a large language model (LLM) as its brain and combines planning, memory, tool usage, and feedback loops into a cohesive system that can understand tasks, generate responses, and maintain state across interactions.
Brain (LLM): Handles task understanding, planning, decision‑making, and generation. Implemented with base LLMs, domain‑fine‑tuned models, and prompt engineering.
Planning: Decomposes goals, builds task chains, and creates strategies using chain‑of‑thought reasoning, task decomposition, and self‑reflection.
Memory: Provides state persistence and historical reference. Includes short‑term memory (conversation history) and long‑term memory (vector‑based retrieval).
Tools: Extends the agent’s capabilities by calling APIs, executing code, searching the web, or invoking custom functions.
The architecture operates in a continuous loop where each component feeds into the next, as illustrated in the diagrams below.
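As a rough sketch, the loop above can be expressed in Python. The `call_llm`, `tools`, and `memory` objects here are hypothetical stand-ins for illustration, not part of any specific framework:

```python
# Minimal agent loop sketch: plan -> act -> observe -> remember.
# `call_llm`, `tools`, and `memory` are hypothetical placeholders.

def agent_loop(task, call_llm, tools, memory, max_steps=5):
    """Run the core loop: the LLM plans, optionally calls a tool,
    and each observation is fed back into memory for the next step."""
    for _ in range(max_steps):
        context = memory.recall(task)                  # short- + long-term context
        decision = call_llm(task=task, context=context)
        if decision["action"] == "final_answer":
            memory.store(task, decision["content"])
            return decision["content"]
        tool = tools[decision["action"]]               # e.g. "search", "code_exec"
        observation = tool(decision["input"])
        memory.store(decision["action"], observation)  # feedback loop
    return "Step budget exhausted."
```

Each pass through the loop is one plan/act/observe cycle; the memory component is what makes the next cycle smarter than the last.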
Memory Module Details
Key Implementation Techniques and Categories
Short‑Term Memory: Stores the recent conversation window by concatenating the last few turns (user messages, agent thoughts, tool calls, and results) directly into the next prompt.
Long‑Term Memory: Uses Retrieval‑Augmented Generation (RAG). Conversation snippets are embedded with an embedding model (e.g., text‑embedding‑ada‑002, BGE, M3E) and stored in a vector database (e.g., Pinecone, Chroma, Milvus, Qdrant). Retrieval is performed via similarity search and injected into the prompt.
Memory Summarization: Periodically summarizes lengthy dialogues using the LLM, compressing detailed short‑term memory into concise long‑term memory points stored in the vector DB.
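A minimal sketch of the long‑term memory path: in production the embedding would come from a model such as text‑embedding‑ada‑002 or BGE and the store would be a vector database such as Pinecone or Chroma; here a toy bag‑of‑words vector and an in‑memory list stand in, purely to show the embed‑store‑retrieve shape:

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class LongTermMemory:
    def __init__(self):
        self.store = []                       # list of (text, vector) pairs

    def add(self, snippet):
        self.store.append((snippet, embed(snippet)))

    def retrieve(self, query, k=2):
        # Similarity search: rank stored snippets against the query vector.
        qv = embed(query)
        ranked = sorted(self.store, key=lambda s: cosine(qv, s[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

The retrieved top‑k snippets are what get injected into the prompt as `memory_context`.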
The table below compares short‑term and long‑term memory characteristics:

| Aspect | Short‑Term Memory | Long‑Term Memory |
| --- | --- | --- |
| Nature | Linear and recent | Semantic and cross‑temporal |
| Form | Raw dialogue list, in order | Retrieved relevant fragments, regardless of order |
| Content | Most recent verbatim exchanges | Most relevant historical fragments |
| Technology | Simple list concatenation | Vector embeddings and similarity search |
| Purpose | Maintains conversational continuity | Provides historical insight for smarter reasoning |
An example workflow for a multi‑day user request shows how the agent retrieves relevant past dialogues from long‑term memory and combines them with the current short‑term context to generate a comprehensive report.
f"""
{system_prompt}
"""
# 以下是从长期记忆中检索到的相关历史对话(memory_context):
相关记录1: [用户: 请显示第一季度各产品类别的销售额。 / Agent: ...]
相关记录2: [用户: 哪个区域的电子产品销售最好? / Agent: ...]
# 以下是当前对话的短期历史(conversation_history):
[当前对话历史为空或只有问候]
用户: 为我们最好的产品类别生成一个年度报告。
助手:
"""Overall Memory Architecture and Technical Solutions
Conversation History Management
The conversation_history list grows indefinitely in memory, so the system uses a fixed‑size deque to keep only the most recent turns, persists older history to an SQLite database, and periodically summarizes and archives it.
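The deque‑plus‑SQLite strategy described above can be sketched as follows; table and column names here are illustrative, not taken from any specific implementation:

```python
import sqlite3
from collections import deque

class ConversationHistory:
    """Keep the last N turns in a fixed-size deque and spill
    older turns into SQLite before the deque evicts them."""

    def __init__(self, db_path=":memory:", window=4):
        self.recent = deque(maxlen=window)
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS archive (turn_no INTEGER, role TEXT, content TEXT)"
        )
        self.turn_no = 0

    def append(self, role, content):
        self.turn_no += 1
        if len(self.recent) == self.recent.maxlen:
            old = self.recent[0]              # about to be evicted from the window
            self.db.execute("INSERT INTO archive VALUES (?, ?, ?)", old)
        self.recent.append((self.turn_no, role, content))

    def archived_count(self):
        return self.db.execute("SELECT COUNT(*) FROM archive").fetchone()[0]
```

A summarization pass (not shown) would periodically compress the `archive` table into long‑term memory points.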
Memory Context System
To prevent unbounded growth in the vector store, the EnhancedAgentMemory class assigns importance scores to each memory fragment, applies eviction policies, merges highly similar fragments, and supports session‑aware retrieval.
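The article does not show the internals of `EnhancedAgentMemory`; the sketch below illustrates one plausible importance‑scored eviction policy, in which each fragment carries a score, retrieval hits boost it, and the lowest‑scoring fragment is evicted once the store exceeds a capacity cap:

```python
class EnhancedAgentMemory:
    """Illustrative importance-scored memory store with capacity-based eviction."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.fragments = {}                   # fragment text -> importance score

    def add(self, fragment, importance=1.0):
        self.fragments[fragment] = importance
        if len(self.fragments) > self.capacity:
            self._evict()

    def touch(self, fragment, boost=0.5):
        # Retrieval hits make a fragment more important, protecting it from eviction.
        if fragment in self.fragments:
            self.fragments[fragment] += boost

    def _evict(self):
        # Drop the least important fragment.
        victim = min(self.fragments, key=self.fragments.get)
        del self.fragments[victim]
```

Merging of highly similar fragments and session‑aware retrieval would layer on top of this same score‑keeping structure.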
Key Optimizations
Control memory growth with external storage and fixed‑size queues.
Improve storage efficiency via summarization, merging, and eviction.
Prevent data pile‑up by cleaning temporary files and old data.
Monitor system health in real time and handle anomalies automatically.
These optimizations ensure that the AI Agent can handle high‑frequency user requests and remain stable over long periods.
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.