How to Build Intelligent Contextual Memory for AI Agents

The article examines why naïvely feeding all dialogue history to large language models is costly and unreliable, and it walks through rolling context windows, inverted‑index pruning, semantic vector search, and GraphRAG as complementary techniques for creating efficient, reasoning‑capable AI agent memory.

DeepHub IMBA

Why Not Send All History to the LLM

Sending the entire conversation history with each request keeps the model’s attention mechanism simple and works for short, small‑scale tasks, but it quickly leads to problems: irrelevant context overwhelms the model, output becomes inconsistent, debugging is hard, token usage and cost rise sharply, and latency grows.

Rolling Context Window

A common alternative is to limit the number of recent interactions sent to the model, known as a rolling context window. By keeping only the last N messages, developers can control token consumption, reduce latency, and keep costs predictable. This works well when the task mainly depends on recent dialogue, but it discards older information that may be needed for accurate reasoning.

from typing import Annotated, List
from typing_extensions import TypedDict
from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages

WINDOW_SIZE = 6

# Wrapper around add_messages that keeps only the most recent messages
def rolling_add_messages(old, new):
    combined = add_messages(old, new)
    return combined[-WINDOW_SIZE:]

class AgentState(TypedDict):
    messages: Annotated[List[BaseMessage], rolling_add_messages]

Inverted Index for Smarter Context Pruning

An inverted index maps terms directly to the documents or messages where they appear, enabling fast full‑text retrieval.

Building an inverted index over chat history allows the system to retrieve only messages that contain query terms, dramatically shrinking the context sent to the LLM while preserving relevance, even for very early conversation turns.

from collections import defaultdict

def build_inverted_index(messages):
    # Map each lowercase token to the set of message indices containing it
    index = defaultdict(set)
    for i, msg in enumerate(messages):
        for word in msg.lower().split():
            index[word].add(i)
    return index

def search_context(query, index, messages):
    # Collect every message that contains at least one query term
    matched_ids = set()
    for word in query.lower().split():
        if word in index:
            matched_ids.update(index[word])
    # Sort indices to preserve chronological order of the conversation
    return [messages[i] for i in sorted(matched_ids)]

Semantic Context Retrieval

Semantic search interprets user intent and retrieves context based on meaning rather than exact keywords.

Sentences are embedded with a lightweight sentence-encoder model and stored as vectors in a FAISS index. At query time, the query is embedded the same way and compared by cosine similarity; a high score (e.g., above 0.9) indicates strong semantic relevance. With normalized embeddings, cosine similarity reduces to a simple inner product, which is what the index below computes.

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

sentences = ["schedule meeting tomorrow", "send project report", "prepare meeting agenda"]
model = SentenceTransformer("all-MiniLM-L6-v2")

# Normalize embeddings so inner product equals cosine similarity
embeddings = model.encode(sentences, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.array(embeddings, dtype="float32"))

query = "plan meeting"
query_vector = model.encode([query], normalize_embeddings=True)
scores, indices = index.search(np.array(query_vector, dtype="float32"), k=2)
print([sentences[i] for i in indices[0]])  # expected: the two meeting-related sentences

Semantic Experience Memory

Instead of treating each user query as a clean slate, the system can retrieve semantically similar past interactions and combine them with the recent rolling window. This hybrid context gives the agent a sense of user preferences and prior knowledge, producing more consistent and personalized responses.
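A minimal sketch of this hybrid assembly, with all names and messages invented for illustration. A toy bag-of-words embedding stands in for a real sentence encoder so the example stays self-contained; a production system would score past interactions with the same vector index shown above:

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words vector; a real system would use a sentence encoder
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_hybrid_context(query, history, window_size=3, top_k=1):
    # Rolling window: always keep the most recent messages
    recent = history[-window_size:]
    # Experience memory: rank older messages by semantic similarity to the query
    older = history[:-window_size]
    qv = embed(query)
    scored = sorted(older, key=lambda m: cosine(qv, embed(m)), reverse=True)
    # Retrieved experience first, then the recent window
    return scored[:top_k] + recent

history = [
    "user prefers meetings before noon",
    "sent the Q3 report to finance",
    "user asked to keep summaries short",
    "discussed release schedule",
    "fixed the login bug",
    "reviewed the deploy checklist",
]
context = build_hybrid_context("schedule meeting before noon", history)
```

Here the early preference "meetings before noon" falls outside the rolling window but is pulled back into context because it is semantically close to the query.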

GraphRAG

When relevant knowledge is scattered across many interactions or documents, pure semantic similarity may miss logically related details. GraphRAG represents information as a graph—nodes are documents or concepts, edges are relationships—enabling multi‑hop reasoning.

Typical steps to build a GraphRAG pipeline:

Identify nodes (entities or concepts).

Identify relationships between nodes.

Define the graph schema.

Populate the graph with data.
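To make these steps concrete, here is a hand-built toy graph in pure Python (not the Microsoft GraphRAG library): the schema is simply (subject, relation, object) triples, and a breadth-first traversal performs the multi-hop lookup that flat similarity search would miss. All entities and relations are invented for the example:

```python
from collections import deque

# Steps 1-2: entities (nodes) and labeled relationships (edges)
# Step 3: schema is (subject, relation, object) triples
# Step 4: populate the graph as an adjacency map
triples = [
    ("payment_1042", "processed_by", "gateway_A"),
    ("gateway_A", "depends_on", "fraud_service"),
    ("fraud_service", "degraded_by", "incident_77"),
    ("incident_77", "caused_by", "expired_certificate"),
]

graph = {}
for subj, rel, obj in triples:
    graph.setdefault(subj, []).append((rel, obj))

def multi_hop(start, goal):
    # BFS over labeled edges; returns the chain of triples linking start to goal
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None

path = multi_hop("payment_1042", "expired_certificate")
```

Answering "why was the payment delayed?" requires chaining four relations; no single node mentions both the payment and the expired certificate, which is exactly the case where similarity search alone falls short.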

Microsoft’s GraphRAG library automates entity and relation extraction and graph construction.

Step 1: Initialize the project

pip install graphrag
graphrag init --root ./graphrag_demo

Directory layout:

graphrag_demo/
├── input/
├── settings.yaml
└── prompts/
# Place source documents in input/

Step 2: Build the knowledge graph

graphrag index --root ./graphrag_demo

Step 3: Query the graph

graphrag query \
  --root ./graphrag_demo \
  --method global \
  --query "Why was the payment delayed?"

The system performs cross‑graph retrieval and supports multi‑hop inference, going beyond simple similarity matches.

Conclusion

Modern AI agents need more than keyword matching or isolated semantic similarity. Inverted indexes efficiently trim context, semantic retrieval preserves intent, and GraphRAG adds structured, relational knowledge for multi‑hop reasoning. Combining rolling windows, semantic similarity, experience memory, and graph‑based retrieval yields agents that not only recall information but also reason over it, delivering more complete, less hallucinatory, and more personalized responses.

by Mudassir Fazal

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
