How to Build a Reliable Long-Term Memory System for AI Agents
Designing robust long-term memory for AI agents means separating the model's context window from persistent storage: plain markdown files on disk, a pre-compaction flush, hybrid vector + BM25 retrieval, session pruning, and rebuildable SQLite indexes. The result is recall that is explainable, editable, and portable, without context bloat or security leaks.
Opening: the hard parts of AI memory are retrieval and preservation
The difficulty of AI memory is not storing data but reliably retrieving it and keeping it safe across long conversations. In Clawdbot, the design focuses on making long-term memory stable enough that an agent remains trustworthy after dozens of interactions.
TL;DR
Context: The bounded, expensive token window the model sees in a single request.
Memory: Persistent, unbounded markdown files on disk that can be versioned.
Memory files live under memory/YYYY-MM-DD.md (daily logs) and optionally MEMORY.md (selected long‑term facts).
At session start the agent loads today’s and yesterday’s daily files; MEMORY.md is loaded only for private sessions, never for group chats.
Writes use the ordinary memory_write API, making the memory auditable rather than a black box.
Pre‑compaction flush is the gate that writes pending data to durable memory before automatic compression.
Flushes are silent (using NO_REPLY), so users do not see the intermediate output.
Retrieval tools: memory_search (semantic + BM25 hybrid) returns snippets with file path, line range, score, and provider info; memory_get reads specific markdown files with startLine and lines parameters.
Hybrid search combines vector similarity (weight 0.7) and BM25 keyword relevance (weight 0.3) using a simple weighted sum.
Each agent has a rebuildable SQLite index stored at ~/.clawdbot/memory/<agentId>.sqlite; any change in embedder, model, endpoint fingerprint, or chunk parameters triggers automatic rebuild.
Session pruning removes old tool results before each LLM call, preventing tool output from drowning important context. Soft trimming keeps head/tail of large results; hard clearing replaces the whole result with a placeholder.
1 | Separate Context and Memory
Context includes all content sent to the model (system prompts, dialogue history, tool results, attachments) and is limited by the model’s token window. Memory is any content persisted on disk that can be re‑loaded or retrieved later.
Useful debugging commands:
/status – overview of the session and window usage.
/context list – list injected items and their sizes.
/context detail – detailed breakdown of files, skill lists, and tool-schema overhead.
/usage tokens – append token usage to the reply footer.
/compact – manually trigger compression.
These commands make the model’s view of the context observable, which is essential for tuning the memory system.
2 | File‑System Memory Layout
Clawdbot stores memory in two layers:
memory/YYYY-MM-DD.md – daily append-only logs; the agent reads today's and yesterday's files at session start.
MEMORY.md (optional) – curated long-term facts; loaded only for private sessions, never for group chats, preventing personal-context leakage.
Writing strategy mirrors this split:
Decisions, preferences, and durable facts go to MEMORY.md.
Routine conversational context goes to the daily file.
When a user says “remember this”, the note is flushed to disk immediately; mental notes that stay only in the model are lost after a restart.
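As a minimal sketch of this split (TypeScript/Node; appendDailyNote and writeLongTermFact are illustrative names, not Clawdbot's actual memory_write API):

import { appendFileSync, mkdirSync } from "node:fs";
import { join } from "node:path";

const MEMORY_DIR = "memory";

// memory/YYYY-MM-DD.md for the current (UTC) day.
function dailyFilePath(date = new Date()): string {
  return join(MEMORY_DIR, `${date.toISOString().slice(0, 10)}.md`);
}

// Routine conversational context goes to the append-only daily log.
function appendDailyNote(note: string): void {
  mkdirSync(MEMORY_DIR, { recursive: true });
  appendFileSync(dailyFilePath(), `- ${note}\n`, "utf8");
}

// Durable decisions, preferences, and facts go to the curated MEMORY.md.
function writeLongTermFact(fact: string): void {
  appendFileSync("MEMORY.md", `- ${fact}\n`, "utf8");
}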
3 | Pre‑Compaction Flush (Memory Flush)
Long conversations inevitably hit the context limit, so compression summarises old dialogue into compact JSONL entries. Before automatic compression, Clawdbot performs a silent “memory flush” that writes any pending durable information to the markdown files.
Configuration example (JSON‑like):
{
  agents: {
    defaults: {
      compaction: {
        reserveTokensFloor: 20000,
        memoryFlush: {
          enabled: true,
          softThresholdTokens: 4000,
          systemPrompt: "Session nearing compaction. Store durable memories now.",
          prompt: "Write any lasting notes to memory/YYYY-MM-DD.md; reply with NO_REPLY if nothing to store."
        }
      }
    }
  }
}

Key points:
The flush triggers when estimated tokens exceed contextWindow - reserveTokensFloor - softThresholdTokens.
Only one flush per compression cycle to avoid spamming.
If the workspace is read-only (workspaceAccess: "ro" or "none"), the flush is skipped.
When the model’s reply starts with NO_REPLY, Clawdbot suppresses both the final message and any streaming output.
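Put together, the gate might look like this sketch (the config fields mirror the example above; shouldFlush and its signature are assumptions):

interface FlushConfig {
  contextWindow: number;
  reserveTokensFloor: number;
  softThresholdTokens: number;
}

function shouldFlush(
  estimatedTokens: number,
  cfg: FlushConfig,
  alreadyFlushedThisCycle: boolean,
  workspaceAccess: "rw" | "ro" | "none",
): boolean {
  if (alreadyFlushedThisCycle) return false; // one flush per compression cycle
  if (workspaceAccess !== "rw") return false; // skip read-only workspaces
  // Trigger once estimated tokens cross the soft threshold before the floor.
  return estimatedTokens >
    cfg.contextWindow - cfg.reserveTokensFloor - cfg.softThresholdTokens;
}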
4 | Hybrid Retrieval
Pure vector search struggles with high‑signal tokens, while BM25 excels at exact matches. Clawdbot blends both:
Vector similarity provides semantic recall.
BM25 (FTS5) ensures precise token hits.
Hybrid scoring formula (default weights 0.7 / 0.3):
finalScore = vectorWeight * vectorScore + textWeight * textScore

Implementation steps:
Fetch maxResults × candidateMultiplier candidates from each side.
Convert BM25 rank to a 0‑1 score: textScore = 1 / (1 + max(0, bm25Rank)).
Merge candidates by block ID and apply the weighted sum.
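A sketch of that merge (the Candidate shape is an assumption; the weights, rank conversion, and weighted sum follow the formulas above):

interface Candidate {
  blockId: string;
  vectorScore?: number; // cosine similarity, 0-1
  bm25Rank?: number;    // FTS5 rank, 0 = best
}

function hybridMerge(
  vectorHits: Candidate[],
  textHits: Candidate[],
  vectorWeight = 0.7,
  textWeight = 0.3,
): { blockId: string; score: number }[] {
  // Merge candidates from both sides by block ID.
  const byId = new Map<string, Candidate>();
  for (const c of [...vectorHits, ...textHits]) {
    byId.set(c.blockId, { ...byId.get(c.blockId), ...c });
  }
  return [...byId.values()]
    .map((c) => {
      // textScore = 1 / (1 + max(0, bm25Rank)); rank 0 maps to 1.0.
      const textScore =
        c.bm25Rank === undefined ? 0 : 1 / (1 + Math.max(0, c.bm25Rank));
      return {
        blockId: c.blockId,
        score: vectorWeight * (c.vectorScore ?? 0) + textWeight * textScore,
      };
    })
    .sort((a, b) => b.score - a.score);
}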
Configuration example:
{
  agents: {
    defaults: {
      memorySearch: {
        query: {
          hybrid: {
            enabled: true,
            vectorWeight: 0.7,
            textWeight: 0.3,
            candidateMultiplier: 4
          }
        }
      }
    }
  }
}

If full-text search is unavailable, the system falls back to pure vector search; if the sqlite-vec extension is missing, it falls back to an in-process cosine similarity.
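That last fallback can be as small as a hand-rolled cosine similarity over raw embedding vectors, for example:

// In-process cosine similarity; used only when sqlite-vec is unavailable.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1); // guard zero vectors
}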
5 | Index Management
Each agent has a dedicated SQLite index stored at ~/.clawdbot/memory/<agentId>.sqlite. The index is derived data and can be rebuilt automatically whenever any of the following changes:
Embedding provider
Model
Endpoint fingerprint
Chunking parameters
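One way to detect this (a sketch; the field names are illustrative, not Clawdbot's internal schema) is to hash exactly those parameters into a fingerprint stored alongside the index:

import { createHash } from "node:crypto";

interface IndexParams {
  provider: string;     // embedding provider
  model: string;        // embedding model
  endpoint: string;     // endpoint fingerprint
  chunkSize: number;    // chunking parameters
  chunkOverlap: number;
}

// Note: relies on stable key order; a real implementation should
// serialize fields in a canonical order.
function indexFingerprint(p: IndexParams): string {
  return createHash("sha256").update(JSON.stringify(p)).digest("hex");
}

function needsRebuild(current: IndexParams, stored: string): boolean {
  return indexFingerprint(current) !== stored;
}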
Embedding provider selection order:
Local model path exists → local
OpenAI key available → openai
Gemini key available → gemini
Otherwise, memory search is disabled until configuration is provided.
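A sketch of that cascade (the environment-variable names are assumptions):

import { existsSync } from "node:fs";

type EmbeddingProvider = "local" | "openai" | "gemini" | null;

function selectEmbeddingProvider(localModelPath: string): EmbeddingProvider {
  if (existsSync(localModelPath)) return "local";
  if (process.env.OPENAI_API_KEY) return "openai";  // assumed env var
  if (process.env.GEMINI_API_KEY) return "gemini";  // assumed env var
  return null; // memory search stays disabled until configured
}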
For offline-first scenarios the default local embedder is hf:ggml-org/embeddinggemma-300M-GGUF/embeddinggemma-300M-Q8_0.gguf (≈0.6 GB). The node-llama-cpp runtime may require a one-time pnpm approve-builds step.
6 | Session Pruning (Tool Result Trimming)
Beyond compression, tool results (web fetches, exec outputs, large files) can silently inflate the context. Session pruning runs before each LLM call and removes stale tool output while leaving the on‑disk *.jsonl history untouched.
Pruning modes:
Soft trim: For oversized tool results, keep the head and tail with an ellipsis (...) and skip image blocks.
Hard clear: Replace the entire result with a placeholder such as [Old tool result content cleared].
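Both modes reduce to simple string operations, roughly (a sketch; 4000 and the placeholder string are the defaults listed below):

// Soft trim: keep the head and tail of an oversized result.
function softTrim(text: string, maxChars = 4000): string {
  if (text.length <= maxChars) return text;
  const half = Math.floor(maxChars / 2);
  return text.slice(0, half) + "\n...\n" + text.slice(text.length - half);
}

// Hard clear: replace the whole result with a placeholder.
function hardClear(): string {
  return "[Old tool result content cleared]";
}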
The default strategy uses a TTL-aware mode (cache-ttl) that only prunes when the last Anthropic call is older than the configured TTL (e.g., 5m). Example configuration:
{
  agent: {
    contextPruning: { mode: "cache-ttl", ttl: "5m" }
  }
}

Additional defaults:
keepLastAssistants: 3 – protect recent assistant messages.
minPrunableToolChars: 50000
softTrim.maxChars: 4000
hardClear.placeholder: "[Old tool result content cleared]"

7 | Diagram: Memory Closed Loop
8 | Minimal Viable File‑Based Memory System
Even without Clawdbot, the same principles can be applied:
Define memory files (memory/YYYY-MM-DD.md for daily logs, MEMORY.md for curated facts) and treat them as versionable assets.
Include a pre‑compaction flush step in any write path (explicit “remember this”, near‑compression threshold, or session end).
Design retrieval to return only fragments with file+line references and restrict reads to a whitelist of memory paths (see the path-guard sketch after this list).
Prefer hybrid search (vector + BM25) with simple, explainable weighting.
Make the index rebuildable (canonical markdown → derived SQLite/vector index).
Treat tool‑output bloat as part of the memory system and apply compression + pruning together.
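For the whitelist in step 3, a path guard can be as simple as the following sketch (assuming the layout above: a memory/ directory plus a root-level MEMORY.md):

import { resolve, sep } from "node:path";

const ALLOWED = [resolve("memory"), resolve("MEMORY.md")];

function isAllowedMemoryPath(requested: string): boolean {
  const abs = resolve(requested); // normalize ../ traversal attempts
  return ALLOWED.some((root) => abs === root || abs.startsWith(root + sep));
}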
Conclusion
The key takeaway is that a practical AI memory system should be file‑based, editable, and auditable; recall should be performed by targeted tool retrieval; long conversations rely on a “pre‑compaction flush” to guarantee durability; and tool noise is mitigated by session pruning. Stability, clear boundaries, and explainability outweigh theoretical perfection.