Artificial Intelligence 19 min read

Memory Mechanisms in Agent Harness: Current Landscape and Challenges

The article surveys memory mechanisms across major Agent Harness frameworks, classifies three memory types, evaluates each system’s implementation, highlights benchmark shortcomings, and presents Mem0 as a unified solution that overcomes capacity, retrieval, and isolation limitations.

AI Architecture Hub

Jun 5, 2026


1. Overview of Agent Harness and Memory

An Agent Harness (called an execution framework in the rest of this article) is the runtime environment in which AI software runs. Tools such as Cursor, Devin, Claude Code, and Codex rely on this environment for context management, tool scheduling, and agent collaboration, and increasingly for memory management. Memory is arguably the most difficult component of execution-framework design.

2. Three Forms of Memory

Working memory: information kept in the context window during a session and reset when the session ends; this also covers compression when the window overflows.

External memory: persistent data stored outside the model weights (vector stores, knowledge graphs, files); it survives across sessions and does not alter the weights. As of 2026 almost all production-grade memory takes this form.

Parameterized memory: knowledge encoded into the model weights via gradient descent during training; it relies on rule generalisation rather than instance retrieval. No production deployments have been reported as of 2026.

The paper "Contextual Agent Memory Is Only a Memo, Not Real Memory" (arXiv:2604.27707) establishes a theoretical limit: retrieval‑based memory needs Ω(k²) stored instances to match the effect of parameterised memory’s O(d) weight updates. All systems discussed are subject to this bound.

3. Memory Solutions in Mainstream Frameworks

1. Anthropic – Claude Code

Uses a dual-path approach:

CLAUDE.md: a hand-written configuration file loaded at session start.

Automatic memory: notes generated by the agent are stored under ~/.claude/projects/<repo>/memory/ with MEMORY.md as an index (max 200 lines / 25 KB) divided into user info, feedback, project, and reference categories.

Retrieval relies on a lightweight model that selects files by name and description, without vector embeddings, and loads at most five files per turn; excess files are silently dropped. Shortcomings: selection uses only file names and descriptions, there is no semantic search, team-wide sharing relies on a TEAMMEM marker, and the memory remains a local Markdown store.
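The index cap and the per-turn file limit described above can be sketched as follows. This is an illustration only: `truncate_index` and `pick_files` are hypothetical names, not Claude Code internals, and the word-overlap ranking is merely a stand-in for the lightweight selection model, whose actual behaviour is not described here.

```python
import re

MAX_LINES = 200          # index line cap cited in the article
MAX_BYTES = 25 * 1024    # index size cap cited in the article (25 KB)
MAX_FILES_PER_TURN = 5   # per-turn load limit cited in the article


def truncate_index(text: str) -> str:
    """Keep only the leading lines that fit both caps; the rest is dropped silently."""
    kept, size = [], 0
    for line in text.splitlines()[:MAX_LINES]:
        line_bytes = len(line.encode("utf-8")) + 1  # +1 for the newline
        if size + line_bytes > MAX_BYTES:
            break
        kept.append(line)
        size += line_bytes
    return "\n".join(kept)


def _words(text: str) -> set:
    """Lowercased alphanumeric tokens; no embeddings involved."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def pick_files(query: str, files: dict) -> list:
    """files maps file name -> description. Rank by word overlap (assumed heuristic)."""
    query_words = _words(query)

    def overlap(item):
        name, description = item
        return len(query_words & _words(f"{name} {description}"))

    ranked = sorted(files.items(), key=overlap, reverse=True)
    return [name for name, _ in ranked[:MAX_FILES_PER_TURN]]
```

Neither function reports what it discarded, which is the silent-truncation behaviour the shortcomings list refers to.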

2. Anthropic – Managed Agents

Sessions are stored as append‑only event logs, immutable and auditable. Memory is mounted at /mnt/memory/ (up to eight 100 KB units per workspace). All writes are versioned; multiple agents can share the same storage, preserving history instead of causing conflicts. Shortcomings: designed for workspace‑level collaboration, not long‑term personal memory; 100 KB per unit limits cross‑session personal context without additional development.
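To make the "versioned writes preserve history" idea concrete, here is a minimal in-memory sketch. The `MemoryUnit` and `Write` classes are hypothetical and do not reflect the Managed Agents API; only the 100 KB unit limit is taken from the text above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

UNIT_LIMIT = 100 * 1024  # 100 KB per unit, as cited in the article


@dataclass(frozen=True)
class Write:
    version: int
    agent: str
    content: str


@dataclass
class MemoryUnit:
    history: List[Write] = field(default_factory=list)

    def write(self, agent: str, content: str) -> int:
        """Append a new version; earlier versions are never overwritten."""
        if len(content.encode("utf-8")) > UNIT_LIMIT:
            raise ValueError("content exceeds the 100 KB unit limit")
        version = len(self.history) + 1
        self.history.append(Write(version, agent, content))
        return version

    def read(self, version: Optional[int] = None) -> str:
        """Return the latest content, or the content of a specific version."""
        if not self.history:
            return ""
        entry = self.history[-1] if version is None else self.history[version - 1]
        return entry.content
```

Because two agents writing to the same unit each produce a new version, a later write does not destroy an earlier one; both remain available for audit.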

3. OpenAI – Codex

Memory consists of Markdown files under ~/.codex/memories/ (no SQLite, no vector embeddings). The system prefers memory_summary.md, falls back to MEMORY.md, and optionally reads raw_memories.md, skills/, and rollout_summaries/ when features.memories is enabled.
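The file preference order above amounts to a simple fallback chain. A minimal sketch, assuming only the two file names mentioned in the text; `load_memory` is a hypothetical helper, not part of Codex:

```python
from pathlib import Path
from typing import Optional

# Preference order described in the article: summary first, then the full file.
PREFERRED = ("memory_summary.md", "MEMORY.md")


def load_memory(memory_dir: Path) -> Optional[str]:
    """Return the first preferred file that exists, or None if neither does."""
    for name in PREFERRED:
        candidate = memory_dir / name
        if candidate.is_file():
            return candidate.read_text(encoding="utf-8")
    return None
```

Calling `load_memory(Path.home() / ".codex" / "memories")` would read the directory named in the text, if it exists on the machine.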

The write process has two stages:

Extraction stage: once a session has been idle for six hours, Codex extracts information from it and sanitises keys, writing the result to a local state database (not yet to the memory directory).

Global merge stage: a merging sub-agent locks the store and consolidates, corrects, or deletes entries, producing diff records. The store is capped at 256 turns with a 30-day TTL and is subject to rate limits.
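The idle threshold and the TTL mentioned above reduce to two time comparisons. A minimal sketch, assuming timestamps are tracked per session and per entry; the function names are hypothetical and not part of Codex:

```python
from datetime import datetime, timedelta

IDLE_THRESHOLD = timedelta(hours=6)  # idle time before extraction, per the article
ENTRY_TTL = timedelta(days=30)       # entry lifetime, per the article


def ready_for_extraction(last_activity: datetime, now: datetime) -> bool:
    """Extraction runs only once the session has been idle long enough."""
    return now - last_activity >= IDLE_THRESHOLD


def is_expired(created: datetime, now: datetime) -> bool:
    """Entries older than the TTL are candidates for removal."""
    return now - created > ENTRY_TTL
```

Under these assumptions, a session that is touched every few hours never satisfies the first check, so nothing from it reaches the merge stage.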

Retrieval lacks semantics: summaries are truncated to 5 000 tokens, and the remaining data is matched by plain text substring search. Shortcomings: silent truncation, no semantic retrieval, six‑hour idle threshold can prevent merging, local‑only storage, and no EU/UK/Switzerland availability.

4. GitHub – Copilot

Features “instant citation verification”. Memory items are structured objects (topic, factual content, file-line reference, reasoning logic). Before acting on a memory, the agent checks it against the current code branch and automatically rewrites memories that conflict with it. Memories expire after 28 days.
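A citation-backed memory item and its validity check might look like the sketch below. The field names, the `is_valid` function, and the in-memory `repo` dictionary (file path mapped to its lines) are assumptions for illustration; Copilot's real schema and verification logic are not published in this article. The sketch only detects stale memories and does not attempt the automatic rewrite.

```python
from dataclasses import dataclass
from datetime import date, timedelta

EXPIRY = timedelta(days=28)  # expiry window cited in the article


@dataclass
class MemoryItem:
    topic: str
    fact: str
    path: str        # file the fact was derived from
    line: int        # 1-based line number of the cited code
    cited_text: str  # what that line contained when the memory was written
    reasoning: str
    created: date


def is_valid(item: MemoryItem, repo: dict, today: date) -> bool:
    """A memory is usable only if it is fresh and its citation still matches."""
    if today - item.created > EXPIRY:
        return False
    lines = repo.get(item.path, [])
    if not 1 <= item.line <= len(lines):
        return False  # the cited line no longer exists
    return lines[item.line - 1].strip() == item.cited_text.strip()
```

The check also shows why a preference with no file and line to cite cannot be represented in such a schema.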

A published A/B test shows a statistically significant lift (p < 0.00001): the PR merge rate rises from 83 % to 90 %, code-review precision improves by 3 %, and recall by 4 %. This is the only publicly disclosed production-grade metric of memory effectiveness.

Shortcoming: the citation schema cannot represent fact‑free preferences (e.g., “prefer minimal abstraction”) and is limited to a single code repository.

5. OpenClaw

Native memory stores selected MEMORY.md files and date‑archived logs under ~/.openclaw/workspace/. Each agent has a dedicated SQLite index and supports hybrid retrieval (≈70 % vector, 30 % BM25), providing semantic search out of the box.
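A 70/30 blend of vector and BM25 scores can be sketched as a weighted sum. The normalisation step (dividing BM25 by the batch maximum) and the function names are assumptions; the article gives only the approximate weights, not how OpenClaw combines the two signals.

```python
def hybrid_score(vector_sim: float, bm25: float, bm25_max: float,
                 w_vector: float = 0.7, w_bm25: float = 0.3) -> float:
    """Blend a similarity in [0, 1] with a BM25 score normalised by the batch maximum."""
    bm25_norm = bm25 / bm25_max if bm25_max > 0 else 0.0
    return w_vector * vector_sim + w_bm25 * bm25_norm


def rank(candidates: dict) -> list:
    """candidates maps document id -> (vector similarity, raw BM25 score)."""
    bm25_max = max((bm25 for _, bm25 in candidates.values()), default=0.0)
    return sorted(
        candidates,
        key=lambda doc_id: hybrid_score(*candidates[doc_id], bm25_max),
        reverse=True,
    )
```

With these weights a document with no keyword overlap can still rank first if its vector similarity is high enough, which is what gives the hybrid its semantic behaviour.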

Shortcoming: persistence logic is inconsistent. When the context window overflows, the system silently triggers an internal turn in which the model decides what to write, so the selection of what enters long-term memory is not uniform.

6. NousResearch – Hermes

Provides three memory layers plus eight plug‑in extensions:

Working memory: MEMORY.md (2 200 characters) + USER.md (1 375 characters), ~1 300 tokens in total, segmented with usage metrics; entries are auto-merged at 80 % capacity. Writes are stored locally, and a snapshot of the state is placed in the prompt for the next session.

Skill memory: process documents generated after tasks with five or more tool calls, periodically curated.

Session retrieval: full-session search via SQLite FTS5, with on-demand summarisation.

Shortcomings: very low persistent capacity (~800 tokens for MEMORY.md, ~1 300 including USER.md); FTS5 supports only keyword search (e.g., a query for “429 error” won’t match “rate limit”); local-only storage. Integration with Mem0 can lift the capacity limits, add semantic retrieval, and provide user-ID isolation.
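The keyword-only limitation is easy to reproduce with SQLite itself. The sketch below assumes a Python build whose bundled SQLite has the FTS5 extension enabled; the table name and sample rows are invented for the demonstration and have nothing to do with Hermes' actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE sessions USING fts5(content)")
conn.executemany(
    "INSERT INTO sessions (content) VALUES (?)",
    [
        ("Upload failed with a 429 error from the API",),
        ("Requests are throttled by the provider's rate limit",),
    ],
)


def search(query: str) -> list:
    """Full-text search; every query term must appear literally in the row."""
    rows = conn.execute(
        "SELECT content FROM sessions WHERE sessions MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]
```

Here `search("429 error")` returns only the first row and `search("rate limit")` only the second, even though both rows describe the same underlying problem. Bridging that gap requires embeddings or some other semantic signal.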

7. AWS – Bedrock AgentCore

Cloud-hosted agent platform whose Runtime corresponds to an execution-framework layer (similar to Anthropic Managed Agents). Memory is a managed service offering three asynchronous extraction strategies (semantic facts, preferences, narrative summaries). Extraction takes ~20–40 s and retrieval ~200 ms. Facts that change are marked invalid rather than deleted, which preserves provenance. Public benchmark scores: LoCoMo 70.58, PrefEval 79, PolyBench-QA 83.02.

Shortcoming: tightly bound to the AWS ecosystem; published LoCoMo scores are far below leading memory systems.

8. Windsurf

Memory is automatically generated and managed by the Cascade engine, stored as local files under ~/.codeium/windsurf/memories/, recording code‑base patterns and development conventions.

Shortcoming: memory content is decided by the engine, not the developer; only workspace‑level isolation, no cross‑device or team sharing.

9. Cognition – Devin

Memory is split into two categories:

Knowledge memory: hand-picked trigger facts (no automatic capture).

DeepWiki: reference documents (≈30 pages, 100 notes, each ≤10 000 characters).

After a session Devin recommends items for storage; they are persisted only after manual review. Shortcomings: review adds overhead, and teams that skip review accumulate no memory; capacity is conservative; knowledge memory is tailored to Devin and not portable.

4. Benchmark Shortcomings

Common industry benchmarks have severe limitations. LoCoMo contains only 10 conversations, which makes comparisons unreliable, and many of its questions can be answered by simple text matching, which inflates scores. LongMemEval expands to 500 curated questions covering five abilities (information extraction, cross-session reasoning, temporal reasoning, knowledge updates, abstention) over up to 1.5 M tokens, which makes it more useful in practice.

However, existing benchmarks do not test dimensions highlighted by recent research:

MemoryArena (arXiv:2602.16313) stresses actual memory behaviour; systems that saturate LoCoMo and LongMemEval fail here.

"Agent Memory Analysis" (arXiv:2602.19320) shows that current scores saturate and only measure similarity, not task utility.

Production‑scale testing is also missing: standard benchmarks cap at ~1.5 M tokens, while production agents exceed 10 M tokens; only the BEAM benchmark (ICLR 2026) scales to that magnitude.

5. Common Defect Patterns Across Frameworks

Storage capacity limits and local‑only deployment (e.g., Claude Code 25 KB, Hermes 2 200 characters, Codex 5 000‑token load limit).

Predominantly keyword‑based retrieval; only OpenClaw and AgentCore provide semantic search, the former with local constraints, the latter as a cloud service.

Memory tightly bound to the execution framework, preventing reuse across frameworks (e.g., Claude Code memory cannot be reused in Codex).

Almost no expiration handling (Copilot is the sole exception).

Post‑hoc isolation mechanisms cause frequent information‑pollution incidents.

These are inherent limitations of current execution‑framework boundaries.

6. Mem0 – A Cross‑Framework Memory Infrastructure

Mem0 is designed to break framework boundaries. It adopts a hybrid architecture: a vector store for semantic retrieval, a knowledge‑graph layer for relational reasoning, and a key‑value store for fast metadata access.

Version 3 (released April 2026) introduces a single‑pass extraction pipeline, multi‑signal retrieval (semantic + BM25 + entity linking) completed in one round, and embeds entity links into the vector store, removing the external graph database used in v2.

Performance: a single query processes ~6 900 tokens in 1.44 s, whereas the full-context approach processes ~26 000 tokens in 17.12 s. That is roughly 12× lower latency (17.12 / 1.44 ≈ 11.9) with about 3.8× fewer tokens (26 000 / 6 900 ≈ 3.8).

Mem0 addresses the common defects:

Unlimited external storage capacity.

Multi‑signal retrieval that finds relevant history even when phrased differently.

Identity‑based isolation that eliminates 57 %–71 % of cross‑user information‑pollution.
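Identity-based isolation means every read and write is scoped by a user ID, so one user's memories can never be returned for another. The sketch below is a toy in-memory illustration of that principle, not Mem0's API: `ScopedMemory` is a hypothetical class, and the substring match is a placeholder for real multi-signal retrieval.

```python
from collections import defaultdict


class ScopedMemory:
    """Toy store in which each user ID has its own isolated list of memories."""

    def __init__(self) -> None:
        self._by_user = defaultdict(list)

    def add(self, user_id: str, text: str) -> None:
        self._by_user[user_id].append(text)

    def search(self, user_id: str, query: str) -> list:
        needle = query.lower()
        # .get avoids creating empty entries for unknown users.
        return [m for m in self._by_user.get(user_id, []) if needle in m.lower()]
```

Because the scope is applied before matching rather than filtered afterwards, a query that matches another user's memory still returns nothing for the caller.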

Integration: adapters for all surveyed frameworks (Claude Code plugin, Codex MCP service, Hermes & OpenClaw native extensions, AWS Strands native integration). Compatible with 21 frameworks and 20 vector‑store backends, making memory a portable infrastructure rather than a framework‑specific add‑on.

7. Industry Summary

Memory has become a core infrastructure for AI agents; all major execution frameworks now implement some form of memory, yet they share fundamental bottlenecks: limited local capacity, keyword‑centric retrieval, tight framework coupling, weak expiration policies, and isolation flaws. Benchmarking practices are immature, and production‑scale tests are scarce.

Mem0 aims to fill these gaps by providing a migratable, semantically searchable, cross‑agent, production‑scale memory system.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
