Uncovering OpenClaw’s Memory Weaknesses and How RDSClaw’s Plugin Gains a 14% Accuracy Boost

This article dissects OpenClaw’s multi‑layer markdown‑based memory pipeline, highlights the instability caused by LLM‑driven write decisions and token‑threshold flushes, and then presents the RDSClaw memory plugin that adds deterministic extraction, real‑time CRUD integration, and vector‑based de‑duplication, resulting in a 13.9% overall accuracy improvement on the LoCoMo10 benchmark.

Alibaba Cloud Developer

OpenClaw Memory Overview

OpenClaw stores all persistent state as markdown files in the workspace. Core files such as AGENTS.md, SOUL.md, USER.md, IDENTITY.md, MEMORY.md, and the daily logs memory/YYYY-MM-DD.md are injected at session start in priority order. The system relies on two write paths: (1) agent‑initiated writes, triggered by explicit user commands or the agent's own judgment, and (2) Memory Flush, a safety net that writes to the daily log when token or file‑size thresholds are reached.

Both paths are governed by weak LLM constraints: the model decides whether to write, what to write, and in which format, without any structured extraction rules. Consequently, important facts may be omitted, especially in short conversations that never trigger a flush.

Uncertainty in the Pipeline

The pipeline includes three asynchronous "Dreaming" stages that promote short‑term diary entries to long‑term memory:

Light Sleep: scans daily logs, extracts candidate snippets (8–280 characters, up to 4 lines), and de‑duplicates them using Jaccard similarity (default threshold 0.9).

REM Sleep: performs thematic reflection and candidate‑truth selection using heuristic scores (frequency, relevance, diversity, recency, consolidation, conceptual richness).

Deep Sleep: applies a six‑dimensional statistical score to decide promotion to MEMORY.md. Promotion requires a score ≥ 0.80, at least three signal counts, and a diversity condition (max(uniqueQueries, recallDays) ≥ 3).
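The Deep Sleep promotion gate can be restated as a small predicate. A minimal sketch, assuming the three conditions are combined with AND; function and parameter names are illustrative, not the project's actual API:

```python
def should_promote(score: float, signal_count: int,
                   unique_queries: int, recall_days: int) -> bool:
    """Hypothetical restatement of the Deep Sleep promotion rule:
    score >= 0.80, at least three signals, and a diversity condition."""
    return (
        score >= 0.80
        and signal_count >= 3
        and max(unique_queries, recall_days) >= 3
    )
```

Note that a fact seen many times on a single day (recall_days = 1) can still pass via uniqueQueries, and vice versa, which is what the max() diversity condition buys.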

Key sources of instability:

LLM‑only write decisions (no deterministic rules).

Memory Flush only fires for long sessions, leaving short dialogs unrecorded.

Jaccard de‑duplication lacks semantic awareness.

Deep Sleep scoring relies solely on statistical signals, ignoring semantic importance.
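To see why token-level Jaccard misses semantic duplicates, consider a minimal sketch (illustrative, not OpenClaw's actual implementation): two paraphrases of the same fact share almost no tokens, so their similarity stays far below the 0.9 threshold and both copies survive.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity, the kind of overlap measure
    Light Sleep uses for de-duplication (default threshold 0.9)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

# Paraphrases of the same fact share almost no surface tokens:
a = "user prefers dark mode in the editor"
b = "the person likes a dark color scheme when coding"
print(jaccard(a, b))  # 2 shared tokens / 14 total ≈ 0.14, far below 0.9
```

An embedding-based cosine check, by contrast, would place these two sentences close together, which is exactly the gap the plugin's vector de-duplication closes.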

RDSClaw Memory Plugin Architecture

The openclaw-memory-alibaba-local plugin augments OpenClaw with a two‑stage real‑time pipeline that triggers on the agent_end hook after every turn. The workflow is:

Conversation → Extractor (LLM‑structured extraction) → Split into Personal (User) and World streams → Integrator (vector search + LLM CRUD decision: INSERT / UPDATE / SKIP / DELETE) → LanceDB (vector ANN + BM25 + scalar index)

Extraction produces up to five JSON objects per turn, each with category, text, and importance fields. A lightweight regex fallback can also capture tagged patterns such as "学习:" ("learning:"), "错误:" ("error:"), or "lesson:". After deduplication (cosine similarity ≥ 0.92), facts are stored in LanceDB and become immediately available during the before_prompt_build phase.
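The two mechanisms described above, the regex fallback and the cosine de-duplication gate, can be sketched roughly as follows; the field names, patterns, and plain-list embeddings are illustrative assumptions, not the plugin's actual code:

```python
import math
import re

def regex_fallback(turn: str) -> list[dict]:
    """Fallback extraction for tagged lines such as '学习: ...'
    ('learning:'), '错误: ...' ('error:'), or 'lesson: ...'."""
    facts = []
    for m in re.finditer(r"(?:学习|错误|lesson)[::]\s*(.+)", turn):
        facts.append({"category": "lesson",
                      "text": m.group(1).strip(),
                      "importance": 0.5})  # default weight; illustrative
    return facts

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def is_duplicate(new_vec: list[float],
                 stored_vecs: list[list[float]],
                 threshold: float = 0.92) -> bool:
    """A new fact is skipped when its embedding is near-identical
    to any stored one (cosine similarity >= 0.92)."""
    return any(cosine(new_vec, v) >= threshold for v in stored_vecs)
```

In the real plugin the embeddings come from the configured GGUF or DashScope model and the nearest-neighbor scan is done by LanceDB's ANN index rather than a linear loop.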

Key differences from the native system:

Write timing: deterministic, after every turn; no token‑threshold dependency.

Write method: LLM‑guided structured extraction instead of free‑form text.

Evolution: real‑time CRUD operations replace the three‑day Cron‑based Dreaming cycle.

Deduplication: vector similarity plus an LLM semantic check replaces Jaccard.

Conflict handling: explicit DELETE actions when contradictions are detected.

Recall: mixed vector + BM25 retrieval (memory_search) provides richer context.
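The article does not specify how the vector and BM25 result lists are merged; reciprocal-rank fusion is one common choice for combining two rankings, sketched here with illustrative names:

```python
def fuse_rankings(vector_hits: list[str], bm25_hits: list[str],
                  k: int = 60) -> list[str]:
    """Reciprocal-rank fusion: each document scores 1/(k + rank + 1)
    in every ranking that contains it; documents ranked well by both
    ANN and BM25 float to the top."""
    scores: dict[str, float] = {}
    for ranking in (vector_hits, bm25_hits):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Whatever the exact fusion strategy, the effect described in the article is the same: lexical matches (names, IDs, rare terms) and semantic matches reinforce each other at recall time.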

Evaluation on LoCoMo10

LoCoMo10 is a 10‑dialog benchmark covering fact queries, temporal reasoning, multi‑hop inference, and descriptive QA. The table below shows accuracy percentages for the native OpenClaw memory and the RDSClaw plugin, together with the absolute gain.

Category    Type          OpenClaw   RDSClaw   Gain
---------------------------------------------------
Category 1  Fact Query     34.04%    62.54%   +28.50%
Category 2  Temporal       57.01%    67.07%   +10.06%
Category 3  Inference      43.75%    65.35%   +21.60%
Category 4  Descriptive    68.37%    78.18%    +9.81%
Overall                    58.18%    72.08%   +13.90%

Highlights:

Fact queries saw the largest jump (+28.5%) because personal facts are now stored as evergreen entries and are never lost to statistical decay.

Inference tasks improved by over 21% thanks to mixed vector + BM25 recall and semantic de‑duplication.

The overall accuracy rose from 58.18% to 72.08% without changing the underlying LLM, demonstrating the impact of engineering‑level memory improvements.

Practical Recommendations

To adopt the plugin:

Install the openclaw-memory-alibaba-local package (included in RDSClaw).

Choose a local GGUF embedding model for offline use or configure a remote DashScope‑compatible API.

The plugin works out‑of‑the‑box for DingTalk, Feishu, and WeChat Work groups, providing a unified memory store.

Security measures automatically tag injected memories as "untrusted historical data" and filter sensitive patterns.

With these steps, developers can replace OpenClaw’s fragile LLM‑only memory with a deterministic, vector‑backed system that delivers measurable accuracy gains and more reliable long‑term recall.

Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.
