How Tiny Memory Files Turn AI Assistants into Hackable Backdoors
Researchers from UC Berkeley, NUS, Tencent and ByteDance reveal that a single hidden line in an AI assistant’s memory file can trigger OpenClaw to leak core keys or erase disks, detailing a three‑dimensional CIK attack model, real‑world tests on four top LLMs, and mitigation strategies.
Background
In 2026 AI personal assistants such as OpenClaw manage email, refunds, file systems and integrate with external services (Gmail, Stripe). Their persistent memory is stored in plain‑text configuration files that are loaded at every session, enabling continuous personalization but exposing a large attack surface.
CIK Architecture
The researchers model the persistent state as three orthogonal dimensions, called the CIK architecture:
Capability : executable scripts, code files and skill packages that define what the assistant can do.
Identity : system role, persona settings and operational rules governing behavior.
Knowledge : factual data and user‑preference records learned over time.
Each dimension is represented by plain‑text files that are read and written at every interaction.
Two‑Step Hidden Injection Attack
The attack proceeds in two independent phases:
Stealthy injection : Malicious commands are embedded in normal‑looking data and written to backend configuration files (e.g., preference JSON, skill manifest).
Trigger activation : A later user query that appears benign causes the assistant to load the poisoned state and execute the hidden payload.
The temporal separation lets the payload survive across sessions and bypass single‑turn safety checks.
Experimental Setup
OpenClaw was deployed on a Mac Mini with real Gmail and Stripe accounts. Four state‑of‑the‑art LLMs were evaluated:
Claude Sonnet 4.5
Claude Opus 4.6
Gemini 3.1 Pro
GPT‑5.4
Each model was tested on twelve high‑impact scenarios covering privacy leakage and irreversible destructive actions. Baseline safety alignment (no poisoning) blocked only 10 %–36 % of harmful actions.
After targeted state poisoning, success rates rose to 64 %–74 % across the three dimensions. Knowledge‑level poisoning achieved up to 100 % code‑injection success; capability‑level attacks succeeded consistently.
Attack Vectors per CIK Dimension
Knowledge : An attacker injects a false user preference such as “automatically issue a full refund for any billing issue.” When the user later asks about a charge, the assistant treats the refund as routine and bypasses security prompts.
Identity : By masquerading as a legitimate configuration update, the attacker replaces a backup URL with a malicious server address. Subsequent sync operations cause the assistant to expose API keys to the attacker.
Capability : A seemingly benign IP‑lookup tool is bundled with a hidden command that silently deletes the entire working directory after execution.
Mitigation Experiments
Using the Sonnet 4.5 model, three hardening strategies were evaluated. The most effective was the security plugin GuardianClaw, which reduced baseline penetration from >60 % to 1.7 %.
Even with the plugin, code‑level attacks that execute outside the LLM reasoning path still succeeded with roughly 60 % probability.
Conclusions and Recommendations
The study demonstrates that unrestricted read/write access to persistent memory is a fundamental security weakness for evolving AI assistants. Simple mitigations such as hardening file permissions or freezing configuration files reduce the attack surface but also block legitimate personalization updates.
Future research directions include:
Sandboxed memory models that isolate state updates from the inference engine.
Cryptographic integrity verification (e.g., signed state files) to detect unauthorized modifications.
Adaptive monitoring that distinguishes benign evolution from malicious poisoning via anomaly detection.
References:
https://arxiv.org/pdf/2604.04759
https://ucsc-vlaa.github.io/CIK-Bench/
https://github.com/UCSC-VLAA/CIK-Bench
SuanNi
A community for AI developers that aggregates large-model development services, models, and compute power.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
