How Tiny Memory Files Turn AI Assistants into Hackable Backdoors

Researchers from UC Berkeley, NUS, Tencent and ByteDance reveal that a single hidden line in an AI assistant’s memory file can trigger OpenClaw to leak core keys or erase disks, detailing a three‑dimensional CIK attack model, real‑world tests on four top LLMs, and mitigation strategies.

SuanNi
SuanNi
SuanNi
How Tiny Memory Files Turn AI Assistants into Hackable Backdoors

Background

In 2026 AI personal assistants such as OpenClaw manage email, refunds, file systems and integrate with external services (Gmail, Stripe). Their persistent memory is stored in plain‑text configuration files that are loaded at every session, enabling continuous personalization but exposing a large attack surface.

CIK Architecture

The researchers model the persistent state as three orthogonal dimensions, called the CIK architecture:

Capability : executable scripts, code files and skill packages that define what the assistant can do.

Identity : system role, persona settings and operational rules governing behavior.

Knowledge : factual data and user‑preference records learned over time.

Each dimension is represented by plain‑text files that are read and written at every interaction.

CIK architecture diagram
CIK architecture diagram

Two‑Step Hidden Injection Attack

The attack proceeds in two independent phases:

Stealthy injection : Malicious commands are embedded in normal‑looking data and written to backend configuration files (e.g., preference JSON, skill manifest).

Trigger activation : A later user query that appears benign causes the assistant to load the poisoned state and execute the hidden payload.

The temporal separation lets the payload survive across sessions and bypass single‑turn safety checks.

Attack timeline
Attack timeline

Experimental Setup

OpenClaw was deployed on a Mac Mini with real Gmail and Stripe accounts. Four state‑of‑the‑art LLMs were evaluated:

Claude Sonnet 4.5

Claude Opus 4.6

Gemini 3.1 Pro

GPT‑5.4

Each model was tested on twelve high‑impact scenarios covering privacy leakage and irreversible destructive actions. Baseline safety alignment (no poisoning) blocked only 10 %–36 % of harmful actions.

After targeted state poisoning, success rates rose to 64 %–74 % across the three dimensions. Knowledge‑level poisoning achieved up to 100 % code‑injection success; capability‑level attacks succeeded consistently.

Benchmark results
Benchmark results

Attack Vectors per CIK Dimension

Knowledge : An attacker injects a false user preference such as “automatically issue a full refund for any billing issue.” When the user later asks about a charge, the assistant treats the refund as routine and bypasses security prompts.

Identity : By masquerading as a legitimate configuration update, the attacker replaces a backup URL with a malicious server address. Subsequent sync operations cause the assistant to expose API keys to the attacker.

Capability : A seemingly benign IP‑lookup tool is bundled with a hidden command that silently deletes the entire working directory after execution.

Dimension attack examples
Dimension attack examples

Mitigation Experiments

Using the Sonnet 4.5 model, three hardening strategies were evaluated. The most effective was the security plugin GuardianClaw, which reduced baseline penetration from >60 % to 1.7 %.

Even with the plugin, code‑level attacks that execute outside the LLM reasoning path still succeeded with roughly 60 % probability.

Mitigation results
Mitigation results

Conclusions and Recommendations

The study demonstrates that unrestricted read/write access to persistent memory is a fundamental security weakness for evolving AI assistants. Simple mitigations such as hardening file permissions or freezing configuration files reduce the attack surface but also block legitimate personalization updates.

Future research directions include:

Sandboxed memory models that isolate state updates from the inference engine.

Cryptographic integrity verification (e.g., signed state files) to detect unauthorized modifications.

Adaptive monitoring that distinguishes benign evolution from malicious poisoning via anomaly detection.

References:

https://arxiv.org/pdf/2604.04759

https://ucsc-vlaa.github.io/CIK-Bench/

https://github.com/UCSC-VLAA/CIK-Bench

AI securityvulnerability analysisCIK architecturememory injection
SuanNi
Written by

SuanNi

A community for AI developers that aggregates large-model development services, models, and compute power.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.