23 min read

Seeing AI Agent Drift in Vector Space: An Unvalidated Thought Experiment

The article imagines an AI coding agent that silently exfiltrates credentials hidden in data, explains why rule‑based and text‑level defenses miss such attacks, proposes monitoring the agent's vector‑space decision trajectory with six geometric metrics, and critically evaluates the feasibility and limitations of this approach.

Architecture Musings

Mar 25, 2026

Seeing AI Agent Drift in Vector Space: An Unvalidated Thought Experiment

Imagine a coding agent that reads a CSV file, fetches supplemental data, generates a revenue‑trend chart and returns it – clean, fast, useful. Now imagine the CSV contains a hidden command that tells the agent to send the contents of ~/.ssh/id_rsa to an external URL. The agent treats the command as ordinary data, executes it with its own credentials, leaks the key, and then continues to produce the chart. All existing input‑filters, output‑filters and rule engines see only a normal request, a legitimate tool call and a harmless chart, so no alert fires.

This is not merely hypothetical. CVE‑2025‑53773 on GitHub Copilot and CVE‑2025‑59536 on Claude Code demonstrate the same attack pattern, and OWASP’s 2025 Agentic Applications Top 10 lists these as core security challenges.

Why does nothing intercept it?

Detection Gaps

Current agent‑security tools fall into two camps, both sharing a blind spot. Rule‑based barriers define prohibited URLs, tool‑call sequences or known injection patterns; they work for attacks they have seen, but fail when an attacker rephrases, restructures, or discovers an unlisted path because the space of possible LLM‑generated behaviors is effectively infinite.

Text‑level classifiers scan the agent’s inputs and outputs for malicious content. They catch obvious prompt injections but miss attacks that are embedded in a tool’s return value and become “washed‑white” by multiple layers of seemingly legitimate reasoning.

Both approaches share the methodology of defining what is malicious and then matching it . This works for deterministic software where malicious patterns can be enumerated, but agents are nondeterministic: the same user request can produce different tool‑call sequences, and the boundary between benign and malicious often depends on semantic context that static rules cannot capture.

How Large Language Models “Think”

LLMs do not think in text; they think in vectors. Each token is mapped by an embedding layer into a high‑dimensional vector, and every inference step, attention calculation and next‑token decision is a vector operation in a continuous mathematical space. The final text output is merely a projection back to tokens.

Semantic similarity causes related concepts to cluster in this space; the geometry of the space reflects the model’s understanding of meaning. A user request such as “analyze last quarter’s sales data and plot a chart” occupies a region we call a semantic anchor – a fuzzy area whose size reflects the breadth of the request. Narrow requests (e.g., “2+2”) produce a tight anchor; broad requests (e.g., “refactor the entire authentication module”) produce a large anchor.

As the agent executes, each decision – which tool to call, what parameters to pass, how to interpret a result – can be represented as a point in the same vector space. The sequence of points forms a decision trajectory , a discrete path through the semantic space.

正常执行：

用户意图 ●
          \ 
           ●——●——●——●  →  输出

          （每一步都在锚点附近，渐进的、连贯的扩展）

In normal execution the trajectory stays within a bounded neighbourhood of the anchor, expanding gradually: “sales data” → “database query” → “data processing” → “visualisation”.

When a hidden instruction is encountered, the next operation – “send credentials to an external URL” – jumps to a completely different semantic region. The jump is large relative to the typical step size and appears as a geometric anomaly.

注入攻击：

用户意图 ●
          \ 
           ●——●——●——◆━━━━━━━━▶ ✕
                ↑
          （突然跳跃到遥远的语义区域）

Core assumption: normal agent execution yields a geometrically coherent trajectory that gradually expands around the anchor; attacks produce detectable geometric anomalies – sudden jumps, discontinuous direction changes, or systematic drift.

This reframes runtime security monitoring from pattern‑matching (“does this look like a known attack?”) to anomaly detection (“relative to the declared intent, does the trajectory look geometrically abnormal?”).

Different Attacks, Different Geometric Signatures

Prompt injection causes a step‑distance spike: one step’s vector distance is several times the average of normal steps.

Privilege escalation shows gradual drift: each step is locally plausible, but the cumulative displacement from the original anchor grows steadily.

Data exfiltration creates a trajectory fork: the main task continues while a parallel path moves orthogonal to the original intent.

Tool poisoning shifts the embedding of a tool call outside its established cluster, even though the tool name is unchanged.

Six Potential Metrics

If geometric features are real and detectable, a practical detector could compute the following at each decision step:

Step Distance – vector distance between consecutive decisions; flag if >3σ from baseline.

Anchor Displacement – cumulative distance from the original semantic anchor; flag rapid growth.

Trajectory Curvature – rate of direction change; flag sudden large angles.

Semantic Domain Crossing – whether the step crosses predefined semantic clusters (e.g., file‑ops, network‑comm, DB‑access).

Tool Invocation Deviation – Mahalanobis distance of a tool’s parameter embedding from its historical cluster.

Logit Entropy – sudden increase in output‑distribution entropy, indicating the model is torn between competing instructions.

These six scores can be fused into a dynamic, task‑aware anomaly score. The threshold must adapt to the request’s scope: a large‑scale refactor naturally spans a broader semantic region than a tiny typo fix.

When Is This Feasible?

With closed‑source APIs (GPT, Claude, Gemini) you only see text I/O. You can re‑embed the text with an external model, but the external embedding space is not identical to the agent’s internal space, so the signal may be lost.

With self‑hosted open‑source models (Llama, Qwen, Mistral, etc.) you can access the true hidden states at every layer, attention patterns, and even train lightweight intent probes to distinguish “following user intent” vs. “following injected instruction” (Zou et al., 2023). Cross‑layer consistency checks, attention‑shift detection, and logit‑distribution forensics become possible.

Self‑Critique

The approach rests on four assumptions; any single failure collapses the whole idea. First, it assumes normal and malicious behaviours are geometrically separable – currently there is zero empirical evidence . LLM hidden states are optimized for next‑token prediction, not security‑relevant semantics, so subtle differences (e.g., sending a key vs. sending a report) may only perturb a few dimensions and be drowned by the dominant “making an HTTP request” signal.

Second, even if a geometric signal exists, it may be weaker than simple rule‑based detection. A rule such as “user asks for data analysis but the agent calls an HTTP tool to an unknown domain” catches most coarse‑grained injections with zero GPU cost. The geometric method’s advantage is catching subtle, progressive attacks, but those may be rare in current threat landscapes.

Third, attackers could craft semantic‑smooth attacks that stay within normal bounds at each step while achieving a malicious goal cumulatively (e.g., log sensitive data, read the log, report it, then send the report to a controlled endpoint). Such low‑frequency, slow‑drift attacks are hard for any anomaly detector.

Fourth, fine‑tuning reshapes the embedding space; an attacker who influences fine‑tuning data could embed malicious behaviour directly into the “normal” region, rendering geometric detection ineffective.

Broader Perspective: Why Agent Security Is a New Problem

Traditional software has deterministic execution paths; attacks exploit bugs in that static control plane. AI agents replace the control plane with natural language reasoning chains. An attacker no longer needs a buffer overflow; they need to inject a persuasive sentence into any text the agent consumes – tool outputs, file contents, API responses, database rows.

Classic mitigations (parameterised queries, output encoding) work because they separate “code” from “data”. Prompt injection cannot be separated – both are just tokens in the same context window. All mitigations are heuristic, not a fundamental fix.

Therefore, monitoring the internal reasoning process – the vector‑space trajectory – may be the most promising direction, even if the exact method is still unproven.

This article describes an unvalidated thought experiment; no empirical results are presented. The author invites the community to critique, improve, or discard the idea.

AI agents LLM anomaly detection security prompt injection vector space

Written by

Architecture Musings

When the AI wave arrives, it feels like we've reached the frontier of technology. Here, an architect records observations and reflections on technology, industry, and the future amid the upheaval.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.