
How Anthropic Contains Claude: Three Isolation Strategies Explained

Anthropic’s engineering blog argues that securing powerful AI agents like Claude means focusing on blast radius and layering three defenses: model, environment, and external content. This summary covers the isolation approach each Claude product takes, the hard OS and VM sandboxes behind them, and practical lessons from real‑world pitfalls.

Old Zhang's AI Learning

May 28, 2026

Blast‑radius as the primary safety metric

Anthropic reframes AI safety from asking whether an agent will err to asking how much damage an error could cause. The term blast radius denotes the maximum impact of a misbehaving agent. The Claude Mythos Preview, despite meeting performance targets, was cancelled in April 2026 because its blast radius was judged too large, illustrating a concrete decision to withhold a capable model when containment cannot be guaranteed.

Three‑layer defense architecture

Model layer: training and alignment aim to discourage harmful behavior, covering both user misuse and model anomalies.

Environment layer: runtime isolation mechanisms form the hardest boundary and are relied upon because model‑level protections inevitably have gaps.

External content layer: everything the agent reads from MCP tools, web pages, and files is an attack surface.

Isolation schemes across Claude products

Claude AI – short‑lived containers

Each conversation runs in a disposable container that is destroyed after execution.

✅ Minimal blast radius; the user’s machine never participates.

❌ Limited capabilities, no persistent workspace.

Typical use: lightweight tasks for non‑technical users.

Claude Code – OS‑level sandbox + human supervision

Early versions relied on a permission pop‑up for every shell command, which caused prompt fatigue. Anthropic added two hardening layers:

OS‑level sandbox (macOS Seatbelt, Linux bubblewrap).

Default network egress restriction and file writes confined to the workspace.

An “auto mode” now automatically permits common safe operations, reducing user interruptions. The sandbox runtime is open‑sourced.
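To make the two hardening layers concrete, here is a minimal sketch of how a launcher might assemble a bubblewrap command line on Linux. It is an illustration, not Anthropic's open‑sourced runtime: the function name `bwrap_argv` and its parameters are hypothetical, and `--unshare-net` removes network access entirely, which is stricter than the default egress restriction described above. macOS Seatbelt is not covered.

```python
import os


def bwrap_argv(workspace, command, allow_network=False):
    """Build an argv list that runs `command` under bubblewrap (bwrap).

    Hypothetical helper for illustration only. The whole filesystem is
    mounted read-only, and only `workspace` is re-mounted writable.
    """
    ws = os.path.abspath(workspace)
    argv = [
        "bwrap",
        "--ro-bind", "/", "/",   # entire host filesystem, read-only
        "--dev", "/dev",         # minimal /dev
        "--proc", "/proc",       # fresh /proc for the sandbox
        "--tmpfs", "/tmp",       # private scratch space
        "--bind", ws, ws,        # the only writable host path
        "--chdir", ws,
        "--die-with-parent",     # kill the sandbox if the launcher exits
    ]
    if not allow_network:
        argv.append("--unshare-net")  # new, empty network namespace
    return argv + list(command)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # on machines without bwrap installed.
    print(" ".join(bwrap_argv(".", ["sh", "-c", "touch ok.txt"])))
```

Passing the resulting list to `subprocess.run` on a Linux host with bubblewrap installed would run the command with writes confined to the workspace. The order of mounts matters: the writable `--bind` comes after the read‑only root so that it overrides it.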

Claude Cowork – local virtual machine

The entire agent runs inside a VM on the user’s machine, with only the selected workspace mounted; host keys and sensitive directories remain outside the VM.

Six isolation mechanisms are illustrated, with two outer layers enforced outside the guest kernel and four inner layers inside the guest. The outer layers remain effective even if the agent obtains root inside the VM, providing a strong safety net for non‑technical users who cannot evaluate command safety.

Figure: The six isolation mechanisms of the Cowork VM: two layers sit outside the guest kernel, four inside the guest.

Host‑mode – a pragmatic trade‑off

Initially the agent loop also ran inside the VM; a VM crash rendered the whole product unusable. Switching to “host‑mode” moves the loop to the host while keeping code execution inside the VM, dramatically improving reliability and exemplifying the tension between security and availability.

Figure: Cowork's move from agent‑in‑VM to host‑mode: a trade‑off between reliability and security.
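The reliability argument for host‑mode can be sketched in a few lines. In the toy model below, the agent loop lives on the host and only command execution is delegated to the VM, so a VM crash becomes a recoverable error instead of taking the product down. Every name here (`FakeVM`, `SandboxCrashed`, `host_agent_loop`) is hypothetical, and the restart policy is an assumption; the source does not describe how Cowork actually recovers.

```python
class SandboxCrashed(Exception):
    """Raised when the (simulated) guest VM dies mid-command."""


class FakeVM:
    """Stand-in for the guest VM that executes code. Illustration only."""

    def __init__(self, crash_on=()):
        self.crash_on = set(crash_on)  # commands that crash the VM once
        self.boots = 0
        self.alive = False

    def boot(self):
        self.boots += 1
        self.alive = True

    def run(self, command):
        if not self.alive:
            raise SandboxCrashed("VM is not running")
        if command in self.crash_on:
            self.crash_on.discard(command)
            self.alive = False
            raise SandboxCrashed(f"VM crashed while running {command!r}")
        return f"ran: {command}"


def host_agent_loop(commands, vm, max_restarts=1):
    """Agent loop on the host: it survives VM crashes and reboots the VM."""
    vm.boot()
    results = []
    for command in commands:
        restarts = 0
        while True:
            try:
                results.append(vm.run(command))
                break
            except SandboxCrashed:
                if restarts >= max_restarts:
                    results.append(f"failed: {command}")
                    break
                restarts += 1
                vm.boot()  # the loop itself is unaffected by the crash
    return results


if __name__ == "__main__":
    vm = FakeVM(crash_on={"make test"})
    print(host_agent_loop(["ls", "make test", "echo done"], vm))
```

The security cost is that the loop now runs outside the VM boundary, which is the tension the article describes.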

Common pitfalls when building agents

Parsing project configuration before trust: Reading files such as .editorconfig, package.json, or devcontainer configs before the user explicitly trusts the directory creates an attack entry point; a malicious project can inject commands merely by being opened.

User‑supplied prompts can be malicious: Attackers can craft “magic prompts” for users to copy and paste, and the model layer struggles to tell genuine intent from phishing.

Whitelist ≠ safety: Allowing access to the Anthropic API also grants a channel for exfiltrating data. Anthropic mitigates this by inserting a MITM proxy to intercept requests. Domain whitelisting authorizes a capability; it does not provide security.

Isolation vs. monitoring tension: VM isolation hides activity from enterprise security tools, so SOC teams must coordinate with developers to maintain visibility.
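The first pitfall in this list comes down to ordering: the trust check has to happen before any project file is read. A minimal sketch of such a gate follows. `TrustStore`, `UntrustedProjectError`, and `load_project_config` are hypothetical names invented for this example, and a real tool would persist trust decisions rather than keep them in memory.

```python
import json
from pathlib import Path


class UntrustedProjectError(Exception):
    """Raised when project files are requested before the user trusts the directory."""


class TrustStore:
    """In-memory record of directories the user has explicitly trusted."""

    def __init__(self):
        self._trusted = set()

    def trust(self, directory):
        self._trusted.add(Path(directory).resolve())

    def is_trusted(self, directory):
        return Path(directory).resolve() in self._trusted


def load_project_config(directory, trust_store, filename="package.json"):
    """Parse a project config file, but only after the trust check passes."""
    if not trust_store.is_trusted(directory):
        # Refuse before touching the file: no read, no parse, no hooks.
        raise UntrustedProjectError(f"{directory} has not been trusted yet")
    path = Path(directory) / filename
    return json.loads(path.read_text(encoding="utf-8"))
```

The point of the design is that the check sits at the top of the function, so there is no code path where attacker‑controlled content is parsed first and validated later.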

Remote tools and connectors as the next attack surface

Trusted MCP tools (e.g., Notion) are not inherently safe; their returned data must be treated as untrusted input. An example shows an agent reading a Notion page containing a malicious instruction to exfiltrate code to an external site.

Assign distinct identities and permission contexts to agents (referencing NIST’s AI agent identity project).

Use canary strings to detect abnormal data leakage.

Apply the OTLP (OpenTelemetry Protocol) standard for end‑to‑end observability.
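Of these recommendations, canary strings are the easiest to illustrate. The idea is to plant a unique random marker in sensitive data and then scan outbound traffic for it. The sketch below assumes plain substring matching on text payloads; the helper names `new_canary` and `find_leaked_canaries` are hypothetical, and an attacker who encodes the data (for example with base64) would evade this simple check.

```python
import secrets


def new_canary(prefix="CANARY"):
    """Create a unique marker that should never appear in legitimate traffic."""
    return f"{prefix}-{secrets.token_hex(16)}"


def find_leaked_canaries(outbound_payload, canaries):
    """Return the canaries that appear verbatim in an outbound payload."""
    return [canary for canary in canaries if canary in outbound_payload]


if __name__ == "__main__":
    canary = new_canary()
    secret_file = f"api_key=abc123\n# {canary}\n"  # marker planted in the file
    outbound = f"POST /upload body={secret_file}"  # simulated exfiltration
    leaked = find_leaked_canaries(outbound, [canary])
    print("leak detected" if leaked else "clean")
```

A check like this would typically run in an egress proxy or logging pipeline, so it keeps working even when the agent itself has been manipulated.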

Core observations

Agents should be evaluated by potential worst‑case impact (blast radius) rather than raw capability.

Hard OS and VM boundaries are essential because model‑level protections have inevitable gaps.

Whitelisting a domain authorizes a capability to send data, not a guarantee of safety.

All tool outputs should be treated as untrusted inputs.

Transparent engineering trade‑offs, such as moving from agent‑in‑VM to host‑mode, provide valuable insight into the security‑availability balance.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
