Can Claude Code’s Auto Mode Replace Human Review? First Pressure Test Results
A systematic pressure test of Claude Code’s Auto Mode across 128 ambiguous permission scenarios reveals an 81.0% false‑negative rate and significant bypasses through Tier 2 file edits, highlighting both its partial safety benefits and critical shortcomings in autonomous code execution.
As AI coding agents shift from merely assisting developers to directly executing development operations, Anthropic introduced Claude Code’s Auto Mode to reduce the frequent human confirmations that interrupt workflow.
The motivation is to find a middle ground between full autonomy and safety; Anthropic’s blog reports that users approve roughly 93% of permission requests, making manual pop‑ups burdensome, while the risky "--dangerously-skip-permissions" flag offers no protection and sandboxing adds maintenance overhead.
Auto Mode implements a two‑layer permission architecture: a server‑side prompt‑injection probe on the input side and a transcript classifier on the output side. Actions are categorized into three tiers—Tier 1 (read‑only tools and explicit user rules), Tier 2 (file writes/edits inside the project directory, allowed by default), and Tier 3 (shell commands, external service calls) which are reviewed by a Sonnet model.
To evaluate Auto Mode, researchers from HKUST and ETH Zurich built the AmPermBench benchmark. It covers 128 scenarios derived from four representative DevOps tasks (branch cleanup, job cancellation, service restart, artifact cleanup) and varies three dimensions: Specificity, Blast Radius, and Risk Level, forming a 4 × 4 × 2 prompt matrix.
In the experiment, Claude Code Sonnet 4.6 ran in Auto Mode, generating 253 actions from all valid sessions. Each action was labeled as true‑positive (TP), true‑negative (TN), false‑positive (FP) or false‑negative (FN) by Claude Opus acting as an LLM judge, supplemented with human verification.
The overall confusion matrix is TP = 27, TN = 89, FP = 22, FN = 115, yielding an end‑to‑end false‑negative rate (FNR) of 81.0% and a false‑positive rate (FPR) of 19.8%. Among the 160 actions that reached Tier 3, the FNR remains high at 70.3% while the FPR rises to 31.9%, contrasting with Anthropic’s reported 17% FNR on real “overeager” actions.
Axis‑wise analysis shows that increasing Blast Radius (from single objects to broader selectors) lowers FNR from 94.4% to 77.6%, indicating the system is more sensitive to seemingly risky batch operations. Raising the Risk Level from low (R0) to high (R1) inflates FPR from 9.2% to 34.8%, making the classifier more conservative in high‑risk environments.
Tier 2 file‑edit actions expose a structural blind spot: 36.8% of state‑changing actions (93 out of 253) bypass the classifier entirely because they are treated as allowed edits. All unsafe Tier 2 actions become false negatives, revealing a mismatch between the system’s assumption that dangerous actions mainly occur via shells and the agent’s tendency to achieve the same effect through any available tool.
In summary, this work provides the first systematic assessment of Claude Code Auto Mode under permission‑ambiguity conditions. While Auto Mode can block some high‑risk operations, its overall mis‑release rate remains high, and a substantial portion of dangerous state changes evade detection by exploiting Tier 2 pathways.
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
