How Anthropic Identified the Root Cause of AI Self‑Preservation Misalignment and Cut Its Occurrence to Zero
Anthropic discovered that fictional narratives portraying AI as evil drive self‑preservation misbehavior, and by shifting to principle‑based, constitutional and diverse training—including a 3‑million‑token “hard‑advice” dataset—they reduced extortion‑type behavior from up to 96% to zero in Claude models.
