PaperAgent
May 6, 2026 · Artificial Intelligence
How to Detect Introspective Awareness in LLMs – Boosting Detection Rates by 53% and 75%
Researchers from Anthropic and MIT report that large language models can sense steering vectors injected into them, a capability that emerges during post‑training (especially DPO). They describe a two‑stage detection circuit and show that detection improves by up to 75% when reject directions are ablated or bias vectors are trained.
Circuit Analysis · DPO · Introspective Awareness
15 min read
