May 6, 2026 · Artificial Intelligence

How to Detect Introspective Awareness in LLMs – Boosting Detection Rates by 53% and 75%

Researchers from Anthropic and MIT report that large language models can sense injected steering vectors, a capability that emerges during post-training (especially DPO). They also present a two-stage detection circuit whose performance improves by up to 75% when reject directions are ablated or bias vectors are trained.

Circuit Analysis · DPO · Introspective Awareness

15 min read

Steering Vectors
