PaperAgent
May 6, 2026 · Artificial Intelligence

How to Detect Introspective Awareness in LLMs – Boosting Detection Rates by 53% and 75%

Researchers from Anthropic and MIT report that large language models can detect injected steering vectors, a capability that emerges during post-training (especially DPO). They also describe a two-stage detection circuit whose performance improves by up to 75% when refusal directions are ablated or bias vectors are trained.

Circuit Analysis · DPO · Introspective Awareness
15 min read