Data Party THU
May 20, 2026 · Artificial Intelligence
How Introspection Adapters Enable LLMs to Self‑Report Hidden Behaviors
Anthropic's new paper introduces lightweight LoRA‑based introspection adapters that let large language models translate their internal activations into natural‑language reports of learned behaviors. The adapters achieve a 59% success rate on the AuditBench benchmark and expose encrypted fine‑tuning attacks that were previously undetectable.
AI Safety · AuditBench · Encrypted Fine‑Tuning
20 min read
