AI Frontier Lectures
Jul 27, 2025 · Information Security
Can Hidden Activations Expose Multimodal Model Jailbreaks?
The paper shows that large multimodal language models retain refusal signals in their hidden states even after successful jailbreak attempts. Building on this observation, it proposes a training‑free detection method that uses these residual signals to flag unsafe inputs across both text and image modalities, with strong generalization to unseen attacks.
AI safety · LVLM security · hidden activation analysis
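The core idea in the abstract can be sketched as follows: if unsafe prompts shift a model's hidden activations along a shared "refusal direction," a training-free detector can estimate that direction from the difference of class means and threshold the projection of new activations onto it. Everything below is a hypothetical illustration on synthetic activations, not the paper's actual method or data; the dimension, shift magnitude, and threshold choice are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64  # hypothetical hidden-state dimension

# Synthetic stand-in: unsafe inputs shift activations along one shared direction.
refusal_dir = rng.normal(size=dim)
refusal_dir /= np.linalg.norm(refusal_dir)

safe = rng.normal(size=(100, dim))                      # benign activations
unsafe = rng.normal(size=(100, dim)) + 4.0 * refusal_dir  # jailbreak activations

# Training-free "probe": difference of class means, no gradient updates.
direction = unsafe.mean(axis=0) - safe.mean(axis=0)
direction /= np.linalg.norm(direction)

def score(h):
    """Projection of hidden state(s) onto the estimated refusal direction."""
    return h @ direction

# Illustrative threshold: midpoint between the two mean scores.
threshold = (score(safe).mean() + score(unsafe).mean()) / 2

detection_rate = (score(unsafe) > threshold).mean()
false_positive_rate = (score(safe) > threshold).mean()
```

On this toy data the mean-difference direction recovers the planted refusal direction almost exactly, so the thresholded projection separates the two classes well; in practice the activations would come from a forward pass over a labeled calibration set rather than be sampled synthetically.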
