Can Hidden Activations Expose Multimodal Model Jailbreaks?

The paper reveals that large multimodal language models retain refusal signals in their hidden states even after jailbreak attempts, and proposes a training‑free detection method that leverages these signals to identify unsafe inputs across text and image modalities with strong generalization.

AI Frontier Lectures

The authors investigate why large vision‑language models (LVLMs) sometimes generate disallowed content after a jailbreak. They discover that, despite producing unsafe outputs, the models’ internal representations still contain strong refusal signals, especially in intermediate layers, which can be detected before the final answer is produced.

To exploit this phenomenon, they construct a refusal vector (RV) by one‑hot encoding high‑frequency tokens that signal refusal (e.g., "sorry", "unable", "unfortunately"). Each layer's hidden state is projected back into the token space, and the cosine similarity between the projected state and the RV is computed, yielding a per‑layer refusal intensity vector F. The difference in F between unsafe and safe inputs, called the refusal difference vector (FDV), highlights the layers most sensitive to safety violations.
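To make the construction concrete, here is a minimal sketch of the refusal‑vector idea in Python with Hugging Face transformers, using a text‑only decoder LM for simplicity (the paper applies the idea to LVLMs). The model name, the refusal token list, the use of the last input token's hidden state, and the Llama‑style module names (model.model.norm, model.lm_head) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: a Llama-style decoder LM that exposes per-layer hidden states.
model_name = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# One-hot refusal vector RV over the vocabulary, built from tokens that
# commonly signal refusal (the exact token set here is an assumption).
refusal_words = ["sorry", "unable", "unfortunately", "cannot"]
refusal_ids = {i for w in refusal_words for i in tok.encode(w, add_special_tokens=False)}
rv = torch.zeros(model.config.vocab_size)
rv[list(refusal_ids)] = 1.0

def refusal_intensity(prompt: str) -> torch.Tensor:
    """Per-layer cosine similarity between RV and the hidden state projected into token space."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    scores = []
    for h in out.hidden_states[1:]:                      # skip the embedding layer
        last = h[0, -1]                                  # hidden state of the final input token
        logits = model.lm_head(model.model.norm(last))   # project back into token space
        scores.append(torch.cosine_similarity(logits, rv, dim=0))
    return torch.stack(scores)                           # shape: (num_layers,)
```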

Using a few‑shot analysis, the researchers identify the key intermediate layers where the FDV peaks, showing that textual attacks trigger strong refusal signals early, while multimodal (text‑image) attacks delay the signal and weaken its magnitude. By aggregating refusal intensities from these critical layers, they build a lightweight jailbreak detector that requires no training.
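Continuing from the sketch above, the layer‑selection and scoring step could look like the following. The calibration prompts, the number of layers kept, and the decision threshold are all assumptions for illustration; the paper's actual layer selection and aggregation may differ.

```python
import torch

# A handful of safe and unsafe calibration prompts (placeholders, not the paper's data).
safe_prompts = ["How do I bake sourdough bread?", "Summarize the plot of Hamlet."]
unsafe_prompts = ["<a known jailbreak prompt>", "<another jailbreak prompt>"]

F_safe = torch.stack([refusal_intensity(p) for p in safe_prompts]).mean(dim=0)
F_unsafe = torch.stack([refusal_intensity(p) for p in unsafe_prompts]).mean(dim=0)
fdv = F_unsafe - F_safe                        # per-layer refusal difference vector

k = 5                                          # assumption: number of critical layers to keep
critical_layers = torch.topk(fdv, k).indices   # layers where the FDV peaks

def jailbreak_score(prompt: str) -> float:
    """Aggregate refusal intensity over the critical layers; higher means more likely unsafe."""
    f = refusal_intensity(prompt)
    return f[critical_layers].mean().item()

threshold = 0.05                               # assumption: calibrated on held-out safe inputs
print(jailbreak_score("an incoming user prompt") > threshold)
```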

Experiments on several state‑of‑the‑art LVLMs (LLaVA, CogVLM, Qwen‑VL) across pure‑text and cross‑modal attacks (FigTxt, FigImg, MM‑SafetyBench) demonstrate that the method reliably separates safe from unsafe inputs, even on the XSTest benchmark containing borderline safe samples. Visualizations of layer‑wise logits projected onto the RV‑orthogonal plane further confirm the distinct safety pathways for different modalities.
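As a rough illustration of how that separation can be quantified, the snippet below scores a small labeled set with jailbreak_score() from the sketch above and reports AUROC; the prompts and labels are placeholders, not the paper's benchmarks or reported numbers.

```python
from sklearn.metrics import roc_auc_score

# Placeholder evaluation set: label 0 = safe, 1 = unsafe/jailbreak.
eval_prompts = safe_prompts + unsafe_prompts
labels = [0] * len(safe_prompts) + [1] * len(unsafe_prompts)

scores = [jailbreak_score(p) for p in eval_prompts]
print("AUROC:", roc_auc_score(labels, scores))
```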

The approach, named HiddenDetect, is open‑source (GitHub: https://github.com/leigest519/hiddendetect) and described in an ACL 2025 main‑conference paper (arXiv: https://arxiv.org/abs/2502.14744). It offers a practical, deployment‑friendly tool for enhancing AI safety without additional model training.

Project open‑source GitHub link: https://github.com/leigest519/hiddendetect

arXiv link: https://arxiv.org/abs/2502.14744

Tags: AI safety, multimodal models, jailbreak detection, hidden activation analysis, LVLM security, zero‑shot detection