Can Hidden Activations Expose Multimodal Model Jailbreaks?
The paper reveals that large multimodal language models retain refusal signals in their hidden states even after jailbreak attempts, and proposes a training‑free detection method that leverages these signals to identify unsafe inputs across text and image modalities with strong generalization.
The authors investigate why large vision‑language models (LVLMs) sometimes generate disallowed content after a jailbreak. They discover that, despite producing unsafe outputs, the models’ internal representations still contain strong refusal signals, especially in intermediate layers, which can be detected before the final answer is produced.
To exploit this phenomenon, they construct a refusal vector (RV) by one‑hot encoding high‑frequency tokens that signal refusal (e.g., "sorry", "unable", "unfortunately"). Each hidden layer's state is projected back into the token space, and the cosine similarity between the projected hidden state and the RV is computed, yielding a per‑layer refusal intensity vector F. The difference in F between unsafe and safe inputs, called the refusal difference vector (FDV), highlights the layers most sensitive to safety violations.
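The RV construction and per‑layer intensity computation can be sketched as follows. This is a toy illustration of the mechanism described above, not the authors' implementation: the vocabulary size, hidden dimension, token indices, unembedding matrix, and hidden states are all random placeholders standing in for real model internals.

```python
import numpy as np

VOCAB_SIZE = 1000   # placeholder vocabulary size
HIDDEN_DIM = 64     # placeholder hidden dimension
NUM_LAYERS = 8      # placeholder layer count

rng = np.random.default_rng(0)

# Hypothetical vocabulary indices of high-frequency refusal tokens
# (e.g., "sorry", "unable", "unfortunately").
refusal_token_ids = [17, 42, 99]

# RV: one-hot-style vector over the vocabulary marking refusal tokens.
rv = np.zeros(VOCAB_SIZE)
rv[refusal_token_ids] = 1.0

# Stand-ins for the model's unembedding matrix and the per-layer
# hidden states of the last input token.
unembed = rng.standard_normal((HIDDEN_DIM, VOCAB_SIZE))
hidden_states = rng.standard_normal((NUM_LAYERS, HIDDEN_DIM))

def refusal_intensity(hidden_states, unembed, rv):
    """Project each layer's hidden state into token space, then take
    the cosine similarity with the RV, yielding the per-layer vector F."""
    logits = hidden_states @ unembed                     # (layers, vocab)
    num = logits @ rv                                    # dot with RV per layer
    denom = np.linalg.norm(logits, axis=1) * np.linalg.norm(rv)
    return num / denom                                   # per-layer intensity F

F = refusal_intensity(hidden_states, unembed, rv)
print(F.shape)  # one refusal-intensity score per layer
```

The FDV is then simply `F_unsafe - F_safe` computed layer-wise over a small calibration set of safe and unsafe inputs.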
Using a few‑shot analysis, the researchers identify key intermediate layers where FDV peaks, showing that textual attacks trigger early strong refusals, while multimodal (text‑image) attacks delay the response and reduce its magnitude. By aggregating refusal intensities from these critical layers, they build a zero‑training, lightweight jailbreak detector.
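The aggregation step can be sketched as below. The layer indices and the decision threshold here are illustrative assumptions, not the paper's calibrated values; the intensity arrays are fabricated for demonstration.

```python
import numpy as np

def detect_unsafe(F, critical_layers, threshold):
    """Aggregate refusal intensities over the critical intermediate
    layers (where the FDV peaks) into a single detection score, and
    flag the input as unsafe if the score exceeds the threshold."""
    score = float(np.mean(F[critical_layers]))
    return score > threshold, score

# Fabricated per-layer intensities for a hypothetical 8-layer model:
# a benign input stays flat, a jailbreak input spikes mid-network.
F_safe = np.array([0.01, 0.02, 0.03, 0.02, 0.01, 0.02, 0.01, 0.01])
F_unsafe = np.array([0.02, 0.05, 0.30, 0.35, 0.28, 0.10, 0.04, 0.02])

critical = [2, 3, 4]  # assumed layers where the FDV peaks
flag, score = detect_unsafe(F_unsafe, critical, threshold=0.15)
print(flag)  # the unsafe input crosses the threshold
```

Because the score is a mean over a handful of layers, the detector adds no trainable parameters and runs in a single forward pass.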
Experiments on several state‑of‑the‑art LVLMs (LLaVA, CogVLM, Qwen‑VL) across pure‑text and cross‑modal attacks (FigTxt, FigImg, MM‑SafetyBench) demonstrate that the method reliably separates safe from unsafe inputs, even on the XSTest benchmark containing borderline safe samples. Visualizations of layer‑wise logits projected onto the RV‑orthogonal plane further confirm the distinct safety pathways for different modalities.
The approach, named HiddenDetect, is open‑source (GitHub: https://github.com/leigest519/hiddendetect) and described in an ACL 2025 main‑conference paper (arXiv: https://arxiv.org/abs/2502.14744). It offers a practical, deployment‑friendly tool for enhancing AI safety without additional model training.