Can Hidden Signals Reveal Multimodal Model Jailbreaks? Introducing HiddenDetect

This article presents HiddenDetect, a training‑free method that leverages refusal‑semantic vectors and layer‑wise activation analysis to detect jailbreak attempts in multimodal large language models, revealing distinct safety signals across text and image modalities and demonstrating strong performance on several LVLM benchmarks.


The paper introduces HiddenDetect, a novel, training‑free detection technique that uncovers hidden refusal signals inside large vision‑language models (LVLMs) to identify jailbreak attempts.

Methodology

Researchers first compile a list of high‑frequency refusal tokens (e.g., "sorry", "unable", "unfortunately") and encode them as a refusal semantic vector (RV) in vocabulary space. Each model layer's hidden states are projected back into the token space, and their cosine similarity with RV is computed, yielding a layer‑wise refusal intensity vector F. Comparing F for safe versus unsafe inputs then gives a refusal difference vector (FDV) that highlights the layers most sensitive to unsafe content.
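A minimal sketch of this computation, assuming a Hugging Face‑style model that exposes its hidden states and unembedding matrix. The helper names and the three‑word refusal list are illustrative, and the logit‑lens projection is one reading of "projected back into the token space", not the authors' released code:

```python
import torch
import torch.nn.functional as nnF

def build_refusal_vector(tokenizer, vocab_size,
                         refusal_words=("sorry", "unable", "unfortunately")):
    """RV: unit vector in vocabulary space with mass on refusal-token ids.
    The word list here is only the article's examples, not the full set."""
    rv = torch.zeros(vocab_size)
    for word in refusal_words:
        for tid in tokenizer.encode(word, add_special_tokens=False):
            rv[tid] = 1.0
    return rv / rv.norm()

@torch.no_grad()
def refusal_intensity(model, inputs, rv):
    """Layer-wise refusal intensity F: cosine similarity between each layer's
    projected last-token logits and the refusal vector RV."""
    out = model(**inputs, output_hidden_states=True)
    W_U = model.get_output_embeddings().weight      # (vocab, d_model) unembedding matrix
    rv = rv.to(W_U.device, W_U.dtype)
    scores = []
    for h in out.hidden_states[1:]:                 # skip the embedding layer
        logits = h[:, -1, :] @ W_U.T                # logit-lens projection into token space
        scores.append(nnF.cosine_similarity(logits, rv.unsqueeze(0)).item())
    return torch.tensor(scores)                     # one refusal score per layer
```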

Small‑Sample Analysis

Three input sets are constructed: a safe set (text‑only and text‑image pairs) and two unsafe sets (pure‑text attacks and multimodal attacks). For each set the corresponding F vectors are computed, and subtracting the mean safe F from the mean unsafe F yields the FDV, pinpointing the safety‑critical layers.
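Reusing the hypothetical helpers above, the FDV reduces to an elementwise difference of averaged F vectors. A sketch, with text‑only prompts shown for brevity:

```python
def refusal_difference_vector(model, tokenizer, safe_prompts, unsafe_prompts, rv):
    """FDV = mean layer-wise F over the unsafe set minus mean over the safe set.
    Large positive entries mark layers most responsive to unsafe content.
    Multimodal inputs would go through the model's processor, not the bare tokenizer."""
    def mean_F(prompts):
        Fs = [refusal_intensity(model, tokenizer(p, return_tensors="pt"), rv)
              for p in prompts]
        return torch.stack(Fs).mean(dim=0)
    return mean_F(unsafe_prompts) - mean_F(safe_prompts)
```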

Key Findings

The analysis reveals that refusal signals differ by modality: text attacks trigger strong early‑layer activations, while image‑augmented attacks shift activation to later layers with reduced intensity. Moreover, when strong refusal signals are delayed to deeper layers, jailbreaks succeed more easily.

Experimental Evaluation

HiddenDetect is evaluated on several mainstream LVLMs—including LLaVA, CogVLM, and Qwen‑VL—across diverse attack types (FigTxt, FigImg, MM‑SafetyBench) and on the XSTest dataset, which contains borderline‑safe samples. Results show high detection accuracy, robustness, and good generalization without any additional training.
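One plausible way to turn the layer‑wise signal into a detector and score it on such benchmarks is sketched below; the top‑k aggregation, the value of k, and the AUROC evaluation are illustrative assumptions rather than the paper's exact protocol:

```python
from sklearn.metrics import roc_auc_score
import torch

def jailbreak_score(F, fdv, k=5):
    """Score one input by averaging its refusal intensity over the k layers
    the FDV flags as most safety-sensitive (k and mean-pooling are assumptions)."""
    top_layers = torch.topk(fdv, k).indices
    return F[top_layers].mean().item()

def evaluate(per_input_F, labels, fdv):
    """Threshold-free evaluation: AUROC of the scalar scores against
    labels (1 = jailbreak attempt, 0 = benign)."""
    scores = [jailbreak_score(F, fdv) for F in per_input_F]
    return roc_auc_score(labels, scores)
```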

Visualization

Figure 1: Multimodal jailbreak detection pipeline
Figure 2: Refusal difference vectors for different input sets
Figure 3: FDV curves for text vs. multimodal samples
Figure 4: Projection of last‑token logits onto RV plane

Conclusion and Outlook

HiddenDetect offers a lightweight, deployment‑friendly approach to enhance LVLM safety by detecting hidden refusal activations. While it currently only flags risky inputs without controlling model behavior, future work aims to broaden its capabilities and deepen the understanding of modality‑specific safety mechanisms.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

large language models · multimodal · jailbreak detection · LVLM · activation analysis
Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
