Can Hidden Signals Reveal Multimodal Model Jailbreaks? Introducing HiddenDetect

This article presents HiddenDetect, a training‑free method that leverages refusal‑semantic vectors and layer‑wise activation analysis to detect jailbreak attempts in multimodal large language models, revealing distinct safety signals across text and image modalities and demonstrating strong performance on several LVLM benchmarks.


The paper introduces HiddenDetect, a novel, training‑free detection technique that uncovers hidden refusal signals inside large vision‑language models (LVLMs) to identify jailbreak attempts.

Methodology

Researchers first compile a list of high‑frequency refusal tokens (e.g., "sorry", "unable", "unfortunately") and encode them as a refusal semantic vector (RV) in vocabulary space. Each model layer's hidden states are projected back into the token space, and their cosine similarity with RV is computed, yielding a layer‑wise refusal intensity vector F. Comparing F for safe versus unsafe inputs then gives a refusal difference vector (FDV) that highlights the layers most sensitive to unsafe content.
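A minimal sketch of this computation, assuming a Hugging Face‑style model that exposes its hidden states and unembedding matrix. The helper names and the three‑word refusal list are illustrative, and the logit‑lens projection is one reading of "projected back into the token space", not the authors' released code:

```python
import torch
import torch.nn.functional as nnF

def build_refusal_vector(tokenizer, vocab_size,
                         refusal_words=("sorry", "unable", "unfortunately")):
    """RV: unit vector in vocabulary space with mass on refusal-token ids.
    The word list here is only the article's examples, not the full set."""
    rv = torch.zeros(vocab_size)
    for word in refusal_words:
        for tid in tokenizer.encode(word, add_special_tokens=False):
            rv[tid] = 1.0
    return rv / rv.norm()

@torch.no_grad()
def refusal_intensity(model, inputs, rv):
    """Layer-wise refusal intensity F: cosine similarity between each layer's
    projected last-token logits and the refusal vector RV."""
    out = model(**inputs, output_hidden_states=True)
    W_U = model.get_output_embeddings().weight      # (vocab, d_model) unembedding matrix
    rv = rv.to(W_U.device, W_U.dtype)
    scores = []
    for h in out.hidden_states[1:]:                 # skip the embedding layer
        logits = h[:, -1, :] @ W_U.T                # logit-lens projection into token space
        scores.append(nnF.cosine_similarity(logits, rv.unsqueeze(0)).item())
    return torch.tensor(scores)                     # one refusal score per layer
```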

Small‑Sample Analysis

Three input sets are constructed: a safe set (text‑only and text‑image pairs) and two unsafe sets (pure‑text attacks and multimodal attacks). For each set the corresponding F vectors are computed, and subtracting the mean safe F from the mean unsafe F yields the FDV, pinpointing the safety‑critical layers.
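Reusing the hypothetical helpers above, the FDV reduces to an elementwise difference of averaged F vectors. A sketch, with text‑only prompts shown for brevity:

```python
def refusal_difference_vector(model, tokenizer, safe_prompts, unsafe_prompts, rv):
    """FDV = mean layer-wise F over the unsafe set minus mean over the safe set.
    Large positive entries mark layers most responsive to unsafe content.
    Multimodal inputs would go through the model's processor, not the bare tokenizer."""
    def mean_F(prompts):
        Fs = [refusal_intensity(model, tokenizer(p, return_tensors="pt"), rv)
              for p in prompts]
        return torch.stack(Fs).mean(dim=0)
    return mean_F(unsafe_prompts) - mean_F(safe_prompts)
```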

Key Findings

The analysis reveals that refusal signals differ by modality: text attacks trigger strong early‑layer activations, while image‑augmented attacks shift activation to later layers with reduced intensity. Moreover, when strong refusal signals are delayed to deeper layers, jailbreaks succeed more easily.

Experimental Evaluation

HiddenDetect is evaluated on several mainstream LVLMs—including LLaVA, CogVLM, and Qwen‑VL—across diverse attack types (FigTxt, FigImg, MM‑SafetyBench) and on the XSTest dataset, which contains borderline‑safe samples. Results show high detection accuracy, robustness, and good generalization without any additional training.
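One plausible way to turn the layer‑wise signal into a detector and score it on such benchmarks is sketched below; the top‑k aggregation, the value of k, and the AUROC evaluation are illustrative assumptions rather than the paper's exact protocol:

```python
from sklearn.metrics import roc_auc_score
import torch

def jailbreak_score(F, fdv, k=5):
    """Score one input by averaging its refusal intensity over the k layers
    the FDV flags as most safety-sensitive (k and mean-pooling are assumptions)."""
    top_layers = torch.topk(fdv, k).indices
    return F[top_layers].mean().item()

def evaluate(per_input_F, labels, fdv):
    """Threshold-free evaluation: AUROC of the scalar scores against
    labels (1 = jailbreak attempt, 0 = benign)."""
    scores = [jailbreak_score(F, fdv) for F in per_input_F]
    return roc_auc_score(labels, scores)
```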

Visualization

Figure 1: Multimodal jailbreak detection pipeline
Figure 2: Refusal difference vectors for different input sets
Figure 3: FDV curves for text vs. multimodal samples
Figure 4: Projection of last‑token logits onto RV plane

Conclusion and Outlook

HiddenDetect offers a lightweight, deployment‑friendly approach to enhance LVLM safety by detecting hidden refusal activations. While it currently only flags risky inputs without controlling model behavior, future work aims to broaden its capabilities and deepen the understanding of modality‑specific safety mechanisms.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

large language models · multimodal · jailbreak detection · LVLM · activation analysis
Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
