Data Party THU
Aug 11, 2025 · Artificial Intelligence
Can Hidden Signals Reveal Multimodal Model Jailbreaks? Introducing HiddenDetect
This article presents HiddenDetect, a training‑free method that leverages refusal‑semantic vectors and layer‑wise activation analysis to detect jailbreak attempts in multimodal large language models, revealing distinct safety signals across text and image modalities and demonstrating strong performance on several LVLM benchmarks.
LVLMactivation analysisjailbreak detection
0 likes · 7 min read
