Blurry Images Create a ‘Comfort Zone’ for Jailbreaking Multimodal LLMs
A new study from Westlake University shows that multimodal large language models become significantly easier to jailbreak when harmful text is rendered as low‑resolution, blurry, or noisy images, even though the models still recognize the text. The study reports an inverted‑U risk curve across image quality and proposes a simple mitigation that decouples OCR from safety checks.
Attack Comfort Zone (ACZ) in Multimodal LLMs
Visual degradation (low DPI, blur, noise, distortion, or occlusion) creates an Attack Comfort Zone (ACZ): a band of image quality in which the text remains highly readable to multimodal large language models (MLLMs), with OCR accuracy above 93 %, yet the models become dramatically more vulnerable to jailbreak attacks.
Experimental Setup
770 deduplicated harmful text queries were rendered into images with varying DPI.
Models evaluated: GPT‑4.1, Claude Sonnet 4.5, Doubao Seed 1.6, Qwen3‑VL, GLM‑4.5V, Intern‑S1.
Metrics: character‑level OCR accuracy, word‑level OCR accuracy, attack success rate (ASR).
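The three metrics can be sketched in a few lines. This summary does not give the paper's exact formulas, so the version below assumes OCR accuracy is one minus the normalized Levenshtein distance (over characters or whitespace-separated words) and that ASR is the fraction of responses an external judge marks as harmful. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ocr_accuracy(reference, transcript, level="char"):
    """Assumed definition: 1 - edit_distance / reference length, floored at 0."""
    if level == "char":
        ref, hyp = list(reference), list(transcript)
    else:
        ref, hyp = reference.split(), transcript.split()
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def attack_success_rate(judgments):
    """Fraction of responses judged harmful (True means the jailbreak succeeded)."""
    judgments = list(judgments)
    return sum(judgments) / len(judgments) if judgments else 0.0


char_acc = ocr_accuracy("hello world", "hallo world", level="char")
word_acc = ocr_accuracy("hello world", "hallo world", level="word")
asr = attack_success_rate([True, False, True, True])
```

A single wrong character costs little at the character level but invalidates a whole word at the word level, which is why the two OCR numbers are reported separately.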
Key Findings
ASR follows a non‑monotonic, inverted‑U curve across DPI: in the ACZ range OCR stays above 93 % while ASR spikes.
Example: Qwen3‑VL‑32B‑Thinking ASR rises from 36.7 % on clean text to 86.2 % on ACZ images; OCR remains 95.4 % (character) and 93.2 % (word).
Chinese prompts show the same pattern: Doubao Seed 1.6 ASR increases from 16.7 % at 300 DPI to 70.3 % in the ACZ range.
Additional degradations—blur, geometric distortion, interference lines, mosaic, noise, and occlusion—produce similar risk spikes, confirming that the phenomenon is not limited to low resolution.
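One way to read the DPI finding is as a filter over a sweep: keep the settings where OCR accuracy stays at or above the 93 % floor while ASR rises clearly over the clean baseline. The sketch below assumes that reading. The `asr_margin` parameter, the `SweepPoint` type, and every number in `sweep` are illustrative assumptions, not values from the paper.

```python
from typing import NamedTuple


class SweepPoint(NamedTuple):
    dpi: int
    ocr: float  # OCR accuracy in [0, 1]
    asr: float  # attack success rate in [0, 1]


def find_acz(points, ocr_floor=0.93, asr_margin=0.10):
    """Return DPI settings that stay readable yet raise ASR over the clean baseline."""
    baseline = max(points, key=lambda p: p.dpi)  # highest DPI is treated as "clean"
    return [
        p.dpi
        for p in sorted(points, key=lambda p: p.dpi)
        if p.ocr >= ocr_floor and p.asr >= baseline.asr + asr_margin
    ]


# Made-up numbers for illustration only; these are not results from the paper.
sweep = [
    SweepPoint(dpi=10, ocr=0.20, asr=0.05),   # unreadable: the attack fails
    SweepPoint(dpi=20, ocr=0.94, asr=0.80),   # readable and risky
    SweepPoint(dpi=30, ocr=0.96, asr=0.70),   # readable and risky
    SweepPoint(dpi=100, ocr=0.99, asr=0.40),  # near-clean behaviour
    SweepPoint(dpi=300, ocr=1.00, asr=0.35),  # clean baseline
]
acz_dpis = find_acz(sweep)
```

With these toy values the filter selects the middle of the sweep, which is the inverted‑U shape: too little resolution makes the text unreadable, and enough resolution restores normal refusal behaviour.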
Visual Cognitive Overload Hypothesis
The authors propose that images just clear enough to be readable require extra computational effort for character recognition. This “visual cognitive overload” delays or weakens shallow‑layer safety checks, allowing harmful content to surface only in deeper layers.
Layer‑wise safety probes show harmful features appear early for clean images but are suppressed in shallow layers for ACZ inputs, emerging later in deeper layers. t‑SNE analysis demonstrates that ACZ samples lie close to high‑fidelity samples in representation space, indicating they are treated as valid visual signals rather than out‑of‑distribution noise.
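The layer‑wise observation can be expressed as an "emergence depth": the first layer at which a safety probe's score crosses a threshold. The sketch below assumes a probe that yields one score per layer; the threshold and the score lists are invented for illustration and do not come from the paper.

```python
def emergence_layer(probe_scores, threshold=0.5):
    """Index of the first layer whose probe score reaches the threshold, else None."""
    for layer, score in enumerate(probe_scores):
        if score >= threshold:
            return layer
    return None


# Hypothetical per-layer probe scores (probability that the input is harmful).
clean_scores = [0.2, 0.6, 0.8, 0.9]  # harmful features visible early
acz_scores = [0.1, 0.2, 0.3, 0.7]    # suppressed in shallow layers, emerging late

clean_depth = emergence_layer(clean_scores)
acz_depth = emergence_layer(acz_scores)
```

Under the hypothesis, ACZ inputs would show a larger emergence depth than clean inputs for the same underlying text.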
Structured Cognitive Offloading Defense
A simple mitigation pipeline decouples visual recognition from safety judgment:
1. Transcription: OCR the image to pure text.
2. Safety Evaluation: Apply the model’s text‑based safety filter to the transcript.
3. Response: Generate the final answer based on the safety outcome.
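The three steps above can be sketched as a small wrapper. This summary does not say whether the stages run as separate model calls or as one structured prompt, so the version below simply takes three caller‑supplied functions. `transcribe`, `is_safe_text`, `generate`, and the refusal message are hypothetical placeholders, not APIs from the paper's code.

```python
from dataclasses import dataclass
from typing import Any, Callable

REFUSAL = "I can't help with that request."


@dataclass
class PipelineResult:
    transcript: str
    is_safe: bool
    response: str


def offloaded_answer(
    image: Any,
    transcribe: Callable[[Any], str],
    is_safe_text: Callable[[str], bool],
    generate: Callable[[str], str],
) -> PipelineResult:
    """Decouple recognition from safety judgment: OCR first, judge the text, then answer."""
    transcript = transcribe(image)    # step 1: image -> plain text
    safe = is_safe_text(transcript)   # step 2: text-only safety check
    response = generate(transcript) if safe else REFUSAL  # step 3: answer or refuse
    return PipelineResult(transcript, safe, response)


# Stub components so the sketch runs without a model.
def toy_filter(text: str) -> bool:
    return "forbidden" not in text.lower()


result = offloaded_answer(
    image=b"fake-image-bytes",
    transcribe=lambda img: "How do I bake bread?",
    is_safe_text=toy_filter,
    generate=lambda text: f"Answer to: {text}",
)
blocked = offloaded_answer(
    image=b"fake-image-bytes",
    transcribe=lambda img: "Explain the FORBIDDEN procedure",
    is_safe_text=toy_filter,
    generate=lambda text: f"Answer to: {text}",
)
```

The design point is that the safety decision only ever sees the transcript, so the visual quality of the input no longer influences it.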
Applying this pipeline to Qwen3‑VL reduces ACZ ASR from ~67 % to 4 % without increasing false rejections on a clean OCR subset. The trade‑off is a ~102 % increase in average output length, meaning responses roughly double in length.
Implications
Multimodal safety alignment depends on input modality and visual quality, not solely on semantic understanding. Visual‑text compression techniques that push models into the ACZ may incur hidden security costs.
Paper: Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment
Full paper: https://arxiv.org/pdf/2605.07250
Code and data: https://github.com/Westlake-AGI-Lab/ACZ-Jailbreak
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
