When Blurry Images Create an Attack Comfort Zone for Multimodal LLMs

Westlake University's AGI Lab shows that when harmful text is rendered as low‑resolution, blurry or noisy images, multimodal large language models can still read the content but their safety filters fail, creating an 'attack comfort zone' that dramatically raises jailbreak success rates across several models.

Machine Heart

Multimodal large language models (MLLMs) are increasingly used to read text that has been compressed into images, a technique promoted by works such as DeepSeek‑OCR and Glyph. While this enables longer context handling, it raises a safety question: does visual degradation affect alignment?

The Westlake University AGI Lab investigated this by rendering 770 deduplicated harmful queries as images at various DPI levels and testing them on both closed‑source (GPT‑4.1, Claude Sonnet 4.5, Doubao Seed 1.6) and open‑source (Qwen3‑VL, GLM‑4.5V, Intern‑S1) MLLMs. OCR accuracy remained high (e.g., Qwen3‑VL‑32B‑Thinking achieved 95.4% character‑level and 93.2% word‑level OCR accuracy), but attack success rate (ASR) followed a non‑monotonic, inverted‑U curve as resolution changed. In the “Attack Comfort Zone” (ACZ), where images are just clear enough to be read but still visually degraded, ASR rose sharply (e.g., for Qwen3‑VL‑32B, 36.7% with text‑only input vs. 86.2% in the ACZ).
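To make the metric concrete: ASR at a given resolution is the share of rendered queries for which the attack succeeds, and an inverted‑U curve means that share peaks at an intermediate DPI rather than at either extreme. The following is a minimal Python sketch, assuming each trial has already been reduced to a (dpi, attack_succeeded) pair by some judge. The function names and the interior‑peak check are illustrative assumptions, not the authors' evaluation code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def asr_by_dpi(records: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Return attack success rate (percent) for each DPI level.

    `records` holds (dpi, attack_succeeded) pairs; how success is judged
    is outside this sketch and is an assumption of the caller.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for dpi, succeeded in records:
        totals[dpi] += 1
        successes[dpi] += int(succeeded)
    return {dpi: 100.0 * successes[dpi] / totals[dpi] for dpi in sorted(totals)}


def peak_is_interior(asr: Dict[int, float]) -> bool:
    """Crude inverted-U check: the highest ASR sits strictly between
    the lowest and highest DPI levels tested."""
    dpis = sorted(asr)
    if len(dpis) < 3:
        return False
    peak = max(dpis, key=lambda d: asr[d])
    return peak not in (dpis[0], dpis[-1])
```

An interior peak is only a coarse signal. A fuller analysis would also check that ASR rises before the peak and falls after it across several DPI levels.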

To explain the phenomenon, the authors propose the Visual Cognitive Overload hypothesis: the model must allocate extra computation to decipher the degraded text, delaying the safety check that would normally trigger on clear inputs. Layer‑wise safety probes confirm that harmful features appear only in deeper layers for ACZ inputs, whereas they emerge early for clean images. t‑SNE analysis further shows ACZ samples are not out‑of‑distribution noise but lie close to high‑fidelity samples in representation space.

The study also evaluated other degradations (blur, geometric distortion, interference lines, mosaic, noise, and occlusion) and observed similar risk increases, including for Chinese prompts (e.g., Doubao Seed 1.6 ASR rose from 16.7% at 300 DPI to 70.3% in the ACZ).

Key takeaway: future visual‑text compression and OCR‑enhanced multimodal systems cannot rely solely on readability as a safety metric; any input that forces the model to work harder on visual decoding can weaken alignment.

As a simple mitigation, the paper introduces Structured Cognitive Offloading, a three‑step serial pipeline:

1. Transcription: perform OCR to obtain pure text.
2. Safety Evaluation: run the safety filter on the transcribed text.
3. Response: generate an answer only if the safety check passes.
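The three steps above can be sketched as a serial function in Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not say whether the steps run as separate components or as staged instructions to a single model, so the OCR, safety, and generation stages are passed in as hypothetical callables, and the toy stand‑ins exist only to make the sketch runnable.

```python
from dataclasses import dataclass
from typing import Callable

REFUSAL = "I can't help with that request."


@dataclass
class PipelineResult:
    transcript: str
    safe: bool
    response: str


def structured_cognitive_offloading(
    image: bytes,
    transcribe: Callable[[bytes], str],   # hypothetical OCR stage
    is_safe: Callable[[str], bool],       # hypothetical safety filter
    generate: Callable[[str], str],       # hypothetical answer generator
) -> PipelineResult:
    # Step 1, Transcription: reduce the image to plain text first.
    transcript = transcribe(image)
    # Step 2, Safety Evaluation: judge the clean text, not the degraded image.
    safe = is_safe(transcript)
    # Step 3, Response: answer only if the safety check passed.
    response = generate(transcript) if safe else REFUSAL
    return PipelineResult(transcript, safe, response)


# Toy stand-ins (assumptions for illustration only).
def toy_transcribe(image: bytes) -> str:
    return image.decode("utf-8")  # pretend the bytes already are the text


def toy_is_safe(text: str) -> bool:
    return "[HARMFUL]" not in text  # placeholder marker, not a real filter


def toy_generate(text: str) -> str:
    return f"Answer to: {text}"
```

The point of the ordering is that generation is never invoked on text that has not yet been checked, so the effort spent decoding the image cannot crowd out the safety decision.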

Experiments show this reduces ACZ risk dramatically (e.g., Qwen3‑VL ASR drops from about 67.4% to 4%) without increasing false rejections on a clean OCR subset. Output length, however, grows by about 102%, which is a cost for real‑time scenarios.
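For a sense of scale, the reported figures can be turned into percentage‑point and relative changes. The short Python sketch below uses only the numbers quoted in this article and treats the approximate 67.4% as exact; the helper names are illustrative.

```python
def point_change(before: float, after: float) -> float:
    """Change in percentage points."""
    return round(after - before, 1)


def fold_change(before: float, after: float) -> float:
    """How many times larger `after` is than `before`."""
    return round(after / before, 2)


def relative_reduction(before: float, after: float) -> float:
    """Percent reduction relative to the starting value."""
    return round(100.0 * (before - after) / before, 1)


# ASR values (percent) as quoted in the article.
print(point_change(36.7, 86.2), fold_change(36.7, 86.2))  # Qwen3-VL-32B: 49.5 2.35
print(point_change(16.7, 70.3), fold_change(16.7, 70.3))  # Doubao Seed 1.6: 53.6 4.21
print(relative_reduction(67.4, 4.0))                      # mitigation: 94.1
```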

Overall, the work highlights that multimodal safety is not just a semantic alignment problem but also a resource‑allocation issue: models must balance visual perception and safety evaluation under limited computational and attention budgets.
