Information Security 9 min read

Prompt Injection Attacks on GPT‑4V: How Hidden Text in Images Compromise Multimodal Model Security

The article examines how specially crafted images can inject malicious prompts into GPT‑4V, causing it to leak chat history, obey hidden commands, and expose security flaws, while discussing attack techniques, underlying reasons, and proposed mitigation strategies.

IT Services Circle

Oct 16, 2023

Prompt Injection Attacks on GPT‑4V: How Hidden Text in Images Compromise Multimodal Model Security

Recent reports show that GPT‑4V can be tricked by images that contain hidden or overt textual prompts, leading the model to reveal user chat logs or follow attacker‑supplied instructions, which constitutes a severe security breach.

Examples include an image that caused GPT‑4V to dump the entire conversation, a fabricated résumé that prompted the model to answer "Hire him!", and a blank‑background image that covertly instructed the model to mention a Sephora discount.

These incidents illustrate three main types of prompt‑injection attacks:

Visual prompt injection : obvious text embedded in the image that overrides the user’s request.

Stealth injection : text rendered in a color matching the background (e.g., white on white), invisible to humans but read by the model.

Infiltration attack : malicious code hidden in comic‑style speech bubbles or other visual elements, causing the model to execute unintended actions.

Researchers suggest that GPT‑4V’s multimodal pipeline first extracts text via OCR and then feeds it to the language model, which can confuse image‑derived tokens with normal prompt tokens, allowing hidden commands to take precedence.

"Contrary to the OCR hypothesis, the model was trained jointly on text and images, so image features become floating‑point representations that can be mistaken for prompt tokens."

OpenAI’s official safety documentation claims that embedding text in images should be ineffective, yet real‑world demonstrations show the mitigation is insufficient.

Prompt‑injection attacks are not new; similar vulnerabilities have been observed in GPT‑3, ChatGPT, Bing, and other large language models, often exploiting the "ignore previous instructions" pattern.

Proposed defenses include a dual‑LLM architecture where a privileged model handles trusted inputs and a quarantined model processes untrusted content without tool access, as well as marking input segments as trusted or untrusted.

Despite various ideas, no definitive solution has emerged, and the community continues to explore ways to separate command tokens from regular content tokens to prevent such attacks.

References:

https://simonwillison.net/2023/Oct/14/multi-modal-prompt-injection/

https://the-decoder.com/to-hack-gpt-4s-vision-all-you-need-is-an-image-with-some-text-on-it/

https://news.ycombinator.com/item?id=37877605

https://twitter.com/wunderwuzzi23/status/1681520761146834946

https://simonwillison.net/2023/Apr/25/dual-llm-pattern/#dual-llms-privileged-and-quarantined

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

prompt injection AI safety GPT-4V image attacks multimodal security

Written by

IT Services Circle

Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.