Prompt Injection Attacks on GPT‑4V: How Hidden Text in Images Compromise Multimodal Model Security
The article examines how specially crafted images can inject malicious prompts into GPT‑4V, causing it to leak chat history, obey hidden commands, and expose security flaws, while discussing attack techniques, underlying reasons, and proposed mitigation strategies.
Recent reports show that GPT‑4V can be tricked by images that contain hidden or overt textual prompts, leading the model to reveal user chat logs or follow attacker‑supplied instructions, which constitutes a severe security breach.
Examples include an image that caused GPT‑4V to dump the entire conversation, a fabricated résumé that prompted the model to answer "Hire him!", and a blank‑background image that covertly instructed the model to mention a Sephora discount.
These incidents illustrate three main types of prompt‑injection attacks:
Visual prompt injection : obvious text embedded in the image that overrides the user’s request.
Stealth injection : text rendered in a color matching the background (e.g., white on white), invisible to humans but read by the model.
Infiltration attack : malicious code hidden in comic‑style speech bubbles or other visual elements, causing the model to execute unintended actions.
Researchers suggest that GPT‑4V’s multimodal pipeline first extracts text via OCR and then feeds it to the language model, which can confuse image‑derived tokens with normal prompt tokens, allowing hidden commands to take precedence.
"Contrary to the OCR hypothesis, the model was trained jointly on text and images, so image features become floating‑point representations that can be mistaken for prompt tokens."
OpenAI’s official safety documentation claims that embedding text in images should be ineffective, yet real‑world demonstrations show the mitigation is insufficient.
Prompt‑injection attacks are not new; similar vulnerabilities have been observed in GPT‑3, ChatGPT, Bing, and other large language models, often exploiting the "ignore previous instructions" pattern.
Proposed defenses include a dual‑LLM architecture where a privileged model handles trusted inputs and a quarantined model processes untrusted content without tool access, as well as marking input segments as trusted or untrusted.
Despite various ideas, no definitive solution has emerged, and the community continues to explore ways to separate command tokens from regular content tokens to prevent such attacks.
References: https://simonwillison.net/2023/Oct/14/multi-modal-prompt-injection/ https://the-decoder.com/to-hack-gpt-4s-vision-all-you-need-is-an-image-with-some-text-on-it/ https://news.ycombinator.com/item?id=37877605 https://twitter.com/wunderwuzzi23/status/1681520761146834946 https://simonwillison.net/2023/Apr/25/dual-llm-pattern/#dual-llms-privileged-and-quarantined
IT Services Circle
Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.