Can LLMs ‘Squint’ to Recognize Hidden Faces? A Comparative Test
The article evaluates several large language models—including ChatGPT, Gemini, Grok, Qwen, and o3‑Pro—on a visual illusion that requires squinting to identify the Mona Lisa, revealing varied success rates, reasoning differences, and insights into model capabilities and limitations.
Background
A visual puzzle created by Japanese artist Kitagawa Akiyoshi appears ambiguous at first glance but reveals the Mona Lisa when the viewer squints. The experiment was designed to evaluate whether large language models (LLMs) with multimodal capabilities can solve such perception‑based visual reasoning tasks when given a hint.
Test Setup
The original image was supplied to each model together with a prompt encouraging it to look closely or to squint. In practice this can be simulated by asking the model to apply a blurring or low‑resolution transformation before attempting identification.
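The blurring/low‑resolution simulation described above can be sketched with Pillow. This is a minimal illustration, not the experiment's actual code; the function name and parameters are hypothetical, and it assumes Pillow is installed.

```python
from PIL import Image, ImageFilter


def simulate_squint(img: Image.Image, radius: float = 5.0, scale: int = 8) -> Image.Image:
    """Approximate squinting: blur away the fine detail that masks the hidden face."""
    gray = img.convert("L")  # drop color; squinting mostly preserves coarse luminance
    blurred = gray.filter(ImageFilter.GaussianBlur(radius=radius))
    # Downscale, then upscale back: discards high-frequency texture
    # while keeping the coarse structure a squinting viewer would see.
    small = blurred.resize((max(1, img.width // scale), max(1, img.height // scale)))
    return small.resize((img.width, img.height))
```

The degraded image can then be supplied to the model in place of (or alongside) the original.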
Models Evaluated
ChatGPT: Recognized the picture as a visual distortion and outlined the face, but misidentified the subject. Deep‑thinking mode did not produce a correct answer.
Gemini: Detected a side‑profile silhouette but guessed the wrong person.
Grok: Reported inability to recognize the image and requested a clearer version.
Doubao (Chinese model): Similar to Gemini; after selecting "deep thinking" it incorrectly concluded the figure was Einstein.
Qwen3‑235B‑A22B: In deep‑thinking mode it noticed a side‑profile silhouette but could not name the person.
Yuanbao and iFlytek Spark: Produced responses (including images) that failed to identify the figure.
o3‑Pro: The only model that answered correctly on the first attempt. Its reasoning log shows a chain of image‑processing operations (rotation, contrast enhancement, cropping) performed via built‑in tools, after which the Mona Lisa became recognizable.
GPT‑4o: Required three attempts; the final correct answer appeared to be a lucky guess rather than systematic reasoning.
Key Observations
Most multimodal LLMs can detect that the input is a face silhouette but lack the ability to infer the identity without additional visual manipulation.
Enabling tool usage (e.g., Python‑based image transformations) markedly improves performance, as demonstrated by o3‑Pro.
Deep‑thinking or “analysis” modes do not automatically yield correct results; they often repeat the same misidentification.
Some models may appear to succeed by chance (e.g., GPT‑4o) rather than by a robust reasoning pipeline.
Practical Recommendations
When testing visual puzzles that rely on subtle perception tricks, explicitly ask the model to apply transformations such as blur(image, radius=5), rotate(image, angle=15), or adjust_contrast(image, factor=1.5) before identification.
Enable multimodal tool usage (Python, image‑processing libraries) in the model’s settings to allow it to manipulate the image internally.
Use prompts that describe the desired “squinting” effect, e.g., “Please blur the image slightly to simulate squinting and then tell me who is depicted.”
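The transformation calls named above (`blur`, `rotate`, `adjust_contrast`) are pseudocode; a hedged sketch of what they might look like, assuming Pillow as the backing library, is:

```python
from PIL import Image, ImageEnhance, ImageFilter


# Hypothetical helpers matching the pseudocode names in the recommendations,
# implemented here with Pillow as one possible backend.
def blur(image: Image.Image, radius: float = 5) -> Image.Image:
    return image.filter(ImageFilter.GaussianBlur(radius=radius))


def rotate(image: Image.Image, angle: float = 15) -> Image.Image:
    return image.rotate(angle, expand=True)  # expand so corners are not clipped


def adjust_contrast(image: Image.Image, factor: float = 1.5) -> Image.Image:
    return ImageEnhance.Contrast(image).enhance(factor)


def preprocess(image: Image.Image) -> Image.Image:
    """Chain the transformations before asking a model to identify the subject."""
    return adjust_contrast(rotate(blur(image)))
```

A model with tool access could run a chain like this itself; for models without tools, applying it locally and sending the result is the fallback.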