DeepSeek’s New Image‑Recognition Mode Struggles to Identify Its Own CEO
After DeepSeek fully launched its image‑recognition mode, a hands‑on test found that the model can spot well‑known figures such as Jensen Huang, but it misreads text, fails on Chinese handwriting, cannot recognize its own CEO Liang Wenfeng, and, like Gemini, GPT 5.5 and Claude, stumbles on music‑theory reasoning.
DeepSeek rolled out the full version of its image‑recognition mode on both the mobile app and the web platform just before the Dragon Boat Festival. The feature, previously available only to a limited group of users in a gradual rollout, is now open to the public.
Our team opened the app and tried several test images. The first was a photo of Nvidia CEO Jensen Huang drinking douzhi, a fermented mung‑bean drink, on a Beijing snack street. DeepSeek correctly identified Huang but ignored the "豆汁" (douzhi) label on the bottle and misclassified the drink as milk. Its reading of Huang’s facial expression was also inaccurate.
Switching to DeepSeek’s “deep thinking” mode did not fix the text‑recognition problem: the model still missed the "尹三豆汁" (Yin San Douzhi) characters on the bottle, though it inferred from world knowledge that the drink was "豆汁" (douzhi). Its expression analysis did not change.
Social‑media users also tested the model on other celebrities, such as He Tongxue, and ran into similar misidentifications. Notably, DeepSeek failed to recognize its own founder and CEO, Liang Wenfeng. The model appears to rely heavily on facial features and on matching against publicly available images, an approach that works poorly for less distinctive faces.
The model enforces strict safety checks: uploading a recent popular image of Lei Jun triggered a "possible policy violation" warning.
In a Chinese handwriting test, the model recognized only 3 of 7 characters, which points to limited ability in real‑world text recognition, in handling domain‑specific vocabulary, and in using semantic context to correct misread characters.
When presented with an artifact image, DeepSeek correctly identified the style as Mughal Empire and provided a detailed craft analysis, though it could not pinpoint the artifact’s exact origin.
In a visual puzzle that asked for the two identical socks, DeepSeek failed to find the matching pair (the correct answer is the third sock in the first row and the second sock in the third row).
In a piano‑chord test, we uploaded a photo of a piano keyboard and asked the model to name the chord. The model answered incorrectly, even though the notes can be deduced from the keyboard’s repeating pattern of two‑black‑key and three‑black‑key groups: the chord is A, C and E.
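The last step of that deduction, going from the note names to a chord name, can be sketched in a few lines of Python. This is only an illustrative sketch and not part of the original test: the function `name_triad` and the two lookup tables are hypothetical names introduced here, and the code assumes the notes have already been read off the keyboard (the white key just left of a two‑black‑key group is C, which fixes every other note name). It does no image recognition.

```python
# Pitch classes in semitone order, starting from C.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Semitone distances of the two upper notes above the root, per triad quality.
TRIADS = {
    (4, 7): "major",
    (3, 7): "minor",
    (3, 6): "diminished",
    (4, 8): "augmented",
}


def name_triad(notes):
    """Name a three-note chord given in any order or inversion.

    Returns a string such as 'A minor', or None if the notes do not
    form one of the four basic triads listed in TRIADS.
    """
    pitch_classes = sorted({NOTE_NAMES.index(n) for n in notes})
    if len(pitch_classes) != 3:
        return None
    # Try each note as the root and compare the resulting intervals.
    for root in pitch_classes:
        intervals = tuple(
            sorted((pc - root) % 12 for pc in pitch_classes if pc != root)
        )
        if intervals in TRIADS:
            return f"{NOTE_NAMES[root]} {TRIADS[intervals]}"
    return None


print(name_triad(["A", "C", "E"]))  # A minor
```

With A as the root, C lies 3 semitones above and E lies 7 semitones above, which is the interval pattern of a minor triad, so the notes A, C and E form an A minor chord.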
For comparison, we put the same chord question to Gemini 3.5 Flash, GPT 5.5, and Claude Sonnet 4.6. None answered correctly, and Claude stopped responding altogether, which highlights how limited music‑theory reasoning remains in current large multimodal models.
The article concludes with open questions from DeepSeek’s multimodal team, such as how this mode relates to DeepSeek 4.1, whether it uses a native multimodal architecture, and when a multimodal API will be released. These queries were posted by researcher Xiaokang Chen, who has not yet provided answers.
Future documentation from DeepSeek may clarify these technical details.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
