How DeepSeek’s New Paper Redefines Multimodal Reasoning with Visual Primitives

DeepSeek’s new paper "Thinking with Visual Primitives" tackles the reference gap in multimodal models by introducing points and boxes as reasoning units, achieving up to 8× token efficiency and leading benchmark scores in counting, spatial reasoning, and maze navigation compared with GPT‑5.4, Claude‑Sonnet‑4.6 and Gemini‑3‑Flash.

Old Zhang's AI Learning

May 4, 2026

DeepSeek’s paper "Thinking with Visual Primitives" identifies a critical "Reference Gap" in multimodal large language models: while they can perceive objects, they cannot consistently refer to them during reasoning.

The authors propose elevating spatial markers—points and bounding boxes—to the smallest reasoning units, interleaving them with text tokens in a chain‑of‑thought. Example token syntax is shown as <｜point｜>[[309,512]]<｜/point｜> or <｜box｜>[[x1,y1,x2,y2]]<｜/box｜>, anchoring logic to physical coordinates.
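To make the token syntax concrete, here is a minimal sketch of how a downstream tool might pull points and boxes out of a generated chain‑of‑thought. The delimiters are copied from the examples above; the function name `extract_primitives` and the sample sentence are hypothetical, and the paper may define additional token variants that this pattern does not cover.

```python
import re
from typing import List, Tuple

# Delimiters copied from the article; note the full-width bar "｜" (U+FF5C).
PRIMITIVE_RE = re.compile(r"<｜(point|box)｜>\[\[([\d,\s]+)\]\]<｜/\1｜>")


def extract_primitives(cot: str) -> List[Tuple[str, Tuple[int, ...]]]:
    """Return (kind, coordinates) pairs in the order they appear in the text."""
    found = []
    for match in PRIMITIVE_RE.finditer(cot):
        kind = match.group(1)
        coords = tuple(int(v) for v in match.group(2).split(","))
        expected = 2 if kind == "point" else 4  # (x, y) or (x1, y1, x2, y2)
        if len(coords) != expected:
            raise ValueError(f"{kind} expects {expected} values, got {len(coords)}")
        found.append((kind, coords))
    return found


# Hypothetical reasoning snippet, for illustration only.
sample = (
    "The first cup is at <｜point｜>[[309,512]]<｜/point｜>, "
    "inside the tray <｜box｜>[[120,400,640,700]]<｜/box｜>."
)
primitives = extract_primitives(sample)
print(primitives)
# [('point', (309, 512)), ('box', (120, 400, 640, 700))]
```

Because each primitive is plain text in the output stream, the reasoning trace can be audited by overlaying the extracted coordinates on the input image.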

Architecturally, the model builds on DeepSeek‑V4‑Flash (284 B total parameters, 13 B MoE) and a custom DeepSeek‑ViT. On the vision side, a 3×3 spatial compression merges nine adjacent patch tokens into one, Compressed Sparse Attention (CSA) shrinks the KV cache a further fourfold, and the overall compression ratio reaches 7 056×: a 756×756 image (571 536 pixels) is represented by only 81 KV entries.
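The compression figures can be checked with a few lines of arithmetic. The sketch below assumes a ViT patch size of 14 pixels, which the article does not state; it is chosen because it is a common ViT setting and reproduces the quoted numbers exactly. The function name `kv_entries` is hypothetical.

```python
def kv_entries(side_px: int, patch_px: int = 14, merge: int = 3, csa_factor: int = 4) -> int:
    """Count KV entries for a square image after each compression step.

    patch_px=14 is an assumption (not stated in the article).
    """
    patches_per_side = side_px // patch_px      # 756 // 14 = 54
    merged_per_side = patches_per_side // merge  # 3x3 merge: 54 // 3 = 18
    tokens = merged_per_side ** 2                # 18 * 18 = 324 visual tokens
    return tokens // csa_factor                  # CSA, fourfold: 324 // 4 = 81


side = 756
entries = kv_entries(side)
ratio = (side * side) // entries  # pixels per KV entry

print(entries)  # 81
print(ratio)    # 7056
```

Under this assumption, the 7 056× figure is the ratio of input pixels to KV entries, not of patch tokens to KV entries (which would be 2 916 / 81 = 36×).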

The training pipeline has four stages: expert SFT, specialized RL, unified RFT, and on‑policy distillation. Rather than relying on small datasets such as COCO, the team collected web‑scale bounding‑box data, reasoning that bbox annotations are deterministic, generalize to points, and carry richer positional information. The RL stage for maze navigation uses fine‑grained rewards (coverage, exploration completeness, wall accuracy, final path validity) that mirror how a human solves a maze: mark the start and end, explore, backtrack, and output the full path.
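As a rough illustration of what two of these reward terms could look like, the sketch below scores a candidate path on a toy grid maze. Everything here is an assumption made for illustration: the grid encoding (`#` for walls, `S` and `G` for start and goal), the function names, and the equal weights. The paper's actual reward definitions are not given in this article, and the coverage and exploration terms are omitted.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, col)

# Toy maze, hypothetical: '#' is a wall, 'S' the start, 'G' the goal.
MAZE = [
    "S.#",
    ".##",
    "..G",
]


def is_free(grid: List[str], cell: Cell) -> bool:
    """True if the cell is inside the grid and not a wall."""
    r, c = cell
    return 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] != "#"


def wall_accuracy(grid: List[str], path: List[Cell]) -> float:
    """Fraction of path cells that avoid walls and stay in bounds."""
    if not path:
        return 0.0
    return sum(is_free(grid, cell) for cell in path) / len(path)


def path_is_valid(grid: List[str], path: List[Cell]) -> bool:
    """Path must run S -> G over free cells, one orthogonal step at a time."""
    if not path or not all(is_free(grid, cell) for cell in path):
        return False
    (r0, c0), (r1, c1) = path[0], path[-1]
    if grid[r0][c0] != "S" or grid[r1][c1] != "G":
        return False
    return all(abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1 for a, b in zip(path, path[1:]))


def maze_reward(grid: List[str], path: List[Cell],
                w_wall: float = 0.5, w_valid: float = 0.5) -> float:
    """Weighted sum of two terms; the weights are placeholders, not from the paper."""
    return w_wall * wall_accuracy(grid, path) + w_valid * float(path_is_valid(grid, path))


good = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
bad = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]  # walks through two wall cells

print(maze_reward(MAZE, good))  # 1.0
print(maze_reward(MAZE, bad))
```

A dense term such as wall accuracy gives partial credit to paths that are mostly correct, while the validity term rewards only complete solutions; combining the two is one common way to avoid a purely sparse reward.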

Across 11 benchmark tasks, DeepSeek matches or surpasses GPT‑5.4, Claude‑Sonnet‑4.6, and Gemini‑3‑Flash while using only about one‑eighth of their visual token budget. Notable scores include CountQA EM/RA@10 = 64.9/74.1 (second only to Gemini‑3‑Flash), Pixmo‑Count = 89.2 (top), DS_Spatial_Reasoning = 98.7 (ahead by a large margin), DS_Maze_Navigation = 66.9 (runner‑up: 50.6), and DS_Path_Tracing = 56.7 (runner‑up: 46.5). The gaps on the maze and path tasks suggest that pure language chain‑of‑thought struggles with topological reasoning when it lacks visual primitives.

Figure: Token efficiency versus average score.

Case studies illustrate the approach:

Counting Pikachu – the model draws a box around each Pokémon and correctly outputs a count of 6.

Chinese world‑knowledge query – although it saw no Chinese visual‑primitive data in training, the model locates the Golden Gate Bridge, recognizes San Francisco, and answers that the local NBA team is the Warriors.

Maze navigation – using <｜point｜> to mark start (green diamond) and goal (red tag), the model iteratively explores, backtracks at dead ends, and finally outputs a complete feasible path.
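The explore‑and‑backtrack behavior in the maze case resembles a classical depth‑first search. The sketch below is an analogy only, not the model's mechanism: it runs DFS on a toy grid, logs each visit and backtrack, and renders the final path as point tokens. The grid, the function names, and the mapping of grid cells to `[[x,y]]` coordinates (column, row) are all hypothetical.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, col)

# Toy maze, hypothetical: '#' is a wall, 'S' the start, 'G' the goal.
MAZE = [
    "S.#.",
    ".#..",
    "...G",
]


def find(grid: List[str], ch: str) -> Cell:
    """Locate the first cell containing the given character."""
    for r, row in enumerate(grid):
        if ch in row:
            return (r, row.index(ch))
    raise ValueError(f"{ch!r} not found")


def is_open(grid: List[str], cell: Cell) -> bool:
    r, c = cell
    return 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] != "#"


def explore(grid: List[str]) -> Tuple[List[Cell], List[str]]:
    """Depth-first search with explicit backtracking; returns (path, trace)."""
    start, goal = find(grid, "S"), find(grid, "G")
    path, visited = [start], {start}
    trace = [f"visit {start}"]
    while path:
        r, c = path[-1]
        if (r, c) == goal:
            return path, trace
        for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):  # right, down, left, up
            nxt = (r + dr, c + dc)
            if is_open(grid, nxt) and nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                trace.append(f"visit {nxt}")
                break
        else:
            # No unvisited neighbor: dead end, step back one cell.
            trace.append(f"backtrack from {path.pop()}")
    return [], trace


def as_point_tokens(path: List[Cell]) -> str:
    """Render cells as point tokens, using (x, y) = (col, row) as an assumption."""
    return "".join(f"<｜point｜>[[{c},{r}]]<｜/point｜>" for r, c in path)


path, trace = explore(MAZE)
print(path)   # [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
print(trace)  # includes 'backtrack from (0, 1)'
print(as_point_tokens(path))
```

The trace plays the role of the interleaved reasoning: every step is tied to a concrete location, so a dead end can be named and abandoned explicitly instead of being described loosely in words.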

Figure: Chinese world‑knowledge reasoning example.

Key observations from the author:

DeepSeek continues its "compression philosophy"—doing more with fewer tokens—across OCR and now visual‑primitive reasoning.

The Reference Gap is a more fundamental bottleneck than the Perception Gap; mainstream multimodal models collapse on dense counting, maze, and complex scene understanding because language cannot precisely refer to visual space.

Limitations include dependence on input resolution (fine‑grained scenes still suffer bias), reliance on explicit trigger words to activate the mechanism, and imperfect cross‑scene generalization of point‑based reasoning.

For developers, the approach is promising for applications such as complex chart and UI interpretation, dense object counting in retail or warehouse settings, topological reasoning over schematics, maps, or circuit diagrams, and VLM‑based robot path planning. If the DeepSeek‑V4‑VL model is open‑sourced, these scenarios could see a noticeable accuracy boost.

In summary, the paper’s greatest contribution is pinpointing the next hurdle for multimodal AI—shifting from "seeing clearly" to "thinking clearly" by giving models a pointing ability—realized through extreme token efficiency, visual‑primitive chain‑of‑thought, and a specialized expert‑training pipeline.
