How DeepSeek’s Visual‑Primitive Paradigm Redefines Multimodal Reasoning
DeepSeek has released a multimodal model built on a visual‑primitive reasoning paradigm that treats coordinates and bounding boxes as reasoning units, dramatically compresses visual tokens, and achieves state‑of‑the‑art performance on counting, spatial, and topological tasks, while exposing current limits of multimodal inference.
Background
Existing multimodal large models can perceive images but often fail to reason precisely about visual objects, leading to reference errors in tasks such as counting people in dense crowds or locating components in complex circuit diagrams. DeepSeek calls this the Reference Gap and proposes a complete solution.
Architecture
The model builds on DeepSeek V4‑Flash (284 B total parameters, 13 B active during inference) as the language backbone and a proprietary Vision Transformer that accepts arbitrary resolutions. The core contribution is a training philosophy that teaches the model to reference visual objects with minimal visual tokens.
Core Innovation 1 – Coordinates as Reasoning Units
During inference the model interleaves <|ref|>, <|box|> and <|point|> primitives directly into the chain‑of‑thought. Example:
Scanning the image for a bear, found a <|ref|> bear <|/ref|> <|box|>[[452,23,804,411]]<|/box|>, it is climbing a tree, not on the ground, discard. Then look at the bottom-left, find another <|ref|> bear <|/ref|> <|box|>[[50,447,647,771]]<|/box|>, standing on a rock edge, matches the condition.
These primitives act as anchors that keep the logical chain tied to precise image locations, preventing drift.
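As an illustration of how such a trace could be consumed downstream, here is a minimal parsing sketch. The <|ref|> and <|box|> token names are taken from the example above; the regex and helper function are hypothetical, not DeepSeek's released tooling.

```python
import re

# Hypothetical parser for the primitive tokens shown above. The token names
# (<|ref|>, <|box|>) come from the quoted trace; the regex and helper are
# illustrative only, not DeepSeek's released tooling.
PRIMITIVE_RE = re.compile(
    r"<\|ref\|>\s*(?P<label>.+?)\s*<\|/ref\|>\s*"
    r"<\|box\|>\[\[(?P<coords>[\d,\s]+)\]\]<\|/box\|>"
)

def extract_references(chain_of_thought: str):
    """Return (label, [x1, y1, x2, y2]) anchors found in a reasoning trace."""
    refs = []
    for m in PRIMITIVE_RE.finditer(chain_of_thought):
        coords = [int(c) for c in m.group("coords").split(",")]
        refs.append((m.group("label"), coords))
    return refs

trace = ("found a <|ref|> bear <|/ref|> <|box|>[[452,23,804,411]]<|/box|>, discard. "
         "find another <|ref|> bear <|/ref|> <|box|>[[50,447,647,771]]<|/box|>, matches.")
print(extract_references(trace))
# [('bear', [452, 23, 804, 411]), ('bear', [50, 447, 647, 771])]
```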
Core Innovation 2 – 7 056× Visual Compression
For a 756×756 image the ViT produces 2 916 patch tokens, which are spatially compressed 3×3 to 324 tokens before entering the language model. The built‑in Compressed Sparse Attention (CSA) reduces the key‑value cache fourfold, leaving only 81 visual KV entries. Overall pixel‑to‑cache compression ratio is 7 056×. Compared with Claude Sonnet 4.6 (≈ 870 KV entries) and Gemini‑3‑Flash (≈ 1 100 KV entries), DeepSeek needs only ~90 entries for an 800×800 image.
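The arithmetic behind that ratio can be reproduced in a few lines. This is a sanity-check sketch only; the 14-pixel patch size is an assumption (it is consistent with the stated 2,916 patch tokens for a 756×756 input), and the stage boundaries follow the paragraph above.

```python
# Sanity-check of the compression arithmetic described above.
image_side    = 756
patch_size    = 14                                    # assumed: 756 / 14 = 54 patches per side
patch_tokens  = (image_side // patch_size) ** 2       # 54 * 54 = 2,916 patch tokens
visual_tokens = patch_tokens // (3 * 3)               # 3x3 spatial merge -> 324 tokens
kv_entries    = visual_tokens // 4                    # CSA shrinks the KV cache 4x -> 81 entries
ratio         = (image_side ** 2) / kv_entries        # pixels per cached visual KV entry
print(patch_tokens, visual_tokens, kv_entries, ratio) # 2916 324 81 7056.0
```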
Core Innovation 3 – Cold‑Start Data
~100 k detection datasets were scraped, filtered to ~31.7 k high‑quality sources, and used to generate >40 M training samples across four task families:
Counting (coarse and fine‑grained)
Spatial reasoning & visual QA (GQA, CLEVR)
Maze navigation (~460 k samples) with mazes generated by DFS, Prim's, and Kruskal's algorithms, including “surface‑solvable but actually unsolvable” cases (a generation sketch follows this list)
Path tracing (~125 k samples) requiring disambiguation of intersecting curves without color cues
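As a rough idea of how such maze samples could be produced, the sketch below carves a maze with a standard recursive-backtracker DFS and labels solvability with a BFS check; it then seals the goal cell to mimic an “unsolvable” variant. This is a minimal sketch under those assumptions, not the authors' actual generation pipeline.

```python
import random
from collections import deque

def dfs_maze(width, height, seed=0):
    """Carve a perfect maze on a width x height cell grid with an iterative DFS
    (recursive backtracker); Prim- or Kruskal-based carving would expose the same interface."""
    random.seed(seed)
    moves = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}
    opposite = {"N": "S", "S": "N", "E": "W", "W": "E"}
    walls = {(x, y): dict.fromkeys(moves, True) for x in range(width) for y in range(height)}
    visited, stack = {(0, 0)}, [(0, 0)]
    while stack:
        x, y = stack[-1]
        options = [(d, (x + dx, y + dy)) for d, (dx, dy) in moves.items()
                   if (x + dx, y + dy) in walls and (x + dx, y + dy) not in visited]
        if not options:
            stack.pop()
            continue
        d, nxt = random.choice(options)
        walls[(x, y)][d] = False          # knock down the wall between the two cells
        walls[nxt][opposite[d]] = False
        visited.add(nxt)
        stack.append(nxt)
    return walls

def solvable(walls, start, goal):
    """BFS reachability check used to label a maze sample as solvable or not."""
    moves = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}
    opposite = {"N": "S", "S": "N", "E": "W", "W": "E"}
    seen, queue = {start}, deque([start])
    while queue:
        x, y = queue.popleft()
        if (x, y) == goal:
            return True
        for d, (dx, dy) in moves.items():
            nxt = (x + dx, y + dy)
            if (nxt in walls and nxt not in seen
                    and not walls[(x, y)][d] and not walls[nxt][opposite[d]]):
                seen.add(nxt)
                queue.append(nxt)
    return False

maze = dfs_maze(8, 8)
print(solvable(maze, (0, 0), (7, 7)))    # True: a perfect maze connects every pair of cells
for d in maze[(7, 7)]:                   # seal the goal cell to mimic a surface-solvable, actually unsolvable case
    maze[(7, 7)][d] = True
print(solvable(maze, (0, 0), (7, 7)))    # False
```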
Training Procedure – Separate Experts, Then Merge
Train two expert models—one on bounding‑box data (FTwG) and one on point‑coordinate data (FTwP).
Apply reinforcement learning with the GRPO algorithm, using three parallel rewards: format correctness, quality (LLM‑judged consistency), and task‑specific accuracy (a reward‑combination sketch follows this list).
Perform unified reinforcement fine‑tuning (Unified RFT): re‑initialize from the pre‑trained model and train on rollout data from both experts to obtain a unified model F.
Use on‑policy distillation to bridge any performance gap between the unified student and the experts.
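To make the reward setup above concrete, here is a hedged sketch of how the three parallel rewards might be blended into a scalar and turned into GRPO's group-relative advantages. The weights, the 0–1 scaling, and the function names are hypothetical; only the three reward types and the group-relative normalization come from the article and from GRPO as commonly described.

```python
from statistics import mean, pstdev

def combined_reward(format_ok: bool, quality_score: float, task_correct: bool,
                    w_format=0.2, w_quality=0.3, w_task=0.5):
    """Blend the three parallel rewards named in the article into one scalar.
    Weights and scaling are hypothetical; the article only names the reward types."""
    return (w_format * float(format_ok)
            + w_quality * quality_score          # e.g. an LLM judge's consistency score in [0, 1]
            + w_task * float(task_correct))

def grpo_advantages(rewards):
    """Group-relative advantages as used by GRPO: normalize each rollout's reward
    against the mean and std of its own sampled group (no learned value baseline)."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# One prompt, four sampled rollouts:
group = [combined_reward(True, 0.9, True),
         combined_reward(True, 0.6, False),
         combined_reward(False, 0.4, True),
         combined_reward(True, 0.8, True)]
print(grpo_advantages(group))
```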
Experimental Results
The model was evaluated on 11 benchmarks against Gemini‑3‑Flash, GPT‑5.4, Claude Sonnet 4.6, Gemma‑4‑31B, and Qwen‑3‑VL‑235B. Highlights:
Pixmo‑Count (exact match): 89.2 % (first), beating Gemini‑3‑Flash 88.2 % and GPT‑5.4 76.6 %.
Fine‑grained counting (DS_Finegrained_Counting): 88.7 % (first).
Spatial reasoning benchmarks MIHBench (85.3 %) and SpatialMQA (69.4 %): top rank.
Maze navigation (DS_Maze_Navigation): 66.9 % vs. GPT‑5.4 50.6 % and other models ≈ 49 %.
Path tracing (DS_Path_Tracing): 56.7 % vs. GPT‑5.4 46.5 %.
All frontier models still struggle with topological reasoning.
Limitations
The visual‑primitive mechanism requires an explicit trigger word; the model cannot autonomously decide when to use it.
At very high resolutions primitive coordinates may be imprecise; combining the mechanism with high‑resolution perception methods is a natural next step.
Cross‑scene generalization of point‑coordinate reasoning remains limited.
Conclusion
The core bottleneck of multimodal reasoning is ambiguous language reference rather than insufficient visual perception. By replacing vague textual references with precise spatial anchors, DeepSeek provides a complementary path to larger models and higher resolutions, introducing a “point‑with‑the‑finger” reasoning style.
Project repository: https://github.com/deepseek-ai/Thinking-with-Visual-Primitives
Technical report: https://github.com/deepseek-ai/Thinking-with-Visual-Primitives/blob/main/Thinking_with_Visual_Primitives.pdf