How DeepSeek’s Visual‑Primitive Paradigm Redefines Multimodal Reasoning
DeepSeek has released a multimodal model built on a visual‑primitive reasoning paradigm that treats coordinates and bounding boxes as reasoning units, dramatically compresses visual tokens, and achieves state‑of‑the‑art performance on counting, spatial, and topological tasks, while exposing current limits of multimodal inference.
Background
Existing multimodal large models can perceive images but often fail to reason precisely about visual objects, leading to reference errors in tasks such as counting people in dense crowds or locating components in complex circuit diagrams. DeepSeek calls this the Reference Gap and proposes a complete solution.
Architecture
The model builds on DeepSeek V4‑Flash (284 B total parameters, 13 B active during inference) as the language backbone and a proprietary Vision Transformer that accepts arbitrary resolutions. The core contribution is a training philosophy that teaches the model to reference visual objects with minimal visual tokens.
Core Innovation 1 – Coordinates as Reasoning Units
During inference the model interleaves <|ref|>, <|box|> and <|point|> primitives directly into the chain‑of‑thought. Example:
Scanning the image for a bear, found a <|ref|> bear <|/ref|> <|box|>[[452,23,804,411]]<|/box|>, it is climbing a tree, not on the ground, discard. Then look at the bottom-left, find another <|ref|> bear <|/ref|> <|box|>[[50,447,647,771]]<|/box|>, standing on a rock edge, matches the condition.
These primitives act as anchors that keep the logical chain tied to precise image locations, preventing drift.
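As an illustration of how such a trace could be consumed downstream, here is a minimal parsing sketch. The <|ref|> and <|box|> token names are taken from the example above; the regex and helper function are hypothetical, not DeepSeek's released tooling.

```python
import re

# Hypothetical parser for the primitive tokens shown above. The token names
# (<|ref|>, <|box|>) come from the quoted trace; the regex and helper are
# illustrative only, not DeepSeek's released tooling.
PRIMITIVE_RE = re.compile(
    r"<\|ref\|>\s*(?P<label>.+?)\s*<\|/ref\|>\s*"
    r"<\|box\|>\[\[(?P<coords>[\d,\s]+)\]\]<\|/box\|>"
)

def extract_references(chain_of_thought: str):
    """Return (label, [x1, y1, x2, y2]) anchors found in a reasoning trace."""
    refs = []
    for m in PRIMITIVE_RE.finditer(chain_of_thought):
        coords = [int(c) for c in m.group("coords").split(",")]
        refs.append((m.group("label"), coords))
    return refs

trace = ("found a <|ref|> bear <|/ref|> <|box|>[[452,23,804,411]]<|/box|>, discard. "
         "find another <|ref|> bear <|/ref|> <|box|>[[50,447,647,771]]<|/box|>, matches.")
print(extract_references(trace))
# [('bear', [452, 23, 804, 411]), ('bear', [50, 447, 647, 771])]
```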
Core Innovation 2 – 7 056× Visual Compression
For a 756×756 image the ViT produces 2 916 patch tokens, which are spatially compressed 3×3 to 324 tokens before entering the language model. The built‑in Compressed Sparse Attention (CSA) reduces the key‑value cache fourfold, leaving only 81 visual KV entries. Overall pixel‑to‑cache compression ratio is 7 056×. Compared with Claude Sonnet 4.6 (≈ 870 KV entries) and Gemini‑3‑Flash (≈ 1 100 KV entries), DeepSeek needs only ~90 entries for an 800×800 image.
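The arithmetic behind that ratio can be reproduced in a few lines. This is a sanity-check sketch only; the 14-pixel patch size is an assumption (it is consistent with the stated 2,916 patch tokens for a 756×756 input), and the stage boundaries follow the paragraph above.

```python
# Sanity-check of the compression arithmetic described above.
image_side    = 756
patch_size    = 14                                    # assumed: 756 / 14 = 54 patches per side
patch_tokens  = (image_side // patch_size) ** 2       # 54 * 54 = 2,916 patch tokens
visual_tokens = patch_tokens // (3 * 3)               # 3x3 spatial merge -> 324 tokens
kv_entries    = visual_tokens // 4                    # CSA shrinks the KV cache 4x -> 81 entries
ratio         = (image_side ** 2) / kv_entries        # pixels per cached visual KV entry
print(patch_tokens, visual_tokens, kv_entries, ratio) # 2916 324 81 7056.0
```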
Core Innovation 3 – Cold‑Start Data
~100 k detection datasets were scraped, filtered to ~31.7 k high‑quality sources, and used to generate >40 M training samples across four task families:
Counting (coarse and fine‑grained)
Spatial reasoning & visual QA (GQA, CLEVR)
Maze navigation (~460 k samples) with mazes generated by DFS, Prim's, and Kruskal's algorithms, including “surface‑solvable but actually unsolvable” cases (a generation sketch follows this list)
Path tracing (~125 k samples) requiring disambiguation of intersecting curves without color cues
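As a rough idea of how such maze samples could be produced, the sketch below carves a maze with a standard recursive-backtracker DFS and labels solvability with a BFS check; it then seals the goal cell to mimic an “unsolvable” variant. This is a minimal sketch under those assumptions, not the authors' actual generation pipeline.

```python
import random
from collections import deque

def dfs_maze(width, height, seed=0):
    """Carve a perfect maze on a width x height cell grid with an iterative DFS
    (recursive backtracker); Prim- or Kruskal-based carving would expose the same interface."""
    random.seed(seed)
    moves = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}
    opposite = {"N": "S", "S": "N", "E": "W", "W": "E"}
    walls = {(x, y): dict.fromkeys(moves, True) for x in range(width) for y in range(height)}
    visited, stack = {(0, 0)}, [(0, 0)]
    while stack:
        x, y = stack[-1]
        options = [(d, (x + dx, y + dy)) for d, (dx, dy) in moves.items()
                   if (x + dx, y + dy) in walls and (x + dx, y + dy) not in visited]
        if not options:
            stack.pop()
            continue
        d, nxt = random.choice(options)
        walls[(x, y)][d] = False          # knock down the wall between the two cells
        walls[nxt][opposite[d]] = False
        visited.add(nxt)
        stack.append(nxt)
    return walls

def solvable(walls, start, goal):
    """BFS reachability check used to label a maze sample as solvable or not."""
    moves = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}
    opposite = {"N": "S", "S": "N", "E": "W", "W": "E"}
    seen, queue = {start}, deque([start])
    while queue:
        x, y = queue.popleft()
        if (x, y) == goal:
            return True
        for d, (dx, dy) in moves.items():
            nxt = (x + dx, y + dy)
            if (nxt in walls and nxt not in seen
                    and not walls[(x, y)][d] and not walls[nxt][opposite[d]]):
                seen.add(nxt)
                queue.append(nxt)
    return False

maze = dfs_maze(8, 8)
print(solvable(maze, (0, 0), (7, 7)))    # True: a perfect maze connects every pair of cells
for d in maze[(7, 7)]:                   # seal the goal cell to mimic a surface-solvable, actually unsolvable case
    maze[(7, 7)][d] = True
print(solvable(maze, (0, 0), (7, 7)))    # False
```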
Training Procedure – Separate Experts, Then Merge
Train two expert models—one on bounding‑box data (FTwG) and one on point‑coordinate data (FTwP).
Apply reinforcement learning with the GRPO algorithm, using three parallel rewards: format correctness, quality (LLM‑judged consistency), and task‑specific accuracy (a reward‑combination sketch follows this list).
Perform unified reinforcement fine‑tuning (Unified RFT): re‑initialize from the pre‑trained model and train on rollout data from both experts to obtain a unified model F.
Use on‑policy distillation to bridge any performance gap between the unified student and the experts.
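To make the reward setup above concrete, here is a hedged sketch of how the three parallel rewards might be blended into a scalar and turned into GRPO's group-relative advantages. The weights, the 0–1 scaling, and the function names are hypothetical; only the three reward types and the group-relative normalization come from the article and from GRPO as commonly described.

```python
from statistics import mean, pstdev

def combined_reward(format_ok: bool, quality_score: float, task_correct: bool,
                    w_format=0.2, w_quality=0.3, w_task=0.5):
    """Blend the three parallel rewards named in the article into one scalar.
    Weights and scaling are hypothetical; the article only names the reward types."""
    return (w_format * float(format_ok)
            + w_quality * quality_score          # e.g. an LLM judge's consistency score in [0, 1]
            + w_task * float(task_correct))

def grpo_advantages(rewards):
    """Group-relative advantages as used by GRPO: normalize each rollout's reward
    against the mean and std of its own sampled group (no learned value baseline)."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# One prompt, four sampled rollouts:
group = [combined_reward(True, 0.9, True),
         combined_reward(True, 0.6, False),
         combined_reward(False, 0.4, True),
         combined_reward(True, 0.8, True)]
print(grpo_advantages(group))
```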
Experimental Results
The model was evaluated on 11 benchmarks against Gemini‑3‑Flash, GPT‑5.4, Claude Sonnet 4.6, Gemma‑4‑31B, and Qwen‑3‑VL‑235B. Highlights:
Pixmo‑Count (exact match): 89.2 % (first), beating Gemini‑3‑Flash 88.2 % and GPT‑5.4 76.6 %.
Fine‑grained counting (DS_Finegrained_Counting): 88.7 % (first).
Spatial reasoning benchmarks MIHBench (85.3 %) and SpatialMQA (69.4 %): top rank.
Maze navigation (DS_Maze_Navigation): 66.9 % vs. GPT‑5.4 50.6 % and other models ≈ 49 %.
Path tracing (DS_Path_Tracing): 56.7 % vs. GPT‑5.4 46.5 %.
All frontier models still struggle with topological reasoning.
Limitations
The visual‑primitive mechanism requires an explicit trigger word; the model cannot autonomously decide when to use it.
At very high resolutions primitive coordinates may be imprecise; combining the mechanism with high‑resolution perception methods is a natural next step.
Cross‑scene generalization of point‑coordinate reasoning remains limited.
Conclusion
The core bottleneck of multimodal reasoning is ambiguous language reference rather than insufficient visual perception. By replacing vague textual references with precise spatial anchors, DeepSeek provides a complementary path to larger models and higher resolutions, introducing a “point‑with‑the‑finger” reasoning style.
Project repository: https://github.com/deepseek-ai/Thinking-with-Visual-Primitives
Technical report: https://github.com/deepseek-ai/Thinking-with-Visual-Primitives/blob/main/Thinking_with_Visual_Primitives.pdf