DeepSeek Unveils Open‑Source Multimodal Model: “Thinking with Visual Primitives”
DeepSeek releases an open‑source multimodal LLM that introduces a visual‑primitive framework—elevating bounding boxes and points to token level—to close the reference gap, achieve extreme KV‑cache compression, and outperform GPT‑5.4, Claude‑Sonnet‑4.6 and Gemini‑3‑Flash on counting, spatial reasoning, maze navigation and path‑tracing benchmarks.
1. Problem Redefinition: From Perception Gap to Reference Gap
Current multimodal large language models (MLLMs) perform Chain‑of‑Thought reasoning almost entirely in language space. Even when cutting‑edge models narrow the Perception Gap with high‑resolution cropping or dynamic tiling, they still suffer logical collapse on dense counting, topological navigation and multi‑step spatial reasoning.
The authors identify a deeper bottleneck called the Reference Gap: natural language is fuzzy and continuous, while visual space is precise and discrete. Describing an object such as “the second red object on the left” discards the exact spatial anchor, causing the reasoning chain to drift from the image and produce cascading hallucinations.
Humans resolve this by pointing with a finger, anchoring abstract semantics to concrete coordinates and reducing working‑memory load.
The paper proposes “Thinking with Visual Primitives”: elevate bounding boxes and points to the same level as language tokens and interleave them directly into the model’s reasoning trace, enabling the model to “point while it reasons”.
2. Architecture and Training Pipeline: Balancing Efficiency and Specialized Capability
2.1 Architecture Design
The model follows a LLaVA‑style architecture. DeepSeek‑V4‑Flash (284 B total parameters, 13 B activated MoE parameters) serves as the language backbone, while the visual encoder is the proprietary DeepSeek‑ViT, which supports arbitrary‑resolution inputs.
Key compression techniques:
14×14 Patch Embedding: split the image into basic patches.
3×3 Spatial Compression: compress nine adjacent patches into one token along the channel dimension.
Compressed Sparse Attention (CSA): further compress visual tokens at the LLM's KV‑cache layer.
Example: a 756×756 image (571,536 pixels) → the ViT produces 2,916 patch tokens → 3×3 compression yields 324 tokens → CSA leaves only 81 visual KV entries, an overall compression ratio of 7,056:1.
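The arithmetic of the worked example can be checked with a short sketch. The function below is illustrative, not the model's actual API; in particular, the CSA retention ratio of 1/4 is an assumption chosen so that 324 tokens become the 81 KV entries quoted above.

```python
# Sketch of the token-compression arithmetic described above.
# `csa_keep` is an assumed retention ratio for Compressed Sparse Attention.
def visual_token_budget(width, height, patch=14, spatial=3, csa_keep=0.25):
    """Estimate visual token counts at each stage of the pipeline."""
    patches = (width // patch) * (height // patch)   # 14x14 patch embedding
    compressed = patches // (spatial * spatial)      # 3x3 spatial merge
    kv_entries = int(compressed * csa_keep)          # CSA in the KV cache
    ratio = (width * height) // max(kv_entries, 1)   # pixels per KV entry
    return patches, compressed, kv_entries, ratio

patches, compressed, kv, ratio = visual_token_budget(756, 756)
print(patches, compressed, kv, ratio)  # 2916 324 81 7056
```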
2.2 Five‑Stage Post‑Training Process
Pretraining: pre‑train on tens of trillions of multimodal tokens to give the model basic visual‑primitive generation ability.
Specialized SFT: build cold‑start datasets for Box (FTwG) and Point (FTwP) tasks, fine‑tuning a separate expert on each to avoid modality conflict.
Specialized RL: apply GRPO reinforcement learning to each expert model with three reward components (format, quality, accuracy).
Unified RFT: generate rejection‑sampling data from the two experts and train a single fused model.
On‑Policy Distillation: distill the expert output distributions into the unified model via reverse KL.
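The reverse‑KL objective in the final stage can be sketched in miniature. Real training computes this over token logits with gradients flowing through the student; the toy version below works on plain probability lists and is only meant to show the direction of the divergence.

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher): the mode-seeking direction, so the unified
    student concentrates on modes the expert teacher rates highly."""
    return sum(s * math.log((s + eps) / (t + eps))
               for s, t in zip(student_probs, teacher_probs))

# Identical distributions give zero divergence; a student that diverges
# from the teacher incurs a positive penalty.
print(round(reverse_kl([0.5, 0.5], [0.5, 0.5]), 6))  # 0.0
print(reverse_kl([0.9, 0.1], [0.5, 0.5]) > 0)        # True
```

Forward KL (teacher ‖ student) would instead be mean‑covering; the mode‑seeking reverse direction is a common choice when distilling a sharper expert policy into one student.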
3. Cold‑Start Data Construction: Fine‑Grained Design of Four Reasoning Scenarios
3.1 Counting
MLLMs fail at dense counting because they cannot map language numbers to visual entities one‑to‑one.
Coarse‑grained counting: the model first parses intent, then performs batch grounding (boxing all candidates) and finally sums the detections. In a team‑photo example the model boxes 25 people at once and verifies the count.
Fine‑grained counting: based on GQA scene graphs, construct attribute‑constrained queries (e.g., "how many bears are on the ground?"). The model enumerates each candidate and discards those violating the attribute.
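The fine‑grained flow reduces to "enumerate grounded candidates, filter by attribute, count". A minimal sketch, where the candidate/attribute schema is hypothetical rather than the dataset's real format:

```python
# Enumerate grounded candidates and keep only those satisfying the
# attribute constraint; the dict schema here is illustrative.
def count_with_constraint(candidates, attribute, value):
    kept = [c for c in candidates if c.get(attribute) == value]
    return len(kept), kept

bears = [
    {"label": "bear", "surface": "ground"},
    {"label": "bear", "surface": "rock"},
    {"label": "bear", "surface": "ground"},
]
n, kept = count_with_constraint(bears, "surface", "ground")
print(n)  # 2
```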
3.2 Spatial Reasoning & General VQA
Using GQA and CLEVR, the authors construct multi‑hop logical queries such as "does a purple rubber object of the same size as the gray metal sphere exist?". Each reasoning step is anchored with a token sequence <|ref|>...<|/ref|><|box|>...<|/box|> that ties the mentioned object to image coordinates, preventing semantic drift.
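These interleaved anchors are easy to recover from a reasoning trace with a regular expression. The parser below assumes a coordinate payload of the form `[[x1,y1,x2,y2]]`; the exact payload format is not given in this summary, so treat it as an illustrative guess.

```python
import re

# Assumed payload format: <|ref|>phrase<|/ref|><|box|>[[x1,y1,x2,y2]]<|/box|>
TOKEN_RE = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|box\|>\[\[(\d+),(\d+),(\d+),(\d+)\]\]<\|/box\|>"
)

def extract_anchors(trace):
    """Return (phrase, (x1, y1, x2, y2)) pairs from a reasoning trace."""
    return [(m.group(1), tuple(int(v) for v in m.group(2, 3, 4, 5)))
            for m in TOKEN_RE.finditer(trace)]

trace = ("The <|ref|>gray metal sphere<|/ref|>"
         "<|box|>[[120,80,200,160]]<|/box|> sits left of ...")
print(extract_anchors(trace))  # [('gray metal sphere', (120, 80, 200, 160))]
```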
3.3 Maze Navigation
To test topological reasoning, the authors generate rectangular, circular, and hexagonal mazes with DFS, Prim and Kruskal algorithms, inserting deliberately unsolvable sections. The model records each step with <|point|>[[x,y]]<|/point|>, producing a human‑like trial‑and‑error DFS trace.
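A point‑token trace of this form can be replayed step by step, which is what makes the trial‑and‑error behavior inspectable. A minimal sketch, assuming one `<|point|>[[x,y]]<|/point|>` token per step:

```python
import re

POINT_RE = re.compile(r"<\|point\|>\[\[(\d+),(\d+)\]\]<\|/point\|>")

def replay(trace):
    """Extract visited cells from a maze trace and flag revisited cells,
    which indicate backtracking after a dead end."""
    path, seen, revisits = [], set(), []
    for m in POINT_RE.finditer(trace):
        cell = (int(m.group(1)), int(m.group(2)))
        if cell in seen:
            revisits.append(cell)
        seen.add(cell)
        path.append(cell)
    return path, revisits

t = ("<|point|>[[0,0]]<|/point|><|point|>[[0,1]]<|/point|>"
     "<|point|>[[0,0]]<|/point|>")
print(replay(t))  # ([(0, 0), (0, 1), (0, 0)], [(0, 0)])
```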
3.4 Path Tracing
In tangled Bézier curves the model must follow a specified line to its endpoint. The challenge is disambiguating intersections; the model must rely on local geometric continuity rather than color cues. The reasoning chain records a dense coordinate sequence at crossings and a sparse sequence on straight segments.
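The continuity heuristic at a crossing can be illustrated in a few lines: continue along whichever branch deviates least from the incoming heading. This is a toy geometric sketch, not the model's actual decision procedure.

```python
import math

def pick_branch(prev_pt, cur_pt, branches):
    """At an intersection, choose the next point that best preserves
    the local direction of travel (smallest angular deviation)."""
    heading = math.atan2(cur_pt[1] - prev_pt[1], cur_pt[0] - prev_pt[0])

    def deviation(p):
        angle = math.atan2(p[1] - cur_pt[1], p[0] - cur_pt[0])
        d = abs(angle - heading)
        return min(d, 2 * math.pi - d)  # wrap to [0, pi]

    return min(branches, key=deviation)

# Moving right through (1, 0): the near-straight branch beats the sharp turn.
print(pick_branch((0, 0), (1, 0), [(2, 0.1), (1, 1)]))  # (2, 0.1)
```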
4. Reward Model Design: Teaching RL to “Understand” Visual Reasoning
During the Specialized RL stage, task‑specific Accuracy reward models are introduced.
Counting: exponential‑decay reward based on relative error, $R = \alpha \cdot \exp(-\beta \cdot e_{\mathrm{rel}})$, where $e_{\mathrm{rel}}$ is the relative counting error.
Spatial Reasoning / VQA: an LLM‑based GRM scores both the reasoning chain and the final answer and averages them.
Maze Navigation: four‑dimensional weighting of causal exploration progress (cut off at the first wall hit), completeness (for unsolvable mazes), wall‑hit penalty and final path validity.
Path Tracing: bidirectional trajectory alignment, endpoint precision and a continuity penalty (no jumps).
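The counting reward above is simple enough to sketch directly. The summary does not give $\alpha$ or $\beta$, so the values below are placeholders; an exact count returns the full reward $\alpha$ and the reward decays smoothly as the relative error grows.

```python
import math

def counting_reward(predicted, target, alpha=1.0, beta=2.0):
    """R = alpha * exp(-beta * relative_error).
    alpha and beta are placeholder hyperparameters, not from the paper."""
    rel_err = abs(predicted - target) / max(target, 1)
    return alpha * math.exp(-beta * rel_err)

print(counting_reward(25, 25))            # 1.0 (exact count)
print(round(counting_reward(20, 25), 2))  # 0.67 (20% relative error)
```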
5. Quantitative Results
Counting: Pixmo‑Count reaches 89.2 % accuracy, surpassing all competitors; on CountQA the model attains RA@10 = 74.1 %, second only to Gemini‑3‑Flash.
Spatial Reasoning: DS_Spatial_Reasoning scores 98.7 %, well ahead of Claude (97.2 %) and Qwen3‑VL (96.8 %).
Topological Reasoning: DS_Maze_Navigation achieves 66.9 % (next best 50.6 %); DS_Path_Tracing reaches 56.7 % (next best 46.5 %).
6. Qualitative Analysis: How Visual Primitives Reshape the Reasoning Experience
6.1 Bounding Boxes as Primitives
World‑knowledge fusion: when shown a photo of the Golden Gate Bridge, the model boxes the bridge, links it to San Francisco, and correctly answers that the nearby NBA team is the Golden State Warriors.
Counterfactual reasoning: for the question "which side of the scale is heavier?", the model boxes both pans and uses the visible tilt angle to overturn the intuitive answer.
Actionable suggestions: for "how to make a latte", the model boxes the espresso machine, steam wand, milk pitcher, coffee beans and cup, then provides step‑by‑step instructions with spatial coordinates.
6.2 Points as Primitives
In maze navigation and path tracing, the model outputs sequences of points that form a visualizable reasoning path. Humans can replay these coordinates to see where the model branched, hit dead ends, or back‑tracked, offering interpretability unavailable to pure language CoT.
Original paper: https://github.com/deepseek-ai/Thinking-with-Visual-Primitives/blob/main/Thinking_with_Visual_Primitives.pdf